Spark – Page 43 – Curated SQL

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.

Comments closed

Paired RDDs in Spark

Published 2019-09-04 by Kevin Feasel

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, assuming that you have a fair idea about what Spark is and the basics of RDDs. Paired RDD is one of the kinds of RDDs. These RDDs contain the key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.

Comments closed

Checking Spark Config on Windows

Published 2019-08-29 by Kevin Feasel

Ed Elliott has a Powershell script to tell you if your Spark configuration on Windows is incorrect:

There are some pretty common mistakes people make (myself included!), most common I have seen recently have been having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.
To help, I have written a small powershell script that a) validates that the setup is correct and then b) runs one of the spark examples to prove that everything is setup correctly.

Click through for the script.

Comments closed

Working with Tables in Databricks

Published 2019-08-27 by Kevin Feasel

Brad Llewellyn shows us how to build tables (temporary and permanent) and views in Azure Databricks using each of the main languages:

Simply put, an External Table is a table built directly on top of a folder within a data source. This means that the data is not hidden away in some proprietary SQL format. Instead, the data is completely accessible to outside systems in its native format. The main reason for this is that it gives us the ability to create “live” queries on top of text data sources. Every time a query is executed against the table, the query is run against the live data in the folder. This means that we don’t have to run ETL jobs to load data into the table. Instead, all we need to do is put the structured files in the folder and the queries will automatically surface the new data.

Each language has its own way of doing things, but they all use the Hive metastore under the covers.

Comments closed

Configuring an Azure Databricks Cluster

Published 2019-08-26 by Kevin Feasel

Jon Gurgul explains cluster settings in Azure Databricks:

We need compute to run our notebooks and this is achieved by creating a cluster. A cluster is merely a number of Virtual Machines behind the scenes used to form this compute resource. The benefit of Azure Databricks is that compute is only chargeable when on.
Let’s now click the Clusters icon and set up a simple cluster. Once you have loaded the page you can use the “Create Cluster” button.

Click through for an explanation of what each of the settings means.

Comments closed

The Transaction Log in Delta Tables

Published 2019-08-23 by Kevin Feasel

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark:

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order so that the next commit is written out as 000001.json, the following as 000002.json, and so on.

It’s interesting that they chose JSON instead of a binary transaction log like relational databases use.

Comments closed

Spark for .NET Developers

Published 2019-08-23 by Kevin Feasel

Ed Elliott has a long-form post covering spark-dotnet:

The .NET driver is made up of two parts, and the first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part of the .NET driver runs in the process and acts as a proxy between the .NET code and .NET Java classes (from the JAR file) which then translate the requests into Java requests in the Java VM which hosts Spark.
The .NET driver is added to a .NET program using NuGet and ships both the .NET library as well as two Java jars. One jar is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one on your installed version of Scala.

As much as I’ve enjoyed his series, getting it in a single-post format is great.

Comments closed

Debugging Spark Applications in Visual Studio

Published 2019-08-22 by Kevin Feasel

Ed Elliott continues a series on spark-dotnet:

There are two approaches, one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached – think apps which spawn other processes and they fail in the startup routine. You can add a Debugger.Launch() to your program then when spark executes it, a prompt will be displayed and you can attach Visual Studio to your program. (as an aside I used to do this manually a lot by writing an __asm int 3 into an app to get it to break at an appropriate point, great memories but we don’t need to do that anymore luckily :).
The second approach is to start the spark-dotnet driver in debug mode which instead of launching your app, it starts and listens for incoming requests – you can then run your program as normal (f5), set a breakpoint and your breakpoint will be hit.

Read on to see how it’s done, as well as a possibly-accidental benefit to this.

Comments closed

Performance Tuning Neural Network Training

Published 2019-08-20 by Kevin Feasel

Sean Owen takes us through a few techniques for speeding up neural network model training:

Step #2: Use Early Stopping
Keras (and other frameworks) have built-in support for stopping when further training appears to be making the model worse. In Keras, it’s the EarlyStopping callback. Using it means passing the validation data to the training process for evaluation on every epoch. Training will stop after several epochs have passed with no improvement. restore_best_weights=True ensures that the final model’s weights are from its best epoch, not just the last one. This should be your default.

Sean focuses here on Keras + TensorFlow on Spark, but several of the tips are cross-product and generally applicable.

Comments closed

Machine Learning and Delta Lake

Published 2019-08-19 by Kevin Feasel

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:

As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”
Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.

Read the whole thing.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark

Null Checks in Spark DataFrames

Paired RDDs in Spark

Checking Spark Config on Windows

Working with Tables in Databricks

Configuring an Azure Databricks Cluster

The Transaction Log in Delta Tables

Spark for .NET Developers

Debugging Spark Applications in Visual Studio

Performance Tuning Neural Network Training

Machine Learning and Delta Lake