Press "Enter" to skip to content

Category: Spark

Develop BDC PySpark Jobs in Visual Studio Code

Jenny Jiang announces a new capability in Visual Studio Code:

With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run current linerun selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform a notebook like query including Run Cell, Run Above, or Run Below. You can also enjoy a notebook like interactive experience that includes your source code and markdown comments along with the running results and output. You can remove the unneeded sections, enter comments, or type additional code in the interactive results window. Moreover, you can visualize your results in a graphic format through a matplotlib like Jupyter Notebook. The integration with SQL Server 2019 Big Data Clusters empowers you to quickly submit a PySpark batch job to the big data cluster and monitor job progress.

This is rather useful for developers, though I greatly prefer the Azure Data Studio notebook interface.

Comments closed

Null Checks in Spark DataFrames

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.

Comments closed

Paired RDDs in Spark

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, assuming that you have a fair idea about what Spark is and the basics of RDDs. Paired RDD is one of the kinds of RDDs. These RDDs contain the key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.

Comments closed

Checking Spark Config on Windows

Ed Elliott has a Powershell script to tell you if your Spark configuration on Windows is incorrect:

There are some pretty common mistakes people make (myself included!), most common I have seen recently have been having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.

To help, I have written a small powershell script that a) validates that the setup is correct and then b) runs one of the spark examples to prove that everything is setup correctly.

Click through for the script.

Comments closed

Working with Tables in Databricks

Brad Llewellyn shows us how to build tables (temporary and permanent) and views in Azure Databricks using each of the main languages:

Simply put, an External Table is a table built directly on top of a folder within a data source.  This means that the data is not hidden away in some proprietary SQL format.  Instead, the data is completely accessible to outside systems in its native format.  The main reason for this is that it gives us the ability to create “live” queries on top of text data sources.  Every time a query is executed against the table, the query is run against the live data in the folder.  This means that we don’t have to run ETL jobs to load data into the table.  Instead, all we need to do is put the structured files in the folder and the queries will automatically surface the new data.

Each language has its own way of doing things, but they all use the Hive metastore under the covers.

Comments closed

Configuring an Azure Databricks Cluster

Jon Gurgul explains cluster settings in Azure Databricks:

We need compute to run our notebooks and this is achieved by creating a cluster. A cluster is merely a number of Virtual Machines behind the scenes used to form this compute resource. The benefit of Azure Databricks is that compute is only chargeable when on.

Let’s now click the Clusters icon and set up a simple cluster. Once you have loaded the page you can use the “Create Cluster” button.

Click through for an explanation of what each of the settings means.

Comments closed

The Transaction Log in Delta Tables

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark:

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order so that the next commit is written out as 000001.json, the following as 000002.json, and so on.

It’s interesting that they chose JSON instead of a binary transaction log like relational databases use.

Comments closed

Spark for .NET Developers

Ed Elliott has a long-form post covering spark-dotnet:

The .NET driver is made up of two parts, and the first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part of the .NET driver runs in the process and acts as a proxy between the .NET code and .NET Java classes (from the JAR file) which then translate the requests into Java requests in the Java VM which hosts Spark.

The .NET driver is added to a .NET program using NuGet and ships both the .NET library as well as two Java jars. One jar is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one on your installed version of Scala.

As much as I’ve enjoyed his series, getting it in a single-post format is great.

Comments closed

Debugging Spark Applications in Visual Studio

Ed Elliott continues a series on spark-dotnet:

There are two approaches, one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached – think apps which spawn other processes and they fail in the startup routine. You can add a Debugger.Launch() to your program then when spark executes it, a prompt will be displayed and you can attach Visual Studio to your program. (as an aside I used to do this manually a lot by writing an __asm int 3 into an app to get it to break at an appropriate point, great memories but we don’t need to do that anymore luckily :).

The second approach is to start the spark-dotnet driver in debug mode which instead of launching your app, it starts and listens for incoming requests – you can then run your program as normal (f5), set a breakpoint and your breakpoint will be hit.

Read on to see how it’s done, as well as a possibly-accidental benefit to this.

Comments closed

Performance Tuning Neural Network Training

Sean Owen takes us through a few techniques for speeding up neural network model training:

Step #2: Use Early Stopping
Keras (and other frameworks) have built-in support for stopping when further training appears to be making the model worse. In Keras, it’s the EarlyStopping callback. Using it means passing the validation data to the training process for evaluation on every epoch. Training will stop after several epochs have passed with no improvement. restore_best_weights=True ensures that the final model’s weights are from its best epoch, not just the last one. This should be your default.

Sean focuses here on Keras + TensorFlow on Spark, but several of the tips are cross-product and generally applicable.

Comments closed