Press "Enter" to skip to content

Category: Spark

The Transaction Log in Delta Tables

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark:

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order so that the next commit is written out as 000001.json, the following as 000002.json, and so on.

It’s interesting that they chose JSON instead of a binary transaction log like relational databases use.

Comments closed

Spark for .NET Developers

Ed Elliott has a long-form post covering spark-dotnet:

The .NET driver is made up of two parts, and the first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part of the .NET driver runs in the process and acts as a proxy between the .NET code and .NET Java classes (from the JAR file) which then translate the requests into Java requests in the Java VM which hosts Spark.

The .NET driver is added to a .NET program using NuGet and ships both the .NET library as well as two Java jars. One jar is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one on your installed version of Scala.

As much as I’ve enjoyed his series, getting it in a single-post format is great.

Comments closed

Debugging Spark Applications in Visual Studio

Ed Elliott continues a series on spark-dotnet:

There are two approaches, one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached – think apps which spawn other processes and they fail in the startup routine. You can add a Debugger.Launch() to your program then when spark executes it, a prompt will be displayed and you can attach Visual Studio to your program. (as an aside I used to do this manually a lot by writing an __asm int 3 into an app to get it to break at an appropriate point, great memories but we don’t need to do that anymore luckily :).

The second approach is to start the spark-dotnet driver in debug mode which instead of launching your app, it starts and listens for incoming requests – you can then run your program as normal (f5), set a breakpoint and your breakpoint will be hit.

Read on to see how it’s done, as well as a possibly-accidental benefit to this.

Comments closed

Performance Tuning Neural Network Training

Sean Owen takes us through a few techniques for speeding up neural network model training:

Step #2: Use Early Stopping
Keras (and other frameworks) have built-in support for stopping when further training appears to be making the model worse. In Keras, it’s the EarlyStopping callback. Using it means passing the validation data to the training process for evaluation on every epoch. Training will stop after several epochs have passed with no improvement. restore_best_weights=True ensures that the final model’s weights are from its best epoch, not just the last one. This should be your default.

Sean focuses here on Keras + TensorFlow on Spark, but several of the tips are cross-product and generally applicable.

Comments closed

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:

As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”

Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.

Read the whole thing.

Comments closed

The Databricks File System

Brad Llewellyn takes us through the Azure Databricks File System:

Today, we’re going to talk about the Databricks File System (DBFS) in Azure Databricks.  If you haven’t read the previous posts in this series, IntroductionCluster Creation and Notebooks, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let’s move on to the core of this post, DBFS.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks, Clusters, Code and Data.  For this post, we’re going to talk about the storage layer underneath Azure Databricks, DBFS.  Since Azure Databricks manages Spark clusters, it requires an underlying Hadoop Distributed File System (HDFS).  This is exactly what DBFS is.  Basically, HDFS is the low cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work.  We may dig deeper into HDFS in a later post.  For now, you can read more about HDFS here and here.

Click through for more detail on DBFS.

Comments closed

Parsing Rows Manually with Spark .NET

Ed Elliott shows how we can solve a challenging problem when newlines are in the wrong place:

So the first thing we need to do is to read in the whole file in one chunk, if we just do a standard read the file will get broken into rows based on the newline character:

var file = spark.Read().Option("wholeFile", true).Text(@"C:\git\files\newline-as-data.txt");

This solution is a bit complex. As Ed points out, you’re better off reshaping the file before you try to process it. If it’s a structured file like the example Ed has, a regular expression can do the trick.

Comments closed

Dimensional Load with Databricks

Leo Furlong shows how we can load an Azure SQL Data Warehouse dimension with Databricks:

Ingesting data into the Data Lake occurs in steps 1 and 2 in our architecture.  Azure Data Factory (ADF) provides an excellent mechanism for loading data from source applications into a Data Lake stored in Azure Data Lake Store Gen2.  In fact, Microsoft offers a template in the ADF Template gallery which provides a metadata driven approach for doing so.  The template comes with a control table example in a SQL Server Database, a data source dataset and a data destination dataset.  More on this template can be found here in the official documentation.

I appreciate that this is a full walkthrough of the process, not just one step.

Comments closed

ClassNotFoundException and .NET Spark

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.

To fix the issue you need to use the new package name which adds an extra dotnet near the end, change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what you should change this line of code to read. If that change doesn’t fix your problem, Ed has a broader solution.

Comments closed

Deploying a Big Data Cluster

Mohammad Darab takes us through the Big Data Cluster deployment process using Azure Data Studio:

I’ve been “playing around” with Big Data Clusters for some time now and CTP 3.2 is way ahead when it comes to streamlining the BDC deployment process. You can check out my 4-part series on deploying BDC on AKS to see how cumbersome the process used to be. New in CTP 3.2, you can deploy a BDC on AKS (an existing cluster OR a new cluster) using an Azure Data Studio notebook. Let’s see how.

Click through for instructions. It was rather smart of Microsoft to release the instructions as a notebook.

Comments closed