Press "Enter" to skip to content

Category: Spark

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.

Comments closed

Generating Artificial Data in Databricks

Ben Hazan needs some fake data:

While attending the SQLBits 2023, I took part in André Kamman’s session about “Generate test data quick, easy and lots of it with the Databricks Labs Data Generator”.

In this blog, I will share with you my insights about the DataBricks Data Generator library and I’ll give an example.

Synthetic data is a valuable resource for data scientists, engineers, and analysts who need to test, benchmark, or demonstrate their solutions without compromising sensitive or confidential information. However, generating realistic and relevant synthetic data can be challenging and time-consuming.

That’s why Databricks Labs has developed a Python library called dbldatagen that can help you create large-scale synthetic data sets using Spark.

Click through to learn more about the library and see how you can use to to generate arbitrary amounts of artificial data following certain constraints.

Comments closed

Databricks Power Tools in VS Code

Gerhard Brueckl has some tools for us:

As you probably know, we at paiqo have developed our Databricks extension for VSCode over the last years and are constantly adding new features and improving user experience. The most notable features are probably the execution of local notebooks against a Databricks cluster, a nice UI to manage clusters, jobs, secrets, repos, etc. and last but not least also a browser for your workspace and DBFS to sync files locally.

In February 2023 Databricks also published its own official VSCode extension which was definitely long awaited by a lot of customers (blogextension). It allows you to run a local file on a Databricks cluster and display the results in VSCode again. Alternatively you can also run the code as a workflow. I am sure we can expect much more features in the near future and Databricks investing in local IDE support is already a great step forward!

As you can imagine, I am working very closely with the people at Databricks and we are happy to also announce the next major release of our Databricks VSCode extension 2.0 which now also integrates with the official Databricks extension! To avoid confusion between the two extensions we also renamed ours to Databricks Power Tools so from now on you will see two Databricks icons on the very left bar in VSCode.

Click through to read more in the announcement and some of the things which have changed as a result of version 2.0.

Comments closed

The Legacy of Big Data

Adam Bellemare looks back:

Big Data was going to change the way everything worked. We were about to solve every financial, medical, scientific, and social problem known to humankind. All it would take was a great big pile of data and some way to process it all. 

But somewhere along the line, the big data revolution just sort of petered out, and today you barely hear anything about big data. 

Click through for Adam’s explanation, which is a more detailed form of “Some stuff worked out and became ubiquitous in other ways; others fell off the map.”

But I’m going to snag one more quotation here from Adam:

And finally, big data has shown us that no matter how hard we try, there’s simply no escaping from the inevitable convergence to a full SQL API.

Me: Laughs in Feasel’s Law.

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

1 Comment

Spark Application Dependency Caching

Shu Wang, Biao He, and Minchu Yang talk turkey about dependencies:

In this blog post, we will share our analysis of Spark Dependency Management at LinkedIn, highlight interesting findings, and present our design choice of using a simple user-level cache over more complex alternatives. We will also discuss our rollout experience and lessons learned. Finally, we will demonstrate the impact of accelerating all Spark applications at LinkedIn at the cluster level. Faster Spark jobs translate to increased data freshness, leading to an enhanced member experience by way of more relevant recommendations, timely insights, effective abuse protection, and other similar improvements.

If you work with Spark to any serious extent, you’ll want to read this post.

Comments closed

Unit Testing Spark Notebooks in Synapse

Arun Sethia grabs the oscilloscope:

In this blog post, we will cover how to test and create unit test cases for Spark jobs developed using Synapse Notebook. This is an extension of my previous blog, Synapse – Choosing Between Spark Notebook vs Spark Job Definition, where we discussed selecting between Spark Notebook and Spark Job Definition. Unit testing is an automated approach that developers use to test individual self-contained code units. By verifying code behavior early, it helps to streamline coding practices for larger systems.

Arun covers three major use cases: when your code is in an external library, when it is in a separate notebook, and when it is in the same notebook.

Comments closed

Spark Structured Streaming with Synapse

Ryan Adams builds a demo:

In this post we are going to look at an example of streaming IoT temperature data in Synapse Spark.  I have an IoT device that will stream temperature data from two sensors to IoT hub. We’ll use Synapse Spark to process the data, and finally write the output to persistent storage. Here is what our architecture will look like: 

Click through for the architectural diagram and step-by-step on how to put the demo together.

Comments closed

Parallel Loading in Spark Notebooks

Dustin Vannoy answers some questions:

I received many questions on my tutorial Ingest tables in parallel with an Apache Spark notebook using multithreading. In this video and post I address some of the questions that I couldn’t just answer in the YouTube comments. Watch the video for more complete answers but here are quick responses with links to examples where appropriate.

Click through for the video and some text versions. Dustin includes examples for Synapse and Databricks.

Comments closed

Executing Multiple Notebooks in one Spark Pool with Genie

Shalu Ganotra Chadra, et al, explain what Synapse Genie is:

The Genie framework is a metadata driven utility written in Python. It is implemented using threading (ThreadPoolExecutor module) and directed acyclic graph (Networkx library). It consists of a wrapper notebook, that reads metadata of notebooks and executes them within a single Spark session. Each notebook is invoked on a thread with MSSparkutils.run() command based on the available resources in the Spark pool. The dependencies between notebooks are understood and tracked through a directed acyclic graph.

Read on for more information about how you can use it and what the setup process looks like.

Comments closed