Press "Enter" to skip to content

Category: Spark

Push-Based Shuffle in Apache Spark 3.2 via Project Magnet

Venkata Krishnan Sowrirajan and Min Shen announce that Project Magnet will be in Apache Spark 3.2:

Push-based shuffle is an implementation of shuffle where the shuffle blocks are pushed to the remote shuffle services from the mapper tasks in order to address shuffle scalability and reliability issues. In a nutshell, with push-based shuffle, a large number of small, random reads is converted into a small number of large, sequential reads, which significantly improves disk I/O efficiency and shuffle data locality.

This is explained in greater detail in an earlier blog post, Magnet: A scalable and performant shuffle architecture for Apache Spark, which you can read for more information about how we achieve push-based shuffle.

Read on to see when this matters and how you can make use of it once you’re in Spark 3.2 (whose first release was exactly two weeks ago, October 13th).
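If you want to try it out, enabling push-based shuffle is mostly a configuration exercise. Here's a minimal sketch in PySpark, assuming a YARN cluster (the only supported resource manager for this in 3.2) whose external shuffle services have already been set up for block merging; the application name is illustrative:

```python
from pyspark.sql import SparkSession

# Minimal sketch: opting in to push-based shuffle on the client side.
# Assumes the node managers' external shuffle services were restarted with
# spark.shuffle.push.server.mergedShuffleFileManagerImpl set to
# org.apache.spark.network.shuffle.RemoteBlockPushResolver.
spark = (
    SparkSession.builder
    .appName("push-shuffle-demo")  # illustrative name
    .config("spark.shuffle.service.enabled", "true")  # external shuffle service is a prerequisite
    .config("spark.shuffle.push.enabled", "true")     # enables push-based shuffle for this app
    .getOrCreate()
)
```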


SQL User-Defined Functions in Spark SQL

Serge Rielau and Allison Wang announce a new type of user-defined function in Spark SQL:

SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer, thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.
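To give a sense of the syntax from the announcement, here's a minimal sketch of a scalar SQL UDF, wrapped in spark.sql() calls so the example is self-contained and assuming a runtime that supports the new syntax; the function name and body are my own illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The function body is plain SQL, so the optimizer can see straight through it.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION to_fahrenheit(celsius DOUBLE)
    RETURNS DOUBLE
    RETURN celsius * 9.0 / 5.0 + 32.0
""")

# Reuse it like any built-in function.
spark.sql("SELECT to_fahrenheit(20.0) AS fahrenheit").show()
```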

I look forward to dealing with cardinality issues and performance tuning these things in 5 years.


Session Windows in Spark Structured Streaming

Jungtaek Lim, et al., announce support for session windows in Spark Structured Streaming:

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.

Sliding windows are similar to tumbling windows in that they are fixed-sized, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in that case an input can be bound to multiple windows.

Session windows have a different characteristic compared to the previous two types. A session window has a dynamic window length, depending on the inputs. A session window starts with an input and expands itself if a following input is received within the gap duration. A session window closes when no input is received within the gap duration after the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).
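The new API slots in next to the existing tumbling/sliding window function. A minimal sketch, assuming a streaming DataFrame named events with an eventTime timestamp column and a userId column (both names are illustrative):

```python
from pyspark.sql import functions as F

# Group rows into per-user sessions that close after 5 minutes of inactivity.
sessionized = (
    events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(F.session_window("eventTime", "5 minutes"), F.col("userId"))
    .count()
)
```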

Click through for more details. You could implement session windows when querying existing data using a gaps and islands approach (where you increment the island count when you have a lagged difference greater than the cutoff point), but for streaming scenarios, it’s very handy to have this as a native window type.
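For comparison, here's a sketch of that gaps-and-islands approach over batch data, using the same illustrative events DataFrame and a 5-minute (300-second) cutoff:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("userId").orderBy("eventTime")
prev = F.lag("eventTime").over(w)

islands = (
    events
    # Flag a new island when there is no prior event or the gap exceeds the cutoff.
    .withColumn(
        "new_session",
        F.when(
            prev.isNull()
            | ((F.col("eventTime").cast("long") - prev.cast("long")) > 300),
            1,
        ).otherwise(0),
    )
    # A running sum of the flags numbers the islands, i.e. the sessions.
    .withColumn("session_id", F.sum("new_session").over(w))
)
```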


Creating Delta Lake Tables in Azure Databricks

Gauri Mahajan takes us through creating new tables in a Delta Lake using Azure Databricks:

Delta Lake is an open-source data format that provides ACID transactions, data reliability, query performance, data caching and indexing, and many other benefits. Delta Lake can be thought of as an extension of existing data lakes and can be configured per the data requirements. Azure Databricks has a Delta engine as one of its core components, which facilitates the Delta Lake format for data engineering and performance. The Delta Lake format is used to create modern data lake or lakehouse architectures. It is also used to build a combined streaming and batch architecture, popularly known as the lambda architecture.
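As a quick taste of what the article walks through, here's a minimal sketch of creating a Delta table via SQL and via the DataFrame API; the table and column names are illustrative, and spark and df are assumed to exist (as they would in a Databricks notebook):

```python
# SQL: create an empty managed table stored in Delta format.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        id INT,
        amount DOUBLE,
        sale_date DATE
    ) USING DELTA
""")

# DataFrame API: write an existing DataFrame out as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("sales_from_df")
```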

Click through for the process.


Architecting a Jenkins Replacement

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user-friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps understand some of the “why” behind the solution.


Databricks Integration with Git Repos

Ka-Hing Cheung and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps, and UIs simply to review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.


New in SQL Server Big Data Clusters

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It runs exclusively on Linux containers, orchestrated by Kubernetes, and can be deployed across multiple cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.


pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:
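That one-line change is swapping the pandas import for its Spark-backed counterpart; the file path below is illustrative:

```python
# Before: import pandas as pd
import pyspark.pandas as ps  # the pandas API on Spark, new in 3.2

# Familiar pandas calls now execute distributed across the cluster.
psdf = ps.read_csv("data.csv")
print(psdf.head())
```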

Click through to see more details on the change.
