Category: Spark

Diving into Spark Streaming

Published 2021-12-20 by Kevin Feasel

Tomaz Kastrun continues a series on Spark and is well into a section on Spark Streaming. Part 17 looks at watermarks:

Streaming data is considered as continuously ingested data with particular frequency and latency. It is considered “big data” and data that has no discrete beginning nor end.
The primary goal of any real-time stream processing system is to process the streaming data within a window frame (or considered this as frequency). Usually this frequency is “as soon as it arrives”. On the other hand, latency in streaming processing model is considered to have the means to work or deal with all the possible latencies (one second or one minute) and provides an end-to-end low latency system. If frequency of data analysing is on user’s side (destination), latency is considered on the device’s side (source).

Part 18 enumerates the supported types of windows:

Tumbling windows are fixed sized and static. They are non-overlapping and are contiguous intervals. Every ingested data can be (must be) bound to a singled window.
Sliding windows are also fixed sized and also static. Windows will overlap when the duration of the slide is smaller than the duration of the window. Ingested data can therefore be bound to two or more windows
Session windows are dynamic in size of the window length. The size depends on the ingested data. A session starts with an input and expands if the following input expands if the next ingested record has fallen within the gap duration.

Part 19 includes good information on how data engineers can work with streams of data:

Streaming data can be used in conjunction with other datasets. You can have Joining streaming data, joining data with watermarking, deduplication, outputting the data, applying foreach logic, using triggers and creating Stream API Tables.
All of the functions are available in Python, Scala and Java and some are not available with R. We will be focusing on Python and R.

Check out all three of these posts.

Comments closed

Trying out Spark Streaming

Published 2021-12-17 by Kevin Feasel

Tomaz Kastrun continues a series on Spark. Part 15 provides an introduction to Spark Streaming:

Spark Streaming or Structured Streaming is a scalable and fault-tolerant, end-to-end stream processing engine. it is built on the Spark SQL engine. Spark SQL engine will is responsible for running results sets for streaming data, regardless of static or continuously in coming stream data.
Spark stream can use Dataframe (or Datasets) API in Scala, Python, R or Java to work on handling data ingest, creating streaming analytics and do all the computations. All these requests and workloads are done against Spark SQL engine.

I don’t think I’ve ever seen an example of using Spark Streaming in R, so that one’s new to me.

Part 16 looks at DataFrame operations in Spark Streaming:

When working with Spark Streaming from file based ingestion, user must predefine the schema. This will require not only better performance but consistent data ingest for streaming data. There is always possibility to set the spark.sql.streaming.schemaInference to true to enable Spark to infer schema on read or automatically.

Check out both of those posts.

Comments closed

Spark SQL Bucketing and Query Tuning

Published 2021-12-15 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark. Part 13 looks at bucketing and partitioning in Spark SQL:

Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between them is how they split the data.

Part 14 covers query hints:

This hint instructs Spark to use the hinted strategy on specified relation when joining tables together. When BROADCASTJOIN hint is used on Data1 table with Data2 table and overrides the suggested setting of statistics from configuration spark.sql.autoBroadcastJoinThreshold.
Spark also prioritise the join strategy, and also when different JOIN strategies are used, Spark SQL will always prioritise them.

Be sure to check those out.

Comments closed

Spark DataFrames

Published 2021-12-14 by Kevin Feasel

Tomaz Kastrun continues a series on working with Apache Spark. Part 10 looks at the DataFrame construct:

We have looked in datasets and seen that a dataset is distributed collection of data. A dataset can be constructer from JVM object and later manipulated with transformation operations (e.g.: filter(), Map(),…). API for these datasets are also available in Scala and in Java. But in both cases of Python and R, you can also access the columns or rows from datasets.
On the other hand, dataframe is organised dataset with named columns. It offers much better optimizations and computations and still resembles a typical table (as we know it from database world). Dataframes can be constructed from arrays or from matrices from variety of files, SQL tables, and datasets (RDDs). Dataframe API is available in all flavours: Java, Scala, R and Python and hence it’s popularity.

Part 11 looks at external R and Python packages and DataFrame support:

When you install Spark, the extension of not only languages but also other packages, systems is huge. For example with R, not only that you can harvest the capabilities of distributed and parallel computations, you can also extend the use of R language.

Part 12 gets into Spark SQL:

Spark SQL is a one of the Spark modules for structured data processing and analysing. Spark provides Spark SQL and also API for execution of SQL queries. Spark SQL can read data from Hive instance, but also from datasets and dataframe. The communication between Spark SQL and execution engine will always result in a dataset or datafrane.
These formats are interchangeable. So interacting with SQL against result from a different API is possible, respectively. Plugging in the Java JDBD or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.
With API unification, user can access Spark SQL using Scala spark-shell, using Python pyspark or using R sparkR shell.

DataFrames are so popular that they’ve become the de facto standard for working with data in Spark, and .NET languages only work with DataFrames, not with the raw RDDs.

Comments closed

Learning about RDDs in Spark

Published 2021-12-13 by Kevin Feasel

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For R language, sparklyr package is availble and for Python pyspark is availble.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). RDD is a fault-tolerant collection of files that can be used in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in driver program
– referencing a datasets in external storage (HDFS, blob storage, shared filesystem, Hadoop InputFormat,…)
In a simple way, Spark RDD has two opeartions:
– transformations – creating a new RDD dataset on top of already existing one with the last transformation
– actions – to the action, and return a value to the driver program after running a computation on the dataset.

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDD; transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but now show or return anything. We can say, that transformations are lazy because of updating existing RDD, these operations create another RDD. Actions on the other hand trigger the computations on RDD and show (return) the result of transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.

Comments closed

Deploying dbt on Databricks

Published 2021-12-09 by Kevin Feasel

Dave Eyler, et al, have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.
dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.

Comments closed

IDE Options for Spark

Published 2021-12-07 by Kevin Feasel

Tomaz Kastrun shares some thoughts on IDEs for writing Spark code:

Let’s look into the IDE that can be used to run Spark.
Remember that Spark can be used with languages: Scala, Java, R, Python and each give you different IDE and different installations.

To Tomaz’s list, I’d add IntelliJ IDEA for Scala programming. I was never much a fan of Eclipse, but there are people who swear by it.

Comments closed

Using Scala at Databricks

Published 2021-12-07 by Kevin Feasel

Li Haoyi gives us a peek behind the curtain:

With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes around managing our large Scala codebase. From this post, you’ll learn about everything big and small that goes into making Scala at Databricks work, a useful case study for anyone supporting the use of Scala in a growing organization.

It’s always interesting to see how the largest companies handle certain classes of problems. From this post, we can get an idea of the high-level requirements and usage, making it worth the read.

Comments closed

Getting Around in Spark

Published 2021-12-06 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.
We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:
/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Installing Apache Spark

Published 2021-12-03 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark:

Installing Apache Spark on Windows computer will require preinstalled Java JDK (Java Development Kit). Java 8 or later version, with current version 17. On Oracle website, download the Java and install it on your system. Easiest way is to download the x64 MSI Installer. Install the file and follow the instructions. Installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”.

Read on for instructions for both Windows and MacOS. You can also create a container running Spark, which is another helpful method.

Comments closed