Press "Enter" to skip to content

Category: Spark

Comparing Koalas to PySpark

Tori Tompkins gives us an understanding of where Koalas fits in the Spark world:

One significant difference between Spark’s implementation of DataFrames and pandas’ is immutability.

With Spark DataFrames, you are unable to make changes to the existing object; instead, you create a brand new DataFrame based on the old one. Pandas DataFrames, however, allow you to edit the object in place. Koalas DataFrames, whilst still Spark DataFrames under the hood, have kept the mutable syntax of pandas.

It does this by introducing the concept of an ‘Internal Frame’. This holds the immutable Spark DataFrame and manages the mapping between Koalas column names and Spark column names. It also maps Koalas index names to Spark column names to replicate the index functionality of pandas (covered below). It acts as a bridge between Spark and Koalas by mimicking the pandas API with Spark. The Internal Frame replicates the mutable behavior of pandas by creating a new copy of itself whenever a change is made, while appearing to the user to be mutable.
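To make the difference concrete, here is a minimal sketch contrasting the two styles, assuming the databricks.koalas package is installed alongside PySpark; the column names are illustrative:

```python
# A minimal sketch: PySpark's immutable style versus Koalas' pandas-like,
# mutable-looking syntax. Assumes databricks.koalas is installed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import databricks.koalas as ks

spark = SparkSession.builder.getOrCreate()

# PySpark: every change returns a brand new DataFrame.
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
sdf2 = sdf.withColumn("id_plus_one", F.col("id") + 1)  # sdf itself is untouched

# Koalas: the syntax looks like an in-place edit, as in pandas...
kdf = ks.DataFrame({"id": [1, 2], "label": ["a", "b"]})
kdf["id_plus_one"] = kdf["id"] + 1  # ...but under the hood a new Internal Frame is created

print(kdf.head())
```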

Read the whole thing.

Comments closed

Good Practices when Combining Spark with Cassandra

Valerie Parham-Thompson shares some insights for working with Spark and Cassandra together:

Although we are focusing on Cassandra as the data storage in this presentation, other storage sources and destinations are possible. Another frequently used data storage option is Hadoop HDFS. The previously mentioned spark-cassandra-connector has capabilities to write results to Cassandra, and in the case of batch loading, to read data directly from Cassandra.

Native data output formats available include both JSON and Parquet. The Parquet format in particular is useful for writing to AWS S3. See https://aws.amazon.com/about-aws/whats-new/2018/09/amazon-s3-announces-new-features-for-s3-select/ for more information on querying S3 files stored in Parquet format. A good use case for this is archiving data from Cassandra.
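As a rough illustration of that archiving pattern, here is a hedged PySpark sketch that reads a table through the spark-cassandra-connector and writes it to S3 as Parquet. The host, keyspace, table, and bucket names are placeholders, and the connector package plus S3 credentials are assumed to already be configured on the cluster:

```python
# Hedged sketch: read a Cassandra table via the spark-cassandra-connector and
# archive it to S3 as Parquet. Host, keyspace, table, and bucket are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-archive")
         .config("spark.cassandra.connection.host", "10.0.0.10")  # placeholder host
         .getOrCreate())

events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="sales", table="events")  # placeholder keyspace/table
          .load())

# Parquet is a convenient archive format on S3 (and remains queryable, e.g. via S3 Select).
events.write.mode("append").parquet("s3a://my-archive-bucket/events/")
```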

Read on for more advice.

Comments closed

The Basics of Spark Streaming

Muskan Gupta gives us an introduction to Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources, such as Kafka and Flume, and process it using functions such as map and reduce. After processing, we can push data to filesystems, databases, and even live dashboards.

In Spark Streaming we work on near-real-time data. It divides the received input stream into batches. The Spark engine processes the batches and generates the final output in batches.
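A minimal sketch of that batching model, using the classic DStream socket word count; the host and port are placeholders, and in practice the source would more likely be Kafka or Flume:

```python
# Minimal DStream sketch: count words arriving on a socket in 10-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=10)  # divide the input stream into 10s batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder host/port
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # the engine emits output batch by batch

ssc.start()
ssc.awaitTermination()
```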

Read on to understand the key mechanisms behind Spark Streaming.

Comments closed

Apache Spark Connector for SQL Server

The SQL Server team announces an open-sourced Apache Spark connector for SQL Server:

The Apache Spark Connector for SQL Server and Azure SQL is based on the Spark DataSourceV1 API and SQL Server Bulk API and uses the same interface as the built-in JDBC Spark-SQL connector. This allows you to easily integrate the connector and migrate your existing Spark jobs by simply updating the format parameter! 

This appears to be different from the old Spark connector to Azure SQL Database and SQL Server. Also, for anyone potentially confused between it and PolyBase, this is going in the opposite direction: the Spark connector lets you access a SQL Server from an Apache Spark cluster, reading SQL Server’s data and processing it across a number of executor nodes. By contrast, PolyBase lets you read data stored in Spark SQL tables from SQL Server, virtualizing it so that it looks like a regular SQL Server table.
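As a hedged sketch of what “simply updating the format parameter” looks like from PySpark, assuming the connector JAR is attached to the cluster and using placeholder server, database, table, and credential values:

```python
# Sketch: reading from SQL Server via the connector's format string instead of plain "jdbc".
# Server, database, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")   # instead of .format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=SalesDB")
      .option("dbtable", "dbo.Orders")
      .option("user", "spark_reader")
      .option("password", "<placeholder>")
      .load())

df.groupBy("CustomerId").count().show()  # processing then happens across the executors
```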

Comments closed

Calculating Partitions for Processing Data Files in Apache Spark

Ajay Gupta digs into how to calculate the number of partitions the different Spark APIs use when reading from files:

Until recently, the process of picking a certain number of partitions for a set of data files always looked mysterious to me. However, during a recent optimization exercise, I wanted to change the default number of partitions picked by Spark for processing a set of data files, and that is when I started to decode this process comprehensively, along with proofs. Hopefully, the description of this decoded process will also help readers understand Spark a bit more deeply and enable them to design efficient and optimized Spark routines.

This is important information if you’re tuning Spark cluster performance.
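As a small, hedged experiment in the same vein: the number of input partitions Spark picks for a set of files depends, among other things, on spark.sql.files.maxPartitionBytes, and you can observe the effect directly (the path below is a placeholder):

```python
# Observe how the file-based partition count changes with maxPartitionBytes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_default = spark.read.parquet("/data/events/")  # placeholder path
print(df_default.rdd.getNumPartitions())

# Shrinking the max partition size generally yields more, smaller partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))  # 32 MB
df_smaller = spark.read.parquet("/data/events/")
print(df_smaller.rdd.getNumPartitions())
```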

Comments closed

Window Functions in Spark SQL

Juoko Virtanen walks us through window functions in Spark SQL:

When you think of windows in Spark you might think of Spark Streaming, but windows can be used on regular DataFrames. Window functions calculate an output value for every row of a DataFrame based on a group of rows. I have been working on optimizing some Spark code and have noticed a few places where the use of a window function eliminates the need for a join and speeds up the code. A common pattern where a window can be used to replace a join is when an aggregation is performed on a DataFrame and then the DataFrame resulting from the aggregation is joined to the original DataFrame. Let’s take a look at an example.
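Here is a PySpark sketch of that aggregate-then-join pattern and its window-function replacement (the linked article uses Scala; the column names here are illustrative):

```python
# Compute a per-customer total two ways: aggregate-and-join versus a window function.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [("c1", 10.0), ("c1", 15.0), ("c2", 7.5)],
    ["customer_id", "amount"])

# Join-based version: aggregate, then join the result back to the original DataFrame.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("customer_total"))
joined = orders.join(totals, on="customer_id")

# Window-based version: same result, no join.
w = Window.partitionBy("customer_id")
windowed = orders.withColumn("customer_total", F.sum("amount").over(w))

windowed.show()
```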

Read on for a few examples using the Scala flavor of Spark SQL.

Comments closed

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that, using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Comments closed

Azure Active Directory and the DatabricksPS Library

Gerhard Brueckl has updated the DatabricksPS library:

Databricks recently announced that it is now also supporting Azure Active Directory authentication for the REST API, which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With the Databricks REST API finally supporting Azure Active Directory authentication for regular users and service principals, this last manual step is finally gone as well!

If you do use Databricks and haven’t tried out DatabricksPS, I highly recommend it. I think it’s a much nicer experience than hitting the REST API directly, particularly because it deals with continuation tokens and making multiple calls to get your results.

Comments closed

Using UDFs in Spark without Registration

Sourabh Mehta shows how we can immediately call a user-defined function in Spark without registering it first:

Here, we will demonstrate the use of UDF via a small example.

Use Case: We need to change the value of an existing column of a DataFrame/Dataset by adding a prefix or suffix to the existing value, storing the result in a new column.
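A hedged PySpark sketch of that use case, wrapping a function with udf() and calling it inline without registering it (the column name and prefix are illustrative):

```python
# Call a UDF inline without spark.udf.register; registration is only needed
# if you want to call the function from SQL text.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

add_prefix = F.udf(lambda value: f"user_{value}", StringType())

df.withColumn("prefixed_name", add_prefix(F.col("name"))).show()
```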

I’m actually not sure what benefit you gain from not registering the UDF, but there probably is one.

Comments closed