Press "Enter" to skip to content

Category: Spark

An Introduction to Spark Streaming

Sarfaraz Hussain has started a series on Spark Streaming. The first post gives an introduction to the topic:

The philosophy behind the development of Structured Streaming is that,

We as end user should not have to reason about streaming”.

What that means is that we as end-user should only write batch like queries and its Spark’s job to figure out how to run it on a continuous stream of data and continuously update the result as new data flows in.

Sarfaraz then follows this up with a bit on the structure of a streaming query:

So, as new data comes in Spark breaks it into micro batches (based on the Processing Trigger) and processes it and writes it out to the Parquet file.

It is Spark’s job to figure out, whether the query we have written is executed on batch data or streaming data. Since, in this case, we are reading data from a Kafka topic, so Spark will automatically figure out how to run the query incrementally on the streaming data.

Check them both out.

Comments closed

HIVE-6384 Errors with Spark and Parquet

Manoj Pandey troubleshoots an issue:

But I was getting following error:

warning: there was one feature warning; re-run with -feature for details
java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384

 
As per the above error it relates to some Hive version conflict, so I tried checking the Hive version by running below command and found that it is pointing to an old version (0.13.0). This version of Hive metastore did not support the BINARY datatypes for parquet formatted files.

Read on to see how Manoj was able to fix the problem in Azure Databricks.

Comments closed

Spark 3.0’s Structured Streaming UI

Genamo Yu, et al, show off the Structured Streaming user interface built into Apache Spark 3.0:

When a developer submits a streaming SQL query, it will be listed in the Structured Streaming tab, which includes both active streaming queries and completed streaming queries. Some basic information for streaming queries will be listed in the result table, including query name, status, ID, run ID, submitted time, query duration, last batch ID as well as the aggregate information, like average input rate and average process rate. There are three types of streaming query status, i.e., RUNNINGFINISHED and FAILED. All FINISHED and FAILED queries are listed in the completed streaming query table. The Error column shows the exception details of a failed query.

Read on to learn more.

Comments closed

Spark Director Reader in Hive

Anishek Agarwal, et al, announce a new reader for Hive Warehouse Connector:

Apache Hive supports transactional tables which provide ACID guarantees. There has been a significant amount of work that has gone into hive to make these transactional tables highly performant. Apache Spark provides some capabilities to access hive external tables but it cannot access hive managed tables. To access hive managed tables from spark Hive Warehouse Connector needs to be used. 

We are happy to announce Spark Direct Reader mode in Hive Warehouse Connector which can read hive transactional tables directly from the filesystem. This feature has been available from CDP-Public-Cloud-2.0 (7.2.0.0) and CDP-DC-7.1 (7.1.1.0) releases onwards.

Hive Warehouse Connector (HWC) was available to provide access to managed tables in hive from spark, however since this involved communication with LLAP there was an additional hop to get the data and process it in spark vs the ability of spark to directly read the data from FileSystem for External tables. This leads to performance degradation in accessing data from managed tables vs external tables. Additionally a lot of use cases for HWC were associated with ETL jobs where a super user was running these jobs to update data in multiple tables hence authorization was not a strong business need for this case. HWC Spark Direct Reader is an additional mode available in HWC which tries to address the above concerns. This article describes the usage of spark direct reader to consume hive transactional table data in a spark application. It also introduces the methods and APIs to read hive transactional tables into spark dataframes. Finally, it demonstrates the transaction handling and semantics while using this reader.

Click through to learn how it works and see it in action.

Comments closed

Dates and Timestamps in Spark 3.0

Maxim Gekk, et al, look at the different date and time data types in Apache Spark 3.0:

The definition of a Date is very simple: It’s a combination of the yearmonth and day fields, like (year=2012, month=12, day=31). However, the values of the year, month and day fields have constraints, so that the date value is a valid day in the real world. For example, the value of month must be from 1 to 12, the value of day must be from 1 to 28/29/30/31 (depending on the year and month), and so on.

These constraints are defined by one of many possible calendars. Some of them are only used in specific regions, like the Lunar calendar. Some of them are only used in history, like the Julian calendar. At this point, the Gregorian calendar is the de facto international standard and is used almost everywhere in the world for civil purposes. It was introduced in 1582 and is extended to support dates before 1582 as well. This extended calendar is called the Proleptic Gregorian calendar.

Starting from version 3.0, Spark uses the Proleptic Gregorian calendar, which is already being used by other data systems like pandas, R and Apache Arrow. Before Spark 3.0, it used a combination of the Julian and Gregorian calendar: For dates before 1582, the Julian calendar was used, for dates after 1582 the Gregorian calendar was used. This is inherited from the legacy java.sql.Date API, which was superseded in Java 8 by java.time.LocalDate, which uses the Proleptic Gregorian calendar as well.

Even in this three-paragraph snippet, you can already get a feeling for how complex working with dates can be. Then throw in the complexities of time and you get a detailed post full of good information.

Comments closed

Transforming JSON to CSV: ADF vs Databricks

Rayis Imayev compares two methods of transforming a JSON-structured data set into a CSV:

There is a well known and broadly advertised message from Microsoft that Azure Data Factory (ADF) is a code-free environment to help you to create your data integration solutions – https://azure.microsoft.com/en-us/resources/videos/microsoft-azure-data-factory-code-free-cloud-data-integration-at-scale/. I agree and support this approach of using drag and drop visual UI to build and automate data pipelines without writing code. However, I’m also interested to try if I can recreate certain ADF operations by writing code, just out of my curiosity.

Rayis includes a link to the Azure Data Factory step-by-step demonstration and then kicks it up a notch with Databricks. Read on to see how the two compare.

Comments closed

Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

Higher-order Functions to easily manipulate arrays and structs
– Support for Apache Avro, a row-oriented data serialization framework
Custom Serialization using R functions to read and write any data format
Other Improvements such as compatibility with EMR 6.0 & Spark 3.0, and initial support for Flint time series library

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.

Comments closed

Survival Analysis in Spark

Rab Saker and Bryan Smith hit on a topic close to my heart:

These patterns seem to indicate that KKBox could actually differentiate between customers based on their lifetime potential using information known at the time of acquisition. This information might help inform or steer specific discounts or promotions to customers as they register for a trial. This information might also inform KKBox of which offerings or capabilities to discontinue as some, e.g. Initial Payment Method 35 or the 7-day payment plan as shown in Figure 3, align with exceptionally high churn rates in the first 30-days with little long-term survivorship.

Of course, there are relationships between these factors so that we should be careful in viewing them in isolation. By deriving a baseline risk (hazard) of customer churn (Figure 4), we can calculate the influence of different factors on the baseline in such a manner that each factor may be considered an independent hazard multiplier.  When combined (through simple multiplication) against the baseline, we can plot the a specific customer’s chances of abandoning a subscription by a given point in time (Table 1).

Click through for the story as well as a set of notebooks.

Comments closed

Calculating Spark Application Resource Allocations

The Hadoop in Real World team walks us through resource allocation for Spark applications:

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of resource allocation properties in YARN and also resource related properties in Spark. Let’s look at both.

This post covers the properties you want to keep an eye on when running Spark applications.

Comments closed

Downsides to Optimization in Spark SQL

Anuj Saxena takes us through some of the pros and cons of using the Catalyst Optimizer in Spark, including a couple of issues:

I am sure the optimizations make the calculation time very short and these optimizations are implemented in such a way that you just have to provide the logic and everything else will be done in abstraction. But as my friend and colleague Ramandeep says “Abstract features come with abstract issues”. So following are the few issues which I have faced in my recent interaction with Spark SQL:

1. Too large of a query to be stored in memory
2. Implicit optimizations interfere with partitioning

Click through for examples of this.

Comments closed