Press "Enter" to skip to content

Category: Spark

Spark RDDs and DataFrames

Ayush Hooda explains the difference between RDDs and DataFrames:

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame.

Before exploring these APIs, let’s understand the need for these APIs.

I like the piece about RDDs being better at explaining the how than the what.

Comments closed

Selecting a List of Columns from Spark

Unmesha SreeVeni shows us how we can create a list of column names in Scala to pass into a Spark DataFrame’s select function:

Now our example dataframe is ready.
Create a List[String] with column names.
scala> var selectExpr : List[String] = List("Type","Item","Price") selectExpr: List[String] = List(Type, Item, Price)

Now our list of column names is also created.
Lets select these columns from our dataframe.
Use .head and .tail to select the whole values mentioned in the List()

Click through for a demo.

Comments closed

Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.

Comments closed

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.

Comments closed

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2:

Delta Time Travel
Time Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.
Sample code
SELECT count(FROM events TIMESTAMP AS OF timestamp_expression
SELECT count(
FROM events VERSION AS OF version

Time travel looks a bit like temporal tables in SQL Server.

Comments closed

Native Math Libraries And Spark ML

Zuling Kang shares with us how we can use native math libraries in netlib-java to speed up certain machine learning algorithms in Apache Spark:

Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing.  netlib-java is a wrapper for low-level BLASLAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of Spark nor the community version of Apache Spark includes the netlib-java native proxies by default. So without manual configuration, netlib-java only uses the F2J library, a Java-based math library that is translated from Fortran77 reference source code.

To check whether you are using native math libraries in Spark ML or the Java-based F2J, use the Spark shell to load and print the implementation library of netlib-java. The following commands return information on the BLAS library and include that it is using F2J in the line, “com.github.fommil.netlib.F2jBLAS,” which is highlighted below:

In the examples here, you can get about a 2x difference using the native math libraries versus without, so although that’s not an order of magnitude difference, it’s still nothing to sneeze at.

Comments closed

Overwriting Data In Use With Databricks

Piotr Starczynski shows us how we can read data from a table, transform it, and write it back to the same file:

Recently I have reached interesting problem in Databricks Non delta. I tried to read data from the the table (table on the top of file) slightly transform it and write it back to the same location that i have been reading from. Attempt to execute code like that would manifest with exception:“org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from”

The lets try to answer the question How to write into a table(dataframe) that we are reading from as this might be a common use case?

This problem is trivial but it is very confusing, if we do not understand how queries are processed in spark.

Click through for the answer. I’m a little squeamish about doing this because my expectation is for data to flow from one source to another source; feeding the data back to the initial source feels strange, like running a load of clothes through the washer and dryer and then dumping them back into the hamper with the remainder of the dirty clothes.

Comments closed

Azure Data Factory Data Flows

Joost van Rossum takes a look at data flows in Azure Data Factory:

2) Create Databricks Service
Yes you are reading this correctly. Under the hood Data Factory is using Databricks to execute the Data flows, but don’t worry you don’t have to write code.
Create a Databricks Service and choose the right region. This should be the same as your storage region to prevent high data movement costs. As Pricing Tier you can use Standard for this introduction. Creating the service it self doesn’t cost anything.

Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool but even within that space, there are times when you need to do a bit more than pump-and-dump.

Comments closed

Spark And Splitting DataFrames

Giovanni Lanzani explains that one technique to split a data frame doesn’t quite work as expected:

Recently I was delivering a Spark course. One of the exercises asked the students to split a Spark DataFrame in two, non-overlapping, parts.

One of the students came up with a creative way to do so.

He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id — you can find how to use it in the docs.

Read on to see how this didn’t quite work right, why it didn’t work as expected, and one alternative.

Comments closed

A Functional Approach To PySpark

Tristan Robinson shows us how we can implement a transform function which makes Python code look a little bit more functional:

After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.

The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.

Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.

Comments closed