Press "Enter" to skip to content

Category: Spark

A Brief Overview of Azure Synapse Analytics

Ginger Grant gives us the nickel tour of why Azure Synapse Analytics is interesting:

In the past few months, I have been examining Azure Synapse and what it can do.  When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW.  For this reason, many people think that Synapse is just an improved version of a cloud data warehouse.  Microsoft did improve SQL DW when it moved it to Synapse.  The biggest architectural design change is the separation of the code from the compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change.  Within Synapse, resources are allocated as Pools and you can define a sql pools to run data warehouse and later change the compute to a different resource.  You will still need to partition your DW as large datasets require partitioning to perform well.  Subsequently Microsoft Released the Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

But it’s more than that. Read on to see what else is available.

Comments closed

TF-IDF using Spark .NET

Ed Elliott shows how you can use the Spark .NET library to perform machine learning in Apache Spark:

Native spark has two API’s for creating your ML applications. The historical one is Spark.MLLib and the newer API is Spark.ML. A little bit like how there was the old RDD API which the DataFrame API superseded, Spark.ML supersedes Spark.MLLib.

At the end of last year, .NET for Apache Spark had no support for either the Spark.ML or Spark.MLLib API’s so I started implementing Spark.ML. In a similar way that .NET for Apache Spark supports the DataFrame API and not the RDD API, I started with Spark.ML and I believe that having the full Spark ML API will be enough for anyone.

It’s awesome that Ed is helping to move Spark .NET forward in this way.

Comments closed

Comparing Koalas to PySpark

Tori Tompkins gives us an understanding of where Koalas fits in the Spark world:

One significant difference between Spark’s implementation of Dataframes and pandas is its immutability.

With Spark dataframes, you are unable to make changes to the existing object but rather create a brand new dataframe based on the old one. Pandas dataframes, however, allow you to edit the object in place. With Koalas, whilst still spark Dataframes under the hood, have kept the mutable syntax of pandas.

It does this by introducing this concept of an ‘Internal Frame’. This holds the spark immutable dataframe and manages the mapping between the Koalas column names and Spark column names. It also manages the Koalas index names to spark column name to replicate the index functionality in pandas (covered below). It acts as a bridge between Spark and Koalas by mimicking the pandas API with Spark. This Internal Frame replicates the mutable functionality of pandas by creating copies of the internal frame but appearing to be mutable.

Read the whole thing.

Comments closed

Good Practices when Combining Spark with Cassandra

Valerie Parham-Thompson shares some insights for working with Spark and Cassandra together:

Although we are focusing on Cassandra as the data storage in this presentation, other storage sources and destinations are possible. Another frequently used data storage option is Hadoop HDFS. The previously mentioned spark-cassandra-connector has capabilities to write results to Cassandra, and in the case of batch loading, to read data directly from Cassandra.

Native data output formats available include both JSON and Parquet. The Parquet format in particular is useful for writing to AWS S3. See https://aws.amazon.com/about-aws/whats-new/2018/09/amazon-s3-announces-new-features-for-s3-select/ for more information on querying S3 files stored in Parquet format. A good use case for this is archiving data from Cassandra.

Read on for more advice.

Comments closed

The Basics of Spark Streaming

Muskan Gupta gives us an introduction to Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources such as Kafka, Flume etc. and process it using functions such as map, reduce etc. After processing we can push data to filesystem, databases and even to live dashboards.

In Spark Streaming we work on near real time data. It divides the received input stream into batches. The Spark Engine processes the batches and generate final output in batches.

Read on to understand the key mechanisms behind Spark Streaming.

Comments closed

Apache Spark Connector for SQL Server

The SQL Server team announces an open-sourced Apache Spark connector for SQL Server:

The Apache Spark Connector for SQL Server and Azure SQL is based on the Spark DataSourceV1 API and SQL Server Bulk API and uses the same interface as the built-in JDBC Spark-SQL connector. This allows you to easily integrate the connector and migrate your existing Spark jobs by simply updating the format parameter! 

This appears to be different from the old Spark connector to Azure SQL Database and SQL Server. Also, for anyone potentially confused between it and PolyBase, this is going in the opposite direction: the Spark connector lets you access a SQL Server from an Apache Spark cluster, reading SQL Server’s data and processing it across a number of executor nodes. By contrast, PolyBase lets you read data stored in Spark SQL tables from SQL Server, virtualizing it so that it looks like a regular SQL Server table.

Comments closed

Calculating Partitions for Processing Data Files in Apache Spark

Ajay Gupta digs into how to calculate the number of partitions the different Spark APIs use when reading from files:

Until recently, the process of picking up a certain number of partitions against a set of data files, always looked mysterious to me. However, recently, during an optimization routine, I wanted to change the default number of partitions picked by Spark for processing a set of data files, and that is when I started to decode this process comprehensively along with proofs. Hopefully, the description of this decoded process would also help the readers to understand Spark a bit deeper and would enable them to design an efficient and optimized Spark routine.

This is important information if you’re tuning Spark cluster performance.

Comments closed

Window Functions in Spark SQL

Juoko Virtanen walks us through window functions in Spark SQL:

When you think of windows in Spark you might think of Spark Streaming, but windows can be used on regular DataFrames. Window functions calculate an output value for every row of a DataFrame based on a group of rows. I have been working on optimizing some Spark code and have noticed a few places where the use of a window function eliminates the need for a join and speeds up the code. A common pattern where a window can be used to replace a join is when an aggregation is performed on a DataFrame and then the DataFrame resulting from the aggregation is joined to the original DataFrame. Let’s take a look at an example.

Read on for a few examples using the Scala flavor of Spark SQL.

Comments closed

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with the substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Comments closed