Press "Enter" to skip to content

Category: Spark

Adaptive Query Execution with Spark SQL

Wenchen Fan, Herman van Hövell, and MaryAnn Xue announce Adaptive Query Execution for Apache Spark 3.0:

Over the years, there’s been an extensive and continuous effort to improve Spark SQL’s query optimizer and planner in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark choose better plans. Examples of these cost-based optimization techniques include choosing the right join type (broadcast hash join vs. sort merge join), selecting the correct build side in a hash-join, or adjusting the join order in a multi-way join. However, outdated statistics and imperfect cardinality estimates can lead to suboptimal query plans. Adaptive Query Execution, new in the upcoming Apache Spark™ 3.0 release and available in the Databricks Runtime 7.0 beta, now looks to tackle such issues by reoptimizing and adjusting query plans based on runtime statistics collected in the process of query execution.

One of the biggest advantages of SQL as a fourth-generation language is that the database engine (whether that be SQL Server, Oracle, or Spark) gets the opportunity to write and re-write the set of operations needed to solve a query to try to find the best path which returns the same result set. These optimizations aren’t perfect, as any query tuner can tell you, but they can go a long way.
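If you want to try AQE yourself, enabling it is a configuration switch. Here's a minimal PySpark sketch, assuming Spark 3.0 (the spark.sql.adaptive.* keys are the standard configuration settings, and AQE is off by default in this release):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# Adaptive Query Execution is disabled by default in Spark 3.0
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Optional AQE features: coalesce shuffle partitions and handle skewed joins
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

With those set, subsequent queries can have their shuffle partition counts, join strategies, and skew handling adjusted at runtime rather than fixed at planning time.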


Pandas UDFs and Python Type Hints in Spark 3.0

Hyukjin Kwon announces some updates forthcoming in Apache Spark 3.0:

Pandas UDFs work with Pandas APIs inside the function and Apache Arrow for exchanging data. They allow vectorized operations that can increase performance up to 100x compared to row-at-a-time Python UDFs.

The example below shows a Pandas UDF that simply adds one to each value: a function called pandas_plus_one, decorated with pandas_udf and with the Pandas UDF type specified as PandasUDFType.SCALAR.
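As a rough sketch of both styles, assuming an active SparkSession named spark (the column and sample data are made up for illustration):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Spark 2.x style described above: decorate with pandas_udf and PandasUDFType.SCALAR
@pandas_udf("long", PandasUDFType.SCALAR)
def pandas_plus_one(s):
    return s + 1

# Spark 3.0 style: Python type hints on pandas.Series describe the same UDF
@pandas_udf("long")
def pandas_plus_one_typed(s: pd.Series) -> pd.Series:
    # Vectorized: the function receives a whole pandas Series at once
    return s + 1

df = spark.range(10)
df.select(pandas_plus_one_typed("id")).show()
```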

Click through for explanations and demos for each.


Feeding Databricks Output to Azure SQL Database

Arun Sirpal takes us through the process of moving data from Databricks into Azure SQL Database:

Recently I got to a stage where I had leveraged Databricks to join a couple of CSV files together, play around with some aggregations, and then output the result back to a different mount point (based on Azure Storage) as a Parquet file. I decided that I actually wanted to move this data into Azure SQL DB, which you may want to do one day.

This isn’t just dropping files into Blob Storage and picking them up, but rather a direct integration.
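One common route (not necessarily the exact one Arun takes) is Spark's built-in JDBC writer; the server, database, table, and credential values below are placeholders:

```python
# Placeholder connection details for an Azure SQL Database
jdbc_url = (
    "jdbc:sqlserver://yourserver.database.windows.net:1433;"
    "database=yourdatabase;encrypt=true;loginTimeout=30;"
)

# df is the aggregated DataFrame produced earlier in the notebook
(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.SalesAggregates")
   .option("user", "your_sql_user")
   .option("password", "your_sql_password")
   .mode("append")
   .save())
```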


Spark Application Execution Modes

Kundan Kumarr explains how the two execution modes differ with Apache Spark:

Whenever we submit a Spark application to the cluster, the driver (the Spark application master) gets started, and the driver starts N workers. The Spark driver manages the SparkContext object to share data, and it coordinates with the workers and the cluster manager across the cluster. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers are assigned tasks, and the driver consolidates and collects the results back. A Spark application gets executed within the cluster in two different modes: cluster mode and client mode.
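As a quick illustration (the master, application name, and paths are placeholders), the mode is chosen at submission time with spark-submit's --deploy-mode flag:

```
# Cluster mode: the driver runs on a node inside the cluster
spark-submit --master yarn --deploy-mode cluster my_app.py

# Client mode: the driver runs on the machine that submitted the job
spark-submit --master yarn --deploy-mode client my_app.py
```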

Click through for a comparison.


Azure Synapse Analytics in Preview

Simon Whiteley clarifies a Build announcement:

Today’s the day! There’s much buzz & excitement as we FINALLY get to see Azure Synapse Analytics in public preview, ready for us all to get our hands on it. There’s a raft of other announcements that come hand & hand with it too.

What’s that? You thought Azure Synapse Analytics was already available? You’ve been using it all year and don’t see what the fuss is about??

I’m expecting this to be the common reaction. The marketing story for Synapse has been… interesting… to say the least. I’ve been asked several times in the last week exactly what the new story is and, given today’s news, I thought I’d clarify.

The big picture is the version of Azure Synapse Analytics I’ve been interested in for a bit, so it’s nice to see the movement here.


Spark UDFs and Error Handling

Bipin Patwardhan takes us through an error-handling scenario when writing a Spark User-Defined Function:

A couple of weeks ago, at my workplace, I wrote a metadata-driven data validation framework for Spark. After the initial euphoria of having created the framework in Scala/Spark and Python/Spark, I started reviewing the framework. During the review, I noted that the User Defined Functions (UDFs) I had written were prone to throwing errors in certain situations.

I then explored various options to make the UDFs fail-safe.

UDFs are like any other code: you want them to be as robust to failure as you can get them (or at least robust enough at the margin).
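As a minimal sketch of the general pattern (this is not Bipin's framework, just the usual fail-safe approach): wrap the risky logic in a try/except so that a bad row produces a NULL instead of failing the whole stage.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("failsafe-udf").getOrCreate()

def parse_int(value):
    # Fail-safe: return None (NULL in the DataFrame) instead of raising
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

parse_int_udf = udf(parse_int, IntegerType())

df = spark.createDataFrame([("42",), ("not a number",), (None,)], ["raw"])
df.withColumn("parsed", parse_int_udf("raw")).show()
```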


Don’t Install Hadoop on Windows

Hadi speaks truth:

A few days ago, I published the installation guides for Hadoop, Hive, and Pig on Windows 10. And yesterday, I finished installing and configuring the ecosystem. The only conclusion I have is this: “Think 1000 times before installing Hadoop and related technologies on Windows!”

The biggest problem is that Microsoft got flaky about this. Back in 2012-2013, they backed running Hadoop on Windows as part of getting HDInsight up and running. I even remember the HDInsight emulator, which could run on a local desktop. By 2014 or so, they shifted directions and decided it wasn’t worth the effort. Because Apache Spark (which does have pretty decent Windows support, at least for development) really wants Hive, you can at least fake that part with winutils.
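If you do need local Spark development on Windows, the usual workaround is to point HADOOP_HOME at a folder containing winutils.exe before starting Spark. A sketch, with a made-up local path:

```python
import os

# Hypothetical path: winutils.exe must live in %HADOOP_HOME%\bin
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"
```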


Technology Choices for Streaming Pipelines

The Hadoop in Real World team takes us through different tools available when working on streaming pipelines:

Businesses want to get insights as quickly as possible and do not want to wait a day, as they once did, for a report on what happened up to yesterday. They require a more proactive approach that helps them act immediately when something significant happens and also prevents faults or downtime before they occur. Imagine you are buying a product from an e-retailer, you have gotten to the point of making payment, and something happens that causes the payment not to go through successfully. At that very moment, you are having second thoughts about whether to buy the product now or later. If the business only gets a report of this occurrence the next day, it is not of much use, as the customer will already have bought the product elsewhere or decided against it. This is where real-time events and insights come in. With a real-time report, the team could have called the customer and closed the sale by offering a discount, which in turn might have changed the customer's mind.

Click through for a high-level discussion of these tools.


Security Practices for Azure Databricks

Abhinav Garg and Anna Shrestinian walk us through good security practices when using Azure Databricks:

Azure Databricks is a Unified Data Analytics Platform that is a part of the Microsoft Azure Cloud. Built upon the foundations of Delta Lake, MLflow, Koalas, and Apache Spark™, Azure Databricks is a first-party PaaS on the Microsoft Azure cloud that provides one-click setup, native integrations with other Azure cloud services, an interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Business Analysts, and SecOps / Cloud Engineering.

In this article, we will share a list of cloud security features and capabilities that an enterprise data team can use to configure their Azure Databricks environment in line with their governance policy.

Much of this is fairly straightforward, but it is nice to have it all in one place.
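As one concrete example of the kind of capability the article covers, keeping credentials out of notebook code via a secret scope looks roughly like this (the scope and key names are placeholders; dbutils is available in Databricks notebooks):

```python
# Read a credential from a (for example, Key Vault-backed) secret scope
jdbc_password = dbutils.secrets.get(scope="keyvault-backed-scope", key="sql-password")
```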
