Spark – Page 36 – Curated SQL

Pandas UDFs and Python Type Hints in Spark 3.0

Published 2020-05-29 by Kevin Feasel

Hyukjin Kwon announces some updates forthcoming in Apache Spark 3.0:

The Pandas UDFs work with Pandas APIs inside the function and Apache Arrow for exchanging data. It allows vectorized operations that can increase performance up to 100x, compared to row-at-a-time Python UDFs.
The example below shows a Pandas UDF to simply add one to each value, in which it is defined with the function called pandas_plus_one decorated by pandas_udf with the Pandas UDF type specified as PandasUDFType.SCALAR.

Click through for explanations and demos for each.

Comments closed

Feeding Databricks Output to Azure SQL Database

Published 2020-05-29 by Kevin Feasel

Arun Sirpal takes us through the process of moving data from Databricks into Azure SQL Database:

Recently I got to a stage where I leveraged Databricks to the best of my ability to join couple of CSV files together, play around some aggregations and then output it back to a different mount point ( based on Azure Storage) as a parquet file, I decided that I actually wanted to move this data into Azure SQL DB, which you may want to do one day.

This isn’t just dropping files into Blob Storage and picking them up, but rather a direct integration.

Comments closed

Two Ways to Create a DataFrame in Spark

Published 2020-05-26 by Kevin Feasel

Kundan Kumarr explains two techniques for creating DataFrames in Apache Spark:

There are two different ways to create a Dataframe in Spark. First, using toDF() and second is using createDataFrame(). In this blog we will see how we can create Dataframe using these two methods and what’s the exact difference between them.

Read on for the explanation and examples.

Comments closed

Spark Application Execution Modes

Published 2020-05-21 by Kevin Feasel

Kundan Kumarr explains how the two execution modes differ with Apache Spark:

Whenever we submit a Spark application to the cluster, the Driver or the Spark App Master should get started. And the Driver will be starting N number of workers. Spark driver will be managing spark context object to share the data and coordinates with the workers and cluster manager across the cluster. Cluster Manager can be Spark Standalone or Hadoop YARN or Mesos. Workers will be assigned a task and it will consolidate and collect the result back to the driver. A spark application gets executed within the cluster in two different modes – one is cluster mode and the second is client mode.

Click through for a comparison.

Comments closed

Azure Synapse Analytics in Preview

Published 2020-05-20 by Kevin Feasel

Simon Whiteley clarifies a Build announcement:

Today’s the day! There’s much buzz & excitement as we FINALLY get to see Azure Synapse Analytics in public preview, ready for us all to get our hands on it. There’s a raft of other announcements that come hand & hand with it too.
What’s that? You thought Azure Synapse Analytics was already available? You’ve been using all year and don’t see what the fuss is about??
I’m expecting this to be the common reaction. The marketing story for Synapse has been… interesting… to say the least. I’ve been asked several times in the last week exactly what the new story is and, given today’s news, I thought I’d clarify.

The big picture is the version of Azure Synapse Analytics I’ve been interested in for a bit, so it’s nice to see the movement here.

Comments closed

Spark UDFs and Error Handling

Published 2020-05-14 by Kevin Feasel

Bipin Patwardhan takes us through an error-handling scenario when writing a Spark User-Defined Function:

A couple of weeks ago, at my work place, I wrote a metadata-driven data validation framework for Spark. After the initial euphoria of having created the framework in Scala/Spark and Python/Spark, I started reviewing the framework. During the review, I noted that the User Defined Functions (UDF) I had written were prone to throw an error in certain situations.
I then explored various options to make the UDFs fail-safe.

These are like any other code: you want it to be as robust to failure as you can get it (or at least robust enough at the margin).

Comments closed

Don’t Install Hadoop on Windows

Published 2020-05-11 by Kevin Feasel

Hadi speaks truth:

A few days ago, I published the installation guides for Hadoop, Hive, and Pig on Windows 10. And yesterday, I finished installing and configuring the ecosystem. The only consequence I have is that “Think 1000 times before installing Hadoop and related technologies on Windows!”.

The biggest problem is that Microsoft got flaky about this. Back in 2012-2013, they backed running Hadoop on Windows as part of getting HDInsight up and running. I even remember the HDInsight emulator which could run on a local desktop. By 2014 or so, they shifted directions and decided it wasn’t worth the effort. Because Apache Spark (which does have pretty decent Windows support, at least for development) really wants Hive, you can fake it with winutils.

Comments closed

Technology Choices for Streaming Pipelines

Published 2020-05-08 by Kevin Feasel

The Hadoop in Real World team takes us through different tools available when working on streaming pipelines:

Businesses want to get insights as quickly as possible and do not want to wait for a day, like before, to bring up a report to understand what happened till yesterday. They require a more proactive approach that can help to act immediately when something significant happens and also to prevent the system from any faults/downtime before it occurs. Imagine you are buying some product from an e-retailer and you have gone till the point to make payment and something happened that caused the payment not to go through successfully. At that very moment, you are having a second thought about whether to buy the product now or later. Suppose, if the business is getting a report of this occurrence next day, it would not be of much use for them as the customer would have already bought it from somewhere or decided against it. This is where real-time events and insights come in. If it were a real-time report, the team would have called up the customer and made the purchase by offering some discounts, which in turn would have changed the mind of the customer.

Click through for a high-level discussion of these tools.

Comments closed

Security Practices for Azure Databricks

Published 2020-05-07 by Kevin Feasel

Abhinav Garg and Anna Shrestinian walk us through good security practices when using Azure Databricks:

Azure Databricks is a Unified Data Analytics Platform that is a part of the Microsoft Azure Cloud. Built upon the foundations of Delta Lake, MLflow, Koalas and Apache Spark^TM, Azure Databricks is a first party PaaS on Microsoft Azure cloud that provides one-click setup, native integrations with other Azure cloud services, interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Business Analysts and SecOps / Cloud Engineering.
In this article, we will share a list of cloud security features and capabilities that an enterprise data team could utilize to bake their Azure Databricks environment as per their governance policy.

Much of this is fairly straightforward, but it is nice to have it all in one place.

Comments closed

Dynamic File Pruning on Delta Lake

Published 2020-05-06 by Kevin Feasel

Ali Afroozeh, et al, take us through Dynamic File Pruning in Databricks Runtime 6.1:

In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible. This can be achieved because Delta Lake automatically collects metadata about data files managed by Delta Lake and so, data can be skipped without data file access. Prior to Dynamic File Pruning, file pruning only took place when queries contained a literal value in the predicate but now this works for both literal filters as well as join filters. This means that Dynamic File Pruning now allows star schema queries to take advantage of data skipping at file granularity.

There are some interesting performance results here. I’d also be curious to see how robust the results are as queries get more complicated

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark