Spark – Page 30 – Curated SQL

Q&A about the Lakehouse

Published 2020-11-20 by Kevin Feasel

Terry McCann posts Q&A from Simon Whiteley’s session on Lakehouse models in Spark 3.0:

“WHILE ALL THE HADOOP PROVIDERS PROMOTED THE DATALAKE PARADIGM BACK THEN, HOW THE INDUSTRY AND THE OTHER DATA LAKE PROVIDERS ARE SHIFTING TO/CONSIDERING THE LAKE HOUSE PARADIGM?“
It’s a direction that most providers are heading in, albeit under the “unified analytics” or “modern warehouse” name rather than the “lakehouse”. But most big relational engines are moving to bring in spark/big data capabilities, other lake providers are looking to expand their SQL coverage. It’s a bit of a race to who gets to the “can do both sides as well as a specialist tool” point first. Will we see other tools championing it as a “lakehouse”, or is that term now tied too closely as a “vendor-specific” term coming from Databricks? We’ll see…

Click through for some good questions and thoughtful answers.

Comments closed

The Evolving Lakehouse

Published 2020-11-13 by Kevin Feasel

Simon Whiteley looks at the current status of the Lakehouse model:

We have discussed in the past this idea of the lakehouse, the aspirational target of many analytics platforms these days of combining the huge power and potential of data lakes with the rigour, reliability and concurrency of a data warehouse. It’s an interesting concept but has, in the past, been firmly an aspiration.
In the world without lakehouses, we often see the “Modern Data Warehouse”, this two-phased approach to providing a holistic platform – we load our early data into a lake where we shape it and massage it into an understandable state. It is here we perform data science, exploratory data analysis, early sight analytics prototyping and various other functions that don’t quite fit into a data warehouse… but then we load our data into a relational store for serving to the business. This is where we can meet their demands for a rich SQL environment, auditable data models and rigorous change procedures. Essentially, we store data twice so that we can achieve the best of both worlds.

Definitely read Simon’s take on it. My take is that the Lakehouse concept will start to be useful to specific companies in about 2-3 years, as I don’t think the performance is there today.

Comments closed

Automating Hadoop Workflows with Spark and Oozie

Published 2020-11-06 by Kevin Feasel

Prashanth Jayaram walks us through automating a sample data transfer with tools like Sqoop, Spark, and Oozie:

In the process of building a data product one would end-up applying many resource-intensive analytical operations on a medium to large data-set in an efficient way. Apache Spark is the bet in this scenario to perform faster job execution by caching data in memory and enabling parallelism in a distributed data environments.
Components involved in Spark implementation:
1. Initialize spark session using scala program
2. Ingest data from data lake through hive queries
3. Apply business logic using scala constructs or hive queries
4. Load data into HDFS or Hive targets
5. Execute spark programs through spark submit

Read on for a sample flow.

Comments closed

Join Execution in Apache Spark

Published 2020-11-04 by Kevin Feasel

Ajay Gupta takes us through join operations in Apache Spark:

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.
At a very high level, Join operates on two input data sets and the operation works by matching each of the data records belonging to one of the input data sets with every other data record belonging to another input data set. On finding a match or a non-match (as per a given condition), the Join operation could either output an individual record, being matched, from either of the two data sets or a Joined record. The joined record basically represents the combination of individual records, being matched, from both the data sets.

Click through for more information on the mechanics of joining, including trade-offs between types of physical join operators.

Comments closed

Adding Libraries in Databricks

Published 2020-11-03 by Kevin Feasel

Arun Sirpal has some third-party libraries to add:

It is a really common requirement to add specific libraries to databricks. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.
Libraries can be added in 3 scopes. Workspace, Notebook-scoped and cluster. I want to show you have easy it is to add (and search) for a library that you can add to the cluster, so that all notebooks attached to the cluster can leverage the library.

I’m hoping that loading libraries in Azure Synapse Analytics will, at some point, be this convenient.

Comments closed

Dotnet-Spark UDFs and Missing Shared State

Published 2020-11-02 by Kevin Feasel

Ed Elliott uncovers a mystery:

To understand this we need to take a look at how we can create a UDF in .NET that is called by the Java VM Apache Spark code because, that is logically, what happens. In our application we call into Apache Spark and ask it to do things like read from a file, run some transformation and write files back out again. With UDF’s, we ask Spark to run a UDF and Spark comes back to our UDF, passing it some data and asks the UDF to execute but the Java VM does not understand how to execute .NET code.

Read the whole thing.

Comments closed

Using Key Vault in Azure Databricks

Published 2020-10-29 by Kevin Feasel

Arun Sirpal shows us how easy it is to tie Azure Key Vault into Azure Databricks:

The key vault should always be a core component of your Azure design because we can store keys, secrets, certicates thus abstract / hide the true connection string within files. When working with databricks to mount storage to ingest your data and query it ideally you should be leveraging this to create secrets and secret scopes.

Click through for a demo.

Comments closed

Outlier Identification Using Spark 3.0

Published 2020-10-27 by Kevin Feasel

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.
In Spark you can use a SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for median we can provide ‘array(0.5)’ because we want the 50% value ie median) and will return an array of results.
Like standard deviation, to use MAD to identify the outliers it needs to be a certain number of MAD’s away. This number is also referred to as the threshold and is defaulted to 3.

Read on for three measures and their implementations in PySpark.

Comments closed

Features and Improvements in Spark 3.0

Published 2020-10-27 by Kevin Feasel

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With Spark 3.0 release (on June 2020) there are some major improvements over the previous releases, some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning and other performance optimization and enhancements.
Below I’ve listed out these new features and enhancements all together in one page for better understanding and future reference.

Click through for the summary.

Comments closed

Adaptive Query Execution in Databricks

Published 2020-10-26 by Kevin Feasel

MaryAnn Xue and Allison Wang explain how Adaptive Query Execution works with Databricks:

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.
AQE offers a trouble-free solution here by switching to the faster broadcast hash join during execution time.

This is pretty similar to Adaptive Query Processing in SQL Server.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark