Press "Enter" to skip to content

Category: Spark

Adaptive Query Execution in Spark 3

Amarjeet Singh explains what Adaptive Query Execution is in Apache Spark:

As we all know, optimization plays an important role in the success of Spark SQL, and a lot of work has been done in this direction. Before Spark 3.0, cost-based optimization was the main approach: candidate plans are compared on estimated cost (based on time efficiency and estimated CPU and I/O usage) and the strategy which minimizes that cost is executed. But because it relies on statistics that can be outdated, it can produce sub-optimal plans. Therefore, Spark 3.0 introduced Adaptive Query Execution, which aims to solve this by re-optimizing and adjusting query plans based on runtime statistics collected during query execution. Re-optimization of the execution plan occurs after every stage, as the end of a stage is the best place to do it.

Item number 2 from the list is also available in SQL Server, giving you an idea that this is an active battleground for query processing in data platform technologies.
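If you want to try it out, AQE is controlled by a handful of standard Spark 3 configuration settings. Here is a minimal PySpark sketch of enabling it along with two of its common sub-features; defaults vary by Spark version, so check your own runtime:

```python
from pyspark.sql import SparkSession

# A minimal sketch: enable AQE and two of its common sub-features.
# These are standard Spark 3 configuration keys; defaults vary by version
# (AQE is off in 3.0/3.1 and on by default from 3.2), so check your runtime.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

# Confirm the setting took effect at runtime.
print(spark.conf.get("spark.sql.adaptive.enabled"))
```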

Comments closed

Why 200 Tasks for a Spark Execution?

The Hadoop in Real World team explains why you might see 200 tasks when running a Spark job:

It is quite common to see 200 tasks in one of your stages, and more specifically at a stage which requires a wide transformation. The reason for this is that wide transformations in Spark require a shuffle. Operations like join, group by, etc. are wide transformations, and they trigger a shuffle.

Read on to learn why 200, and whether 200 is the right number for you.
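The short version is that the number comes from spark.sql.shuffle.partitions, which defaults to 200. As a quick, generic sketch (not code from the article), you can check and override it, and any wide transformation after that will use the new partition count:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# The default number of post-shuffle partitions (and therefore tasks) is 200.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override it for your workload; the right value depends on data volume and cluster size.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# A wide transformation (groupBy + aggregation) triggers a shuffle;
# the resulting stage now runs 64 tasks instead of 200.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 8)
df.groupBy("bucket").count().show()
```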

Comments closed

SCD Type 2 with Delta Lake

Chris Williams continues a series on slowly changing dimensions in Delta Lake:

Type 2 SCD is probably one of the most common ways to preserve history in a dimension table and is used throughout any data warehousing/modelling architecture. Active rows can be indicated with a boolean flag or a start and end date. In the example from the table above, all active rows can be displayed simply by returning a query where the end date is null.

Read on to see how you can implement this pattern using Delta Lake’s capabilities.
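To give a flavor of the pattern, a Type 2 change in Delta Lake typically boils down to a MERGE that expires the current row plus an append of the new version. This is a generic sketch with made-up table and column names (dim_customer, stg_customer, address), not code from Chris's post:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hypothetical dimension and staging tables; names and columns are illustrative only.
dim = DeltaTable.forName(spark, "dim_customer")
updates = spark.table("stg_customer")

# Step 1: expire the current row for any customer whose tracked attribute changed.
(dim.alias("d")
 .merge(updates.alias("u"),
        "d.customer_id = u.customer_id AND d.end_date IS NULL")
 .whenMatchedUpdate(condition="d.address <> u.address",
                    set={"end_date": "current_date()"})
 .execute())

# Step 2: append the incoming rows as the new "active" versions (end_date is null).
# A full implementation would restrict this to rows that are new or actually changed.
(updates
 .withColumn("start_date", F.current_date())
 .withColumn("end_date", F.lit(None).cast("date"))
 .write.format("delta").mode("append").saveAsTable("dim_customer"))
```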

Comments closed

Databricks Notebook Discovery via Notebooks

Darin McBeth creates a meta-notebook to keep track of notebooks:

Elsevier has been a customer of Databricks for about six years. There are now hundreds of users and tens of thousands of notebooks across their workspace. To some extent, Elsevier’s Databricks users have been a victim of their own success, as there are now too many notebooks to search through to find some earlier work.

The Databricks workspace does provide a keyword search, but we often find the need to define advanced search criteria, such as creator, last updated, programming language, notebook commands and results.

Interestingly, we managed to achieve this functionality using a 100% notebook-based solution built on Databricks features. As you will see, this makes it easy to set up in a customer’s Databricks environment.

Read on to see how.
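For orientation, notebooks can be enumerated programmatically with the Databricks Workspace API. This is a minimal, hypothetical sketch (host, token, and path are placeholders), not the implementation described in the article:

```python
import requests

# Placeholders only; not values from the article.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

def list_notebooks(path="/"):
    """Recursively yield notebook objects under a workspace path via the Workspace API."""
    resp = requests.get(
        f"{HOST}/api/2.0/workspace/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"path": path},
    )
    resp.raise_for_status()
    for obj in resp.json().get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            yield from list_notebooks(obj["path"])
        elif obj["object_type"] == "NOTEBOOK":
            yield obj  # includes path, language, and object_id

for nb in list_notebooks("/Users"):
    print(nb["path"], nb.get("language"))
```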

Comments closed

Visualization in Spark with Drsti

Jean-Georges Perrin shows off a Spark library:

I was looking for an effortless data visualization tool that would interface easily with Apache Spark. I found a few interesting tools, but nothing that did not require some complex interfacing, setup, or infrastructure. In a good geek way, I then decided to write the tool. This lack of simple tools is how Drsti (pronounced drishti) was born.

Aren’t you tired of looking at dataframes that look like they came straight from a 1980 VT100? Sure, if you use notebooks, either standalone or hosted (IBM Watson Studio, Databricks…), you are not (or are less) confronted with the issue. However, if you are building pipelines outside of the Data Science toys, oops, tools, you may need to visualize data in a graph.

Read on to see how it works and some of what you can do with Drsti.

Comments closed

CI/CD with Databricks Notebooks and Azure DevOps

Michael Shtelma and Piotr Majer get us started on an MLOps journey:

This is the first part of a two-part series of blog posts that show how to configure and build end-to-end MLOps solutions on Databricks with notebooks and the Repos API. This post presents a CI/CD framework on Databricks which is based on notebooks. The pipeline integrates with the Microsoft Azure DevOps ecosystem for the Continuous Integration (CI) part and the Repos API for the Continuous Delivery (CD). In the second post, we’ll show how to leverage the Repos API functionality to implement a full CI/CD lifecycle on Databricks and extend it to a full-blown MLOps solution.

Click through for the article and a link to code. You can also see the pipeline YAML (and Python code it calls) in the repo.
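If you just want the flavor of the CD half, the Repos API call that points a workspace Repo at a validated branch looks roughly like the sketch below; the host, token, repo ID, and branch name are all placeholders rather than anything from the article:

```python
import requests

# Placeholders; in an Azure DevOps pipeline these would come from variables and secrets.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
REPO_ID = "<repo-id>"

# Point the workspace Repo at the branch the pipeline just validated,
# so jobs in the workspace run the code that passed CI.
resp = requests.patch(
    f"{HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "release"},
)
resp.raise_for_status()
print(resp.json())
```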

Comments closed

Spark Performance Improvements in Azure Synapse

Balaji Sankaran shows improvements Microsoft has made over open-source Apache Spark 3 in Azure Synapse Analytics:

Azure Synapse Analytics is continually focused on delivering a highly performant and scalable platform for supporting Spark workloads. We are focused on improving the query performance for the typical workload patterns that we see with our customers. By combining the latest open-source updates in Apache Spark with our team’s focus on performance updates, we have made significant performance gains in standard TPC-DS benchmarking tests.

I expect it will never be as fast as what Databricks can do, but getting a 2x performance improvement over the open source version of Spark is nothing to sneeze at.

Comments closed

Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone who has contributed towards a data warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a fact table and the categorised attributes of a dimension table. These dimensions are also affected by the passage of time and periodically require revised descriptions, which is why they are known as Slowly Changing Dimensions (SCDs). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where Delta Lake comes in. Using its many features, such as support for ACID transactions (atomicity, consistency, isolation, and durability) and schema enforcement, we can create the same durable SCDs. In the past, this may have required a series of complicated SQL statements. I will now discuss a few of the most common SCDs and show how they can be easily achieved using a few Databricks notebooks, which are available from my GitHub repo so you can download them and have a go:

https://github.com/cwilliams87/Blog-SCDs

Check out the repo, but be sure to read the whole post.
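For a taste of what this looks like, a Type 1 change simply overwrites the attribute in place, which in Delta Lake is a single MERGE. The sketch below uses invented table and column names, not code from Chris's repo:

```python
from delta.tables import DeltaTable

# Hypothetical table and column names, for illustration only.
dim = DeltaTable.forName(spark, "dim_customer")
updates = spark.table("stg_customer")

# Type 1: overwrite the attributes in place and insert brand-new keys; no history is kept.
(dim.alias("d")
 .merge(updates.alias("u"), "d.customer_id = u.customer_id")
 .whenMatchedUpdate(set={"address": "u.address", "phone": "u.phone"})
 .whenNotMatchedInsertAll()
 .execute())
```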

Comments closed

Data Lakehouse Point-of-Sale Analytics Demo

Bryan Smith and Rob Saker share a pattern:

Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Prior to the pandemic, 71% of retailers named lack of real-time visibility into inventory as a top obstacle to achieving their omnichannel goals. The pandemic only increased demand for integrated online and in-store experiences, placing even more pressure on retailers to present accurate product availability and manage order changes on the fly. Better access to real-time information is the key to meeting consumer demands in the new normal.

In this blog, we’ll address the need for real-time data in retail, and how to overcome the challenges of streaming point-of-sale data at scale with a data lakehouse.

It’s a cool scenario, at the least.
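The ingestion piece of a pattern like this usually comes down to Structured Streaming appending into a Delta table. Here is a generic sketch (the schema, paths, and table name are invented, and it assumes an existing SparkSession named spark), not the demo's actual code:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Invented point-of-sale event schema; the demo's real schema will differ.
pos_schema = StructType([
    StructField("store_id", StringType()),
    StructField("item_id", StringType()),
    StructField("quantity", IntegerType()),
    StructField("ts", TimestampType()),
])

# Read the landing zone as a stream and append it to a Delta "bronze" table.
pos_stream = (
    spark.readStream
    .schema(pos_schema)
    .json("/mnt/landing/pos/")  # placeholder landing path
)

(pos_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/pos_bronze")  # placeholder
 .outputMode("append")
 .toTable("pos_bronze"))
```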

Comments closed