Category: Hadoop

Why 200 Tasks for a Spark Execution?

The Hadoop in Real World team explains why you might see 200 tasks when running a Spark job:

It is quite common to see 200 tasks in one of your stages, more specifically at a stage which requires a wide transformation. The reason for this is that wide transformations in Spark require a shuffle. Operations like join, group by, etc. are wide transformations and they trigger a shuffle.

Read on to learn why 200, and whether 200 is the right number for you.
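
Not part of the post, but for reference: the number comes from spark.sql.shuffle.partitions, which defaults to 200, so any stage fed by a shuffle runs 200 tasks unless you change the setting (or adaptive query execution coalesces partitions for you). A minimal PySpark sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# Stages fed by a shuffle use spark.sql.shuffle.partitions partitions,
# and that setting defaults to 200 -- hence the 200 tasks.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" by default

# groupBy is a wide transformation, so the aggregation stage runs 200 tasks
# (unless adaptive query execution coalesces small partitions).
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().show()

# Tune the value to your data volume and cluster size; 200 is rarely ideal
# for very small or very large shuffles.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```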

SCD Type 2 with Delta Lake

Chris Williams continues a series on slowly changing dimensions in Delta Lake:

Type 2 SCD is probably one of the most common ways to preserve history in a dimension table and is commonly used throughout any Data Warehousing/Modelling architecture. Active rows can be indicated with a boolean flag or a start and end date. In this example from the table above, all active rows can be displayed simply by returning a query where the end date is null.

Read on to see how you can implement this pattern using Delta Lake’s capabilities.
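
Chris's notebooks are the place to go for his implementation; purely as a rough, generic sketch of the pattern (the dim_customer and staging_customer_updates names and the address column are placeholders, not from the post), a Type 2 merge in PySpark on Delta Lake can look like this:

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Assumes the ambient SparkSession (spark) of a Databricks notebook.
dim = DeltaTable.forName(spark, "dim_customer")        # placeholder dimension table
updates = spark.table("staging_customer_updates")      # placeholder staging data

# Step 1: expire the currently active row for each key whose attributes changed.
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.end_date IS NULL")
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"end_date": "current_date()", "is_active": "false"})
    .execute())

# Step 2: append the new versions as the active rows. A real pipeline would
# first filter the updates down to keys that are genuinely new or changed.
(updates
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_active", F.lit(True))
    .write.format("delta").mode("append").saveAsTable("dim_customer"))
```

Active rows are then exactly the ones where the end date is null, matching the flag-or-date approach in the excerpt.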

Databricks Notebook Discovery via Notebooks

Darin McBeth creates a meta-notebook to keep track of notebooks:

Elsevier has been a customer of Databricks for about six years. There are now hundreds of users and tens of thousands of notebooks across their workspace. To some extent, Elsevier’s Databricks users have been a victim of their own success, as there are now too many notebooks to search through to find some earlier work.

The Databricks workspace does provide a keyword search, but we often find the need to define advanced search criteria, such as creator, last updated, programming language, notebook commands and results.

Interestingly, we managed to achieve this functionality using a 100% notebook-based solution with Databricks functionalities. As you will see, this makes it easy to set up in a customer’s Databricks environment.

Read on to see how.
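
Their solution lives entirely in notebooks; purely to illustrate the underlying plumbing (placeholder host and token, and not the authors' code), the Workspace API's list and export endpoints are enough to crawl notebook sources into something searchable:

```python
import base64
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                         # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def crawl(path="/"):
    """Recursively list workspace objects and yield notebook metadata plus source."""
    resp = requests.get(f"{HOST}/api/2.0/workspace/list",
                        headers=HEADERS, params={"path": path})
    for obj in resp.json().get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            yield from crawl(obj["path"])
        elif obj["object_type"] == "NOTEBOOK":
            exported = requests.get(f"{HOST}/api/2.0/workspace/export",
                                    headers=HEADERS,
                                    params={"path": obj["path"], "format": "SOURCE"})
            source = base64.b64decode(exported.json()["content"]).decode("utf-8")
            yield {"path": obj["path"], "language": obj.get("language"), "source": source}

# The results can then be loaded into a DataFrame or Delta table and searched
# by creator, language, notebook commands, and so on.
notebooks = list(crawl("/Users"))
```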

Partitioning vs Bucketing in Hive

The Hadoop in Real World team explains the difference between partitioning and bucketing in Apache Hive tables:

Now let’s say you also filter the sales records by sku (stock-keeping unit, aka barcode) in addition to sale_date and country. Creating a partition on sku will result in many partitions, which is not ideal as it can leave you with a large number of small, uneven partitions.

Hadoop is not efficient in processing small volumes of data. There is a better way.

Read on to understand when each technique makes sense.
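
As a generic sketch of the idea (the sales table and its columns are made up for illustration): partition on the low-cardinality columns you filter by and bucket on the high-cardinality sku, so sku values are hashed into a fixed number of files rather than spawning one tiny partition per value.

```python
from pyspark.sql import SparkSession

# Hive DDL issued through Spark with Hive support; the same statement works in beeline.
spark = (SparkSession.builder
         .appName("partition-vs-bucket")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        sku      STRING,
        quantity INT,
        amount   DECIMAL(10, 2)
    )
    PARTITIONED BY (sale_date STRING, country STRING)  -- low cardinality: partitions
    CLUSTERED BY (sku) INTO 32 BUCKETS                 -- high cardinality: buckets
    STORED AS ORC
""")
```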

Visualization in Spark with Drsti

Jean-Georges Perrin shows off a Spark library:

I was looking for an effortless data visualization that would interface easily with Apache Spark. I found a few interesting tools, but nothing that would not require some complex interfacing, setup, or infrastructure. In a good geek way, I then decided to write the tool. This lack of simple tools is how Drsti (pronounced drishti) was born.

Aren’t you tired of looking at dataframes that look like they came straight from a 1980 VT100? Sure, if you use notebooks, either standalone or hosted (IBM Watson Studio, Databricks…), you are less (or not at all) confronted with the issue. However, if you are building pipelines outside of the Data Science toys, oops, tools, you may need to visualize data in a graph.

Read on to see how it works and some of what you can do with Drsti.

CI/CD with Databricks Notebooks and Azure DevOps

Michael Shtelma and Piotr Majer get us started on an MLOps journey:

This is the first part of a two-part series of blog posts that show how to configure and build end-to-end MLOps solutions on Databricks with notebooks and the Repos API. This post presents a CI/CD framework on Databricks which is based on notebooks. The pipeline integrates with the Microsoft Azure DevOps ecosystem for the Continuous Integration (CI) part and the Repos API for the Continuous Delivery (CD) part. In the second post, we’ll show how to leverage the Repos API functionality to implement a full CI/CD lifecycle on Databricks and extend it to a full-blown MLOps solution.

Click through for the article and a link to code. You can also see the pipeline YAML (and Python code it calls) in the repo.
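
Not the authors' code, but to give a flavor of the Repos API piece (the workspace URL, token, repo ID, and branch name are all placeholders): the CD side boils down to a PATCH that moves a workspace checkout to the branch the release pipeline has just validated.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = "<personal-access-token or AAD token>"          # placeholder
REPO_ID = "<numeric-repo-id>"                           # placeholder, from GET /api/2.0/repos

# An Azure DevOps release stage can call this after tests pass, so the
# production Repos checkout is moved to the released branch.
resp = requests.patch(
    f"{HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "releases/v1.2"},   # hypothetical branch name
)
resp.raise_for_status()
print(resp.json())   # the repo's new branch and head commit
```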

Spark Performance Improvements in Azure Synapse

Balaji Sankaran shows improvements Microsoft has made over open-source Apache Spark 3 in Azure Synapse Analytics:

Azure Synapse Analytics is continually focused on delivering a highly performant and scalable platform for supporting Spark workloads. We are focused on improving the query performance for the typical workload patterns that we see with our customers. By combining the latest open-source updates in Apache Spark with our team’s focus on performance updates, we have made significant performance gains in standard TPC-DS benchmarking tests.

I expect it will never be as fast as what Databricks can do, but getting a 2x performance improvement over the open source version of Spark is nothing to sneeze at.

Where Kafka Connect Fits

Shivani Sarthi explains the value of Kafka Connect:

Kafka Connect is not just a free, open-source component of Apache Kafka; it also works as a centralised data hub for simple data integration between databases, key-value stores, etc. The fundamental components include:

– Connectors

– Tasks

– Workers

– Converters

– Transforms

– Dead Letter Queue

Moreover, it is a framework to stream data in and out of Apache Kafka. In addition, the Confluent Platform comes with many built-in connectors used for streaming data to and from different data sources.

Click through for information on each component.
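
To make the component list a bit more concrete (not from the linked post; the connector class, database details, and topic prefix are placeholders): you register a connector by posting its config to the Connect REST API, and the workers then split the work into tasks.

```python
import requests

# A hypothetical JDBC source connector registered against a Connect worker
# running on localhost:8083; the workers split the work into tasks.max tasks.
connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "2",
        "connection.url": "jdbc:postgresql://db:5432/shop",   # placeholder
        "connection.user": "kafka",
        "connection.password": "<secret>",
        "mode": "incrementing",
        "incrementing.column.name": "order_id",
        "topic.prefix": "shop-",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())   # the created connector and its (initially empty) task list
```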

Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone who has contributed towards a Data Warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a Fact Table and the categorised attributes of a Dimension Table. These dimensions are also affected by the passage of time and require revised descriptions periodically, which is why they are known as Slowly Changing Dimensions (SCDs). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where Delta Lake comes in. Using its many features, such as support for ACID transactions (Atomicity, Consistency, Isolation and Durability) and schema enforcement, we can create the same durable SCDs. In the past, this may have required a series of complicated SQL statements. I will now discuss a few of the most common SCDs and show how they can be easily achieved using a few Databricks notebooks, which are available from my GitHub repo so you can download them and have a go:

https://github.com/cwilliams87/Blog-SCDs

Check out the repo, but be sure to read the whole post.
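
For contrast with the Type 2 sketch above (again using placeholder dim_customer and staging table names rather than the notebooks from the repo): a Type 1 dimension keeps no history, so it maps directly onto a single Delta MERGE that overwrites matched rows and inserts new keys.

```python
from delta.tables import DeltaTable

# Assumes the ambient SparkSession (spark) of a Databricks notebook.
# Overwrite changed attributes, insert brand-new keys, keep no history --
# the essence of a Type 1 SCD.
dim = DeltaTable.forName(spark, "dim_customer")        # placeholder dimension table
updates = spark.table("staging_customer_updates")      # placeholder staging data

(dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # latest values replace the old ones
    .whenNotMatchedInsertAll()   # unseen keys become new dimension rows
    .execute())
```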
