Spark – Page 10 – Curated SQL

Microsoft Fabric Notebooks and Compute Limits

Published 2023-08-15 by Kevin Feasel

In this case, my notebook threw an error at me but the command seemed to finish without any issue. Sounds vague? It did to me. The notebookcell I tried to run had a lot of stuff happening at the same time.

As you can see in the above screenshot, the status shows green checkmarks but there’s an error as well. The error message was not really clear to me, but that can really be me lack of deep level experience. So, I logged a call with Microsoft Support and see what they could come up with.

I’ve had enough experience with Spark to see the issue and figure the response, but click through for the screenshot and what Reitse did to resolve the issue.

Comments closed

Creating a Simple Date Dimension in Databricks

Published 2023-08-10 by Kevin Feasel

Chen Hirsh builds a table:

A date dimension is extremely useful and is required by most BI applications. This kind of dimension has a key of time level (day, month, etc.), and attributes that describe it such as year, month, etc. In your BI model, you join this dimension to facts on their date fields, to aggregate from day level to week, month, and year.

In this post, I will demonstrate how to create a date dimension on Azure Databricks using Python. A link to the complete Databricks notebook is at the end of the post.

Check out the code, as well as explanation, in that post.

Comments closed

Contrasting Spark and Flink for Streaming Use Cases

Published 2023-07-31 by Kevin Feasel

Deepthi Mohan and Karthi Thyagarajan contrast two products:

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Read on for an analysis of the two products.

Comments closed

A Primer on Databricks Unity Catalog

Published 2023-07-03 by Kevin Feasel

Beginner’s Hadoop gives us an overview:

The Databricks Unity Catalog is a feature provided by Databricks Unified Data Analytics Platform that allows you to organize and manage metadata about your data assets, such as tables, databases, and views. It provides a centralized metadata repository that enables users to discover, understand, and collaborate on data assets within a Databricks environment. The Unity Catalog integrates with various data sources and supports different metadata management capabilities.

Read on for an overview of what it does.

Comments closed

Bring Fabric to the Data Lakehouse

Published 2023-06-29 by Kevin Feasel

Ust Oldfield ties together Databricks and Microsoft Fabric:

We’ve built countless Lakehouses for our customers and influenced the design of many more. With the advent of Fabric, many organisations with existing lakehouse implementations in Azure are wondering what changes Fabric will herald for them. Do they continue with their existing lakehouse implementation and design, or do they migrate entirely to Fabric?

For many, the answer will be to continue as-is. They’ve invested a lot of time and money in establishing a Lakehouse – to migrate now to a slightly different technology stack would be a very costly exercise! There also isn’t a need to migrate from a lakehouse implementation in Databricks to one in Fabric as there aren’t concrete benefits to be realised.

For those using Power BI as their semantic and reporting layers, as well as using Databricks SQL or Synapse Serverless as the serving layer, Fabric provides a perfect opportunity to rationalise the architecture and to bring about substantial performance gains through the Direct Lake connectivity and V-Order compression in Fabric.

Read on to see what Ust means, using a couple of architecture diagrams along the way.

Comments closed

Apache Doris and Data Colocation

Published 2023-06-29 by Kevin Feasel

Frank Z takes us through a use case for Apache Doris:

In data analytics, fast query performance is more of a result than a guarantee. What’s more important than the result itself is the architectural design and mechanism that enables quick performance. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.

The user, in this case, is an all-category Q&A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Based on their huge data size, the user demands quick data loading and quick response to queries.

This sounds a lot like sharding of the data, where you segregate data for a particular customer/entity into its own database (and possibly instance), with the exception that queries are expected to go over a number of shards rather than focus on a single one.

Comments closed

Read and Write Data with PySpark

Published 2023-06-22 by Kevin Feasel

Dustin Vannnoy has two of the three R’s down:

Every Spark pipeline involves reading data from a data source or table. For data engineers we usually end the pipelines by writing the transformed data. In this tutorial we walk through some of the most common format and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake capabilities for another tutorial.

Click through to see how to read from and write to CSV, JSON, and Parquet formats. Dustin has examples of working with Azure Blob Storage, S3, and Google Cloud Storage, and even some database examples with JDBC.

Comments closed

Creating a Spark UDF

Published 2023-06-13 by Kevin Feasel

The Big Data in Real World team creates a Spark user-defined function in Scala:

In this post we are going to create a Spark UDF which converts temperature from Fahrenheit to Celsius.

Here is our data. We have day and temperature in Fahrenheit.

And, of course, it’s roughly the same in PySpark. Also, note that user-defined function performance will take a hit, and that answers are fairly consistent through the years, so save these for when you need them.

Comments closed

Databricks SQL Performance Tuning

Published 2023-05-30 by Kevin Feasel

Katie Cummiskey provides some tips for us:

We previously discussed how to use Power BI on top of Databricks Lakehouse efficiently. However, the well-designed and efficient Lakehouse itself is the basement for overall performance and good user experience. We will discuss recommendations for physical layout of Delta tables, data modeling, as well as recommendations for Databricks SQL Warehouses.

These tips and techniques proved to be efficient based on our field experience. We hope you will find them relevant for your Lakehouse implementations too.

Read on for these tips.

Comments closed

Adding Count to a Grouped DataFrame in Spark

Published 2023-05-25 by Kevin Feasel

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

One important thing to remember about Spark transformations is that they’re lazy: just because you ran df.groupBy(...).agg(...) doesn’t mean the new DataFrame exists yet, so until you call the show() action (or whatever), the original data is still there for the taking, which is how you can reference it again later in the chained statement.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Spark