Hadoop – Page 25 – Curated SQL

SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.

I look forward to dealing with cardinality issues and performance tuning these things in 5 years.

Comments closed

Session Windows in Spark Structured Streaming

Published 2021-10-19 by Kevin Feasel

Jungtaek Lim, et al, announce support for session windows in Spark Structured Streaming:

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.
Sliding windows are similar to the tumbling windows from the point of being “fixed-sized”, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in this case, an input can be bound to the multiple windows.
Session windows have a different characteristic compared to the previous two types. Session window has a dynamic size of the window length, depending on the inputs. A session window starts with an input and expands itself if the following input has been received within the gap duration. A session window closes when there’s no input received within the gap duration after receiving the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).

Click through for more details. You could implement session windows when querying existing data using a gaps and islands approach (where you increment the island count when you have a lagged difference greater than the cutoff point), but for streaming scenarios, it’s very handy to have this as a native window type.

Comments closed

Contrasting Kafka with Azure Service Bus

Published 2021-10-18 by Kevin Feasel

Ritam Das explains the differences between Apache Kafka and Azure Service Bus:

It is important to note that Azure Service Bus is a traditional message broker and tailored to somewhat different use cases when compared to Kafka. Simply transferring between these two technologies is not an easy feat and would require overhauling your entire application. The comparison stops at both technologies being message brokers as under the hood they are fundamentally different.
At a high level, ASB has high processing overhead per message, stronger guarantees around delivery and processing, and typically a “process once” model. Kafka has low overhead processing per message, fewer guarantees around delivery and processing, and typically a “publish once, process multiple times” model. To provide an explicit comparison, it would be best to understand the intended use case and proceed from there.

Read on to understand the best uses for each technology, as well as sample calls using Python.

Comments closed

Creating Delta Lake Tables in Azure Databricks

Published 2021-10-18 by Kevin Feasel

Gauri Mahajan takes us through creating new tables in a Delta Lake using Azure Databricks:

Delta lake is an open-source data format that provides ACID transactions, data reliability, query performance, data caching and indexing, and many other benefits. Delta lake can be thought of as an extension of existing data lakes and can be configured per the data requirements. Azure Databricks has a delta engine as one of the core components that facilitates delta lake format for data engineering and performance. Delta lake format is used to create modern data lake or lakehouse architectures. It is also used to build a combined streaming and batch architecture popularly known as lambda architecture.

Click through for the process.

Comments closed

Architecting a Jenkins Replacement

Published 2021-10-18 by Kevin Feasel

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databircks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps understand some of the “why” behind the solution.

Comments closed

Execution Plans in Databricks

Published 2021-10-13 by Kevin Feasel

Falek Miah takes us through the process of creating execution plans in Databricks:

The execution plans in Databricks allows you to understand how code will actually get executed across a cluster and is useful for optimising queries.
It translates operations into optimized logical and physical plans and shows what operations are going to be executed and sent to the Spark Executors.

Read on for more information.

Comments closed

Databricks Integration with Git Repos

Published 2021-10-11 by Kevin Feasel

Ka-Hing Chueng and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.
Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Comments closed

New in SQL Server Big Data Clusters

Published 2021-10-07 by Kevin Feasel

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.
Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.

Comments closed

pyspark.pandas in Apache Spark 3.2

Published 2021-10-06 by Kevin Feasel

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.
A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).
pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:

Click through to see more details on the change.

Comments closed

foreach() vs foreachPartition() in Spark

Published 2021-10-06 by Kevin Feasel

The Hadoop in Real World team contrasts two functions:

foreach() and foreachPartition() are action function and not transform function. Both functions, since they are actions, they don’t return a RDD back.

Read on for the big difference between the two.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop

SQL User-Defined Functions in Spark SQL

Session Windows in Spark Structured Streaming

Contrasting Kafka with Azure Service Bus

Creating Delta Lake Tables in Azure Databricks

Architecting a Jenkins Replacement

Execution Plans in Databricks

Databricks Integration with Git Repos

New in SQL Server Big Data Clusters

pyspark.pandas in Apache Spark 3.2

foreach() vs foreachPartition() in Spark