Press "Enter" to skip to content

Category: Hadoop

Flink 1.16.1 Release

Martijn Visser announces Apache Flink version 1.16.1:

The Apache Flink Community is pleased to announce the first bug fix release of the Flink 1.16 series.

This release includes 84 bug fixes, vulnerability fixes, and minor improvements for Flink 1.16. Below you will find a list of all bugfixes and improvements (excluding improvements to the build infrastructure and build stability). For a complete list of all changes see: JIRA.

We highly recommend all users upgrade to Flink 1.16.1.

Read on for the release notes, including links to all of the closed tickets.

Flink Table Store 0.3

Jingsong Lee announces a new version of Flink Table Store:

Sometimes users only care about aggregated results. The aggregation merge engine aggregates each value field with the latest data one by one under the same primary key according to the aggregate function.

Each field that is not part of the primary keys must be given an aggregate function, specified by the fields.<field-name>.aggregate-function table property.

Read on for the full changeset.
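
To make the property concrete, here is a minimal PyFlink sketch of what such a table definition might look like. The table, fields, and values are invented for illustration; the property names follow the Table Store documentation for the aggregation merge engine, and the catalog/warehouse setup is omitted.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Assumes the Flink Table Store connector is already on the classpath;
# the Table Store catalog configuration is omitted from this sketch.
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical table of per-product running aggregates. Every field outside
# the primary key gets its own fields.<field-name>.aggregate-function.
table_env.execute_sql("""
    CREATE TABLE product_totals (
        product_id  BIGINT,
        total_sales DOUBLE,
        max_price   DOUBLE,
        PRIMARY KEY (product_id) NOT ENFORCED
    ) WITH (
        'merge-engine' = 'aggregation',
        'fields.total_sales.aggregate-function' = 'sum',
        'fields.max_price.aggregate-function'   = 'max'
    )
""")
```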

Spark Structured Streaming with Synapse

Ryan Adams builds a demo:

In this post we are going to look at an example of streaming IoT temperature data in Synapse Spark. I have an IoT device that will stream temperature data from two sensors to IoT Hub. We’ll use Synapse Spark to process the data and finally write the output to persistent storage. Here is what our architecture will look like:

Click through for the architectural diagram and step-by-step on how to put the demo together.
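
For flavor, here is a hedged PySpark sketch of the read-process-write pattern (not Ryan’s exact code). It assumes the Azure Event Hubs Spark connector is attached to the pool, since IoT Hub exposes an Event Hubs-compatible endpoint; the payload schema, connection string, and storage paths are all placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.getOrCreate()

# Hypothetical sensor payload: {"sensor": "...", "temperature": 71.2, "eventTime": "..."}
schema = StructType([
    StructField("sensor", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

conn = "<IoT Hub Event Hubs-compatible connection string>"
eh_conf = {
    # The connector requires the connection string to be encrypted.
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The event body arrives as bytes; cast to string and parse the JSON.
temps = (raw
         .select(from_json(col("body").cast("string"), schema).alias("b"))
         .select("b.*"))

# Persist the parsed stream to the data lake as Delta.
query = (temps.writeStream
         .format("delta")
         .option("checkpointLocation",
                 "abfss://data@<account>.dfs.core.windows.net/checkpoints/iot")
         .start("abfss://data@<account>.dfs.core.windows.net/iot/temperature"))
```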

Parallel Loading in Spark Notebooks

Dustin Vannoy answers some questions:

I received many questions on my tutorial Ingest tables in parallel with an Apache Spark notebook using multithreading. In this video and post I address some of the questions that I couldn’t just answer in the YouTube comments. Watch the video for more complete answers, but here are quick responses with links to examples where appropriate.

Click through for the video and some text versions. Dustin includes examples for Synapse and Databricks.
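
The core pattern is a thread pool driving one shared Spark session. Here is a minimal sketch under invented table names and paths; the JDBC source is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tables = ["sales", "customers", "products", "orders"]

def ingest(table: str) -> str:
    # Each call runs on its own thread but shares the one SparkSession;
    # the Spark scheduler interleaves the resulting jobs across the pool.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://<server>;databaseName=<db>")
          .option("dbtable", table)
          .load())
    df.write.mode("overwrite").format("delta").save(f"/lake/bronze/{table}")
    return table

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(ingest, t) for t in tables]
    for f in as_completed(futures):
        print(f"finished {f.result()}")
```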

Executing Multiple Notebooks in one Spark Pool with Genie

Shalu Ganotra Chadra, et al., explain what Synapse Genie is:

The Genie framework is a metadata driven utility written in Python. It is implemented using threading (ThreadPoolExecutor module) and directed acyclic graph (Networkx library). It consists of a wrapper notebook, that reads metadata of notebooks and executes them within a single Spark session. Each notebook is invoked on a thread with MSSparkutils.run() command based on the available resources in the Spark pool. The dependencies between notebooks are understood and tracked through a directed acyclic graph.

Read on for more information about how you can use it and what the setup process looks like.
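
This isn’t the Genie source, but a rough sketch of the mechanism described above: a NetworkX DAG of notebooks executed generation by generation on a thread pool, each via mssparkutils.notebook.run() so everything shares one Spark session. The notebook names and dependencies are invented.

```python
from concurrent.futures import ThreadPoolExecutor
import networkx as nx
from notebookutils import mssparkutils  # available on Synapse Spark pools

# Hypothetical dependency graph: an edge (a, b) means b depends on a.
dag = nx.DiGraph()
dag.add_edges_from([
    ("load_raw", "clean"),
    ("clean", "aggregate"),
    ("load_dims", "aggregate"),
])

def run_notebook(name: str):
    # Runs the notebook inside the current session with a 3600-second timeout.
    mssparkutils.notebook.run(name, 3600)

# Notebooks within a topological generation have no mutual dependencies,
# so each generation can run on parallel threads.
for generation in nx.topological_generations(dag):
    with ThreadPoolExecutor(max_workers=len(generation)) as pool:
        list(pool.map(run_notebook, generation))
```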

Converting Spark RDDs to DataFrames and Datasets

Ashish Chaudhary does a bit of swapping around:

In this blog, we will be talking about Spark RDD, Dataframe, Datasets, and how we can transform RDD into Dataframes and Datasets.

At this point, most of the libraries I know of accept and produce DataFrames. Occasionally you might need to “downshift” to an RDD to work with some specialty library. But if you have one format and want to get to another, Ashish has you covered.
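
For reference, the two conversions look like this in PySpark (Datasets are JVM-only, so a Python sketch covers RDDs and DataFrames):

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# RDD -> DataFrame: give the rows structure and let Spark infer the schema.
rdd = spark.sparkContext.parallelize([Row(name="ada", score=95),
                                      Row(name="grace", score=98)])
df = spark.createDataFrame(rdd)

# DataFrame -> RDD: the .rdd property "downshifts" to an RDD of Rows,
# e.g. to feed a library that only speaks RDDs.
rows = df.rdd.map(lambda r: (r.name, r.score)).collect()
print(rows)  # [('ada', 95), ('grace', 98)]
```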

Join Types in Spark SQL

Rituraj Khare makes some connections:

In Apache Spark, we can use the following types of joins in SQL:

Inner join: An inner join in Apache Spark is a type of join that returns only the rows that match a given predicate in both tables. To perform an inner join in Spark using Scala, we can use the join method on a DataFrame.

The set of options is the same as you’d see in a relational database: inner, left outer, right outer, full outer, and cross. The examples here are in Scala, though they would apply just as easily to PySpark and, of course, to writing classic SQL statements.
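
For a PySpark flavor of the same idea, here is a quick sketch with throwaway DataFrames; the third argument to join selects the join type.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

emp = spark.createDataFrame([(1, "ada", 10), (2, "grace", 20), (3, "alan", 99)],
                            ["id", "name", "dept_id"])
dept = spark.createDataFrame([(10, "eng"), (20, "science"), (30, "ops")],
                             ["dept_id", "dept_name"])

emp.join(dept, "dept_id", "inner").show()        # only matching rows
emp.join(dept, "dept_id", "left_outer").show()   # all employees; null dept for alan
emp.join(dept, "dept_id", "right_outer").show()  # all departments
emp.join(dept, "dept_id", "full_outer").show()   # everything from both sides
emp.crossJoin(dept).show()                       # Cartesian product
```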

External Objects in Databricks Unity Catalog

Meagan Longoria adds external tables and views to an Azure Databricks Unity Catalog:

I’ve been busy defining objects in my Unity Catalog metastore to create a secure exploratory environment for analysts and data scientists. I’ve found a lack of examples for doing this in Azure with file types other than delta (maybe you’re reading this in the future and this is no longer a problem, but it was when I wrote this). So I wanted to get some more examples out there in case it helps others.

I’m not storing any data in Databricks – I’m leaving my data in the data lake and using Unity Catalog to put a tabular schema on top of it (hence the use of external tables vs. managed tables). In order to reference an ADLS account, you need to define a storage credential and an external location.

Read on for examples of what you can do with this.
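
As a rough sketch of the objects involved, run from a Databricks notebook (where spark is predefined): the names, paths, and credential below are placeholders, and the storage credential itself is typically created by an account admin beforehand.

```python
# Register a data lake path with Unity Catalog via an existing credential.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS lake_raw
    URL 'abfss://raw@<storageaccount>.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_adls_credential)
""")

# An external table: Unity Catalog stores only the schema; the CSV files
# stay where they are in the data lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.exploration.sensor_readings (
        sensor_id STRING,
        reading   DOUBLE,
        read_at   TIMESTAMP
    )
    USING CSV
    OPTIONS (header = 'true')
    LOCATION 'abfss://raw@<storageaccount>.dfs.core.windows.net/sensors/'
""")
```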

Apache Spark Performance Tuning Tips

Amit Kumar shares a few tips with us:

RDD does serialisation and de-serialisation of data whenever it distributes the data across clusters, such as during repartition and shuffle, and we all know that serialisation and de-serialisation are very expensive operations in Spark.

On the other hand, DataFrame stores the data as binary using off-heap storage, with no need for deserialization and serialization of data when it distributes to clusters. We see a big performance improvement in DataFrame over RDD.

Click through for several additional tips.
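
To make that first tip concrete, here is a small sketch running the same aggregation both ways; the keys and values are synthetic.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()
pairs = [(i % 100, float(i)) for i in range(100_000)]

# RDD route: every record is a Python object that gets pickled each time it
# crosses a shuffle boundary (here, reduceByKey).
rdd_totals = (spark.sparkContext.parallelize(pairs)
              .reduceByKey(lambda a, b: a + b)
              .collect())

# DataFrame route: rows live in Tungsten's off-heap binary format and the
# aggregation runs as generated JVM code, so no per-record serialization
# through Python is needed on the hot path.
df_totals = (spark.createDataFrame(pairs, ["key", "value"])
             .groupBy("key")
             .agg(sum_("value").alias("total"))
             .collect())
```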

Sharing Results between Notebooks with MSSparkUtils

Liliam Leme provides an answer to a common Synapse Spark pool question:

I’ve been reviewing customer questions centered around “Have I tried using MSSparkUtils to solve the problem?”

One of the questions asked was how to share results between notebooks. Every time you hit “run” in a notebook, it starts a new Spark cluster, which means that each notebook would be using a different session, making it impossible to share results between executions of notebooks. MSSparkUtils offers a solution to handle this exact scenario.

Read on to see what MSSparkUtils is and how it helps in this case.
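
As a rough sketch of one such pattern (notebook names invented): the child notebook exits with a value via mssparkutils.notebook.exit(), and the parent picks it up from mssparkutils.notebook.run(), all within the same session.

```python
from notebookutils import mssparkutils  # available on Synapse Spark pools

# --- in the child notebook, e.g. "compute_summary" ---
# count = spark.sql("SELECT COUNT(*) AS n FROM sales").first()["n"]
# mssparkutils.notebook.exit(str(count))

# --- in the parent notebook ---
# Runs the child in the current Spark session (600-second timeout) and
# captures whatever value it exited with.
result = mssparkutils.notebook.run("compute_summary", 600)
print(f"child notebook returned: {result}")
```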
