Press "Enter" to skip to content

Category: Hadoop

An Overview of Kafka Streams

The Instaclustr team explains how stream processing works in Kafka Streams:

Kafka Streams is a client library providing organizations with a particularly efficient framework for processing streaming data. It offers a streamlined method for creating applications and microservices that must process data in real-time to be effective. Using the Streams API within Apache Kafka, the solution fundamentally transforms input Kafka topics into output Kafka topics. The benefits are important: Kafka Streams pairs the ease of utilizing standard Java and Scala application code on the client end with the strength of Kafka’s robust server-side cluster architecture.

Read on for an overview of how it works. And if you haven’t already, check out the prior post on Kafka so that you can experience the same slight mental perturbations I did when reading about “real-time” responses.
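To make the topics-in, topics-out idea concrete, here is a minimal sketch of what a Kafka Streams topology looks like in the Scala DSL. The topic names and the upper-casing transformation are placeholders of my own, not anything from the Instaclustr post:

```scala
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object UppercaseStream extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("input-topic")   // consume from a source topic
    .mapValues(_.toUpperCase)                // transform each record's value
    .to("output-topic")                      // produce to a sink topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

In a real application, the application ID and bootstrap servers would come from configuration and the serdes would match whatever schema your topics actually carry, but the shape stays the same: build a topology from source topics, apply transformations, write to sink topics.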


Databricks Power Tools in VS Code

Gerhard Brueckl has some tools for us:

As you probably know, we at paiqo have developed our Databricks extension for VSCode over the last years and are constantly adding new features and improving user experience. The most notable features are probably the execution of local notebooks against a Databricks cluster, a nice UI to manage clusters, jobs, secrets, repos, etc. and last but not least also a browser for your workspace and DBFS to sync files locally.

In February 2023, Databricks also published its own official VSCode extension, which was definitely long awaited by a lot of customers (blog | extension). It allows you to run a local file on a Databricks cluster and display the results back in VSCode. Alternatively, you can also run the code as a workflow. I am sure we can expect many more features in the near future, and Databricks investing in local IDE support is already a great step forward!

As you can imagine, I am working very closely with the people at Databricks and we are happy to also announce the next major release of our Databricks VSCode extension 2.0 which now also integrates with the official Databricks extension! To avoid confusion between the two extensions we also renamed ours to Databricks Power Tools so from now on you will see two Databricks icons on the very left bar in VSCode.

Click through to read more in the announcement and some of the things which have changed as a result of version 2.0.


Real-Time Data Streaming and Apache Kafka

Kai Waehner explains how Apache Kafka is not real-time:

Real-time data beats slow data. It is that easy! But what is real-time? The term always needs to be defined when discussing a use case. Apache Kafka is the de facto standard for real-time data streaming. Kafka is good enough for almost all real-time scenarios. But dedicated proprietary software is required for niche use cases. Kafka is NOT the right choice if you need microsecond latency! This article explores the architecture of NASDAQ that combines critical stock exchange trading with low-latency streaming analytics.

Kai uses the much more appropriate term “near real-time,” which I agree with. My mental example of “real-time” is software that you’d put on a fighter jet (which was an actual example in my undergrad days of a real-time operating system). If people potentially die because your software takes 4 milliseconds to do a job it needs to do in 100 microseconds, that’s real-time. For most of us, near real-time is certainly enough.

Actually, I’d go one step further: for most of us, not-really-real-time is fine. So many cases of “The users need this data in real time!” boil down to “The users really only look at this once a day, couldn’t act on faster information anyway, and some of our data sources only update once a day.” Swap ‘once a day’ for ‘once an hour’ or something like that and you have the large majority of projects which started out with “near real-time” requirements.


The Legacy of Big Data

Adam Bellemare looks back:

Big Data was going to change the way everything worked. We were about to solve every financial, medical, scientific, and social problem known to humankind. All it would take was a great big pile of data and some way to process it all. 

But somewhere along the line, the big data revolution just sort of petered out, and today you barely hear anything about big data. 

Click through for Adam’s explanation, which is a more detailed form of “Some stuff worked out and became ubiquitous in other ways; others fell off the map.”

But I’m going to snag one more quotation here from Adam:

And finally, big data has shown us that no matter how hard we try, there’s simply no escaping from the inevitable convergence to a full SQL API.

Me: Laughs in Feasel’s Law.

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.


Spark Application Dependency Caching

Shu Wang, Biao He, and Minchu Yang talk turkey about dependencies:

In this blog post, we will share our analysis of Spark Dependency Management at LinkedIn, highlight interesting findings, and present our design choice of using a simple user-level cache over more complex alternatives. We will also discuss our rollout experience and lessons learned. Finally, we will demonstrate the impact of accelerating all Spark applications at LinkedIn at the cluster level. Faster Spark jobs translate to increased data freshness, leading to an enhanced member experience by way of more relevant recommendations, timely insights, effective abuse protection, and other similar improvements.

If you work with Spark to any serious extent, you’ll want to read this post.


Working with Kafka from Python

Dave Shook has a new course for us:

If you’re a Python developer, our free Apache Kafka for Python Developers course will show you how to harness the power of Kafka in your applications. You will learn how to build Kafka producer and consumer applications, how to work with event schemas and take advantage of Confluent Schema Registry, and more. Follow along in each module as Dave Klein, Senior Developer Advocate at Confluent, covers all of these topics in detail. Hands-on exercises occur throughout the course to solidify concepts as they are presented. At its end, you will have the knowledge you need to begin developing Python applications that stream data to and from Kafka clusters.

Read on to learn more about it and give it a try.


Unit Testing Spark Notebooks in Synapse

Arun Sethia grabs the oscilloscope:

In this blog post, we will cover how to test and create unit test cases for Spark jobs developed using Synapse Notebook. This is an extension of my previous blog, Synapse – Choosing Between Spark Notebook vs Spark Job Definition, where we discussed selecting between Spark Notebook and Spark Job Definition. Unit testing is an automated approach that developers use to test individual self-contained code units. By verifying code behavior early, it helps to streamline coding practices for larger systems.

Arun covers three major use cases: when your code is in an external library, when it is in a separate notebook, and when it is in the same notebook.
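For the external-library case in particular, the test often doesn’t need Synapse at all: if the transformation logic is factored into a plain function, a local SparkSession is enough to exercise it. Here’s a rough sketch of that pattern; the Transforms object, column names, and test framework are my own assumptions, not taken from Arun’s post:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test: keep only the rows flagged as active.
object Transforms {
  def activeOnly(df: DataFrame): DataFrame = df.filter(col("active") === true)
}

class TransformsSpec extends AnyFunSuite {
  // A local SparkSession is enough for unit tests; no Synapse Spark pool required.
  private val spark = SparkSession.builder()
    .master("local[2]")
    .appName("transforms-test")
    .getOrCreate()

  import spark.implicits._

  test("activeOnly keeps only the rows flagged as active") {
    val input  = Seq(("alice", true), ("bob", false)).toDF("name", "active")
    val output = Transforms.activeOnly(input).as[(String, Boolean)].collect().toSeq

    assert(output == Seq(("alice", true)))
  }
}
```

The separate-notebook and same-notebook cases Arun describes take more plumbing, since the code under test isn’t packaged as an importable unit, which is exactly why his post is worth the read.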


Tips for Kafka Streams Developers

Ludovic Dehon shares some advice:

We built Kestra, an open-source data orchestration and scheduling platform, and we decided to use Kafka as the central datastore to build a scalable architecture. We rely heavily on Kafka Streams for most of our services (the executor and the scheduler) and have made some assumptions on how it handles the workload.

However, Kafka has some restrictions since it is not a database, so we need to deal with the constraints and adapt the code to make it work with Kafka. We will cover topics such as using the same Kafka topic for source and destination and creating a custom joiner for Kafka Streams, to ensure high throughput and low latency while adapting to the constraints of Kafka and making it work with Kestra.

Click through for several tips.
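The same-topic-as-source-and-destination trick and the custom joiner are specific to Kestra’s internals, but for a sense of where a joiner plugs into a topology, here is a bare-bones KStream-to-KTable join in the Kafka Streams Scala DSL. The topic names and the merge function are placeholders of mine, not Kestra’s actual code:

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object JoinTopology {
  // Builds the topology only; wiring it into a KafkaStreams instance with
  // Properties and streams.start() works the same as for any Streams app.
  def build(): Topology = {
    val builder = new StreamsBuilder()

    val executions = builder.stream[String, String]("executions")   // one record per execution event
    val taskState  = builder.table[String, String]("task-state")    // latest state per key

    executions
      .join(taskState)((execution, state) => s"$execution|$state")  // the joiner: how two values merge
      .to("executions-enriched")

    builder.build()
  }
}
```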


Kafka Control and Data Planes

Sanjay Garde explains how the architecture of Apache Kafka solutions has expanded over time:

With the advent of service mesh and containerized applications, the idea of the control and data plane has become popular. A part of your application infrastructure, such as a proxy or sidecar, is dedicated to aspects such as controlling traffic, access, governance, security, and monitoring, and is referred to as the control plane. Another part of your application infrastructure that is used purely for processing your business transactions is referred to as the data plane.

Read on to see how the concept works at an architectural level.
