Category: Hadoop

Creating Your First PySpark Application

Dustin Vannoy gives us a primer on Apache Spark:

Get hands-on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset, which can be found on Databricks or Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own PySpark application. Pro tip: Search for the Spark equivalent of functions you use in other programming languages (including SQL). Many will exist in the pyspark.sql.functions module.

In addition to the code listing, Dustin has a video walking us through the process.
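
To give a feel for the shape of the pipeline before you press play, here is a minimal read-transform-write sketch against a local CSV copy of the yellow taxi data. The path, column names, and filter are my assumptions, not necessarily what Dustin uses in the video.

    # Minimal read-transform-write pipeline (hypothetical paths and columns).
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("nyc-taxi-pipeline").getOrCreate()

    # Read: load the raw CSV with a header row and inferred types.
    df = spark.read.csv("/data/nyc_taxi/yellow_tripdata.csv",
                        header=True, inferSchema=True)

    # Transform: derive a trip-duration column and keep plausible trips only.
    df = (df
          .withColumn("trip_minutes",
                      (F.unix_timestamp("tpep_dropoff_datetime")
                       - F.unix_timestamp("tpep_pickup_datetime")) / 60)
          .filter(F.col("trip_minutes") > 0))

    # Write: persist the curated result as Parquet.
    df.write.mode("overwrite").parquet("/data/nyc_taxi/curated")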

Understanding the Kafka Partitioner

Bill Bejeck talks partitions:

Apache Kafka is the de facto standard for event streaming today. Part of what makes Kafka so successful is its ability to handle tremendous volumes of data, with throughput of millions of records per second not unheard of in production environments. One part of Kafka’s design that makes this possible is partitioning.

Kafka uses partitions to spread the load of data across brokers in a cluster, and partitions are also the unit of parallelism: more partitions mean higher throughput. Since Kafka works with key-value pairs, getting records with the same key onto the same partition is essential.

Read on to learn a bit about how that partitioning works and why it’s important for application design, especially across multiple programming languages.
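
For a feel of the mechanics: the Java client’s default partitioner murmur2-hashes the record key and takes the result modulo the partition count (other clients, like librdkafka, default to different hash functions, which is exactly the cross-language wrinkle Bill gets into). The toy sketch below substitutes crc32 for murmur2 purely to show the deterministic key-to-partition mapping; don’t expect it to match a real cluster’s assignments.

    # Sketch of key-based partitioning: same key -> same partition.
    # Kafka's Java client uses murmur2; zlib.crc32 stands in here just to
    # illustrate hash-mod-N, not to reproduce real partition assignments.
    import zlib

    def pick_partition(key: bytes, num_partitions: int) -> int:
        return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions

    for key in [b"user-42", b"user-42", b"user-99"]:
        print(key.decode(), "->", pick_partition(key, 6))
    # user-42 hashes to the same partition every time, so its records
    # are appended (and later read) in order.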

Common Challenges Implementing PySpark Code

Amlan Patnaik looks at some common implementation problems:

PySpark has become one of the most popular tools for data processing and data engineering applications. It is a fast and efficient tool that can handle large volumes of data and provide scalable data processing capabilities. However, PySpark applications also come with their own set of challenges that data engineers face on a day-to-day basis. In this article, we will discuss some of the common challenges faced by data engineers in PySpark applications and the possible solutions to overcome these challenges.

Read on for five such challenges.
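
Whether or not it made Amlan’s list, data skew is a perennial PySpark headache, and key salting is a standard mitigation. Here’s a rough sketch with invented table and column names:

    # Key-salting sketch for a skewed join: spread hot keys across N buckets.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    N = 8  # number of salt buckets; tune to the degree of skew

    facts = spark.table("facts")      # large table, skewed on customer_id
    dims = spark.table("customers")   # smaller dimension table

    # Salt the big side randomly; replicate the small side across all salts.
    facts_salted = facts.withColumn("salt", (F.rand() * N).cast("int"))
    dims_salted = dims.crossJoin(
        spark.range(N).withColumnRenamed("id", "salt"))

    joined = (facts_salted
              .join(dims_salted, ["customer_id", "salt"])
              .drop("salt"))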

Optimizing Kafka Infrastructure Costs

Addison Huddy saves some money:

In this first blog, we’re going to run through the infrastructure costs of running Kafka—i.e., compute, storage, networking, and the additional tooling you need to keep Kafka up and running smoothly. We won’t bury the lede—if you’re running Kafka in the cloud across multiple AZs (as most do for high availability), networking likely represents over 50% of your Kafka infrastructure costs. Let’s see how this ends up being the case.

Click through for some thoughts on how to reduce network costs, using AWS as an example.
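
Two client-side levers that commonly shrink that cross-AZ bill: compress producer batches so fewer bytes cross AZ boundaries, and use KIP-392 follower fetching so consumers read from a replica in their own AZ. A sketch with the confluent-kafka Python client (broker addresses and rack IDs are invented, and the article may emphasize different techniques):

    from confluent_kafka import Consumer, Producer

    # 1. Compress batches on the producer: fewer bytes cross AZ boundaries.
    producer = Producer({
        "bootstrap.servers": "broker:9092",
        "compression.type": "lz4",  # zstd is another common choice
        "linger.ms": 20,            # small delay -> larger, better-compressing batches
    })

    # 2. KIP-392: let consumers fetch from a replica in their own AZ.
    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "cost-aware-app",
        "client.rack": "use1-az1",  # must match a broker.rack for this to kick in
    })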

Kafka Topics and Message Ordering

Francesco Tisiot calls us to order:

One of Apache Kafka’s most known mantras is “it preserves the message ordering per topic-partition,” but is it always true? In this blog post, we’ll analyze a few real scenarios where accepting the dogma without questioning it could result in unexpected and erroneous sequences of messages.

Click through for a dive into what can go wrong with ordering. The good news is that in most cases, exact ordering isn’t critical. For the cases in which it is, you’re trading away some throughput for increased order integrity.
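
That trade-off shows up directly in producer configuration. As a hedged sketch with the confluent-kafka Python client, an idempotent producer keeps per-partition order even across retries:

    # Producer settings that preserve per-partition ordering across retries.
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "broker:9092",
        "enable.idempotence": True,  # broker de-duplicates retried batches
        "acks": "all",               # required (and implied) by idempotence
    })

    # Records sharing a key land on the same partition, in send order.
    for i in range(3):
        producer.produce("orders", key=b"customer-42",
                         value=f"event-{i}".encode())
    producer.flush()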

An Overview of the Kappa Architecture

Amlan Patnaik provides an overview:

The Kappa Architecture, introduced by Jay Kreps, co-founder of Confluent, is designed to handle real-time data processing in a scalable and efficient manner. Unlike the traditional Lambda Architecture, which separates data processing into batch and stream processing, the Kappa Architecture promotes a single pipeline for both batch and stream processing, eliminating the need for maintaining separate processing pipelines.

What’s interesting to me is that Lambda, an architecture that was an explicit product of its time (in the sense that it was a compromise trying to do two things whose combination the limited hardware and tooling of the day couldn’t support in one system), is still thriving today. Kappa, meanwhile, isn’t an architectural style that people throw around much anymore, at least in the circles I run in.
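
For reference, the “single pipeline” idea usually looks like a streaming job that treats reprocessing as replaying the log from the beginning. A sketch in PySpark Structured Streaming, assuming the Kafka connector is on the classpath and with invented topic and path names:

    # One pipeline for everything: "batch" reprocessing is just replaying
    # the log from the earliest offset, not a separate codebase.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kappa-pipeline").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .option("startingOffsets", "earliest")  # replay == reprocess
              .load())

    query = (events.selectExpr("CAST(value AS STRING) AS payload")
             .writeStream
             .format("parquet")
             .option("path", "/lake/events")
             .option("checkpointLocation", "/lake/_checkpoints/events")
             .start())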

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use PySpark.

In my scenario, I exported multiple tables from SQL DB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.
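
The parallel-export piece can look something like the sketch below, which drives JDBC reads from a thread pool. The connection string, credentials, and table list are invented; Liliam’s post has the real version.

    # Export several SQL DB tables to a lake folder in parallel.
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"
    tables = ["dbo.Customers", "dbo.Orders", "dbo.OrderLines"]

    def export(table: str) -> None:
        (spark.read
         .format("jdbc")
         .option("url", jdbc_url)
         .option("dbtable", table)
         .option("user", "loader")
         .option("password", "<secret>")
         .load()
         .write.mode("overwrite")
         .parquet(f"/lake/raw/{table.replace('.', '/')}"))

    # Each thread submits its own Spark job, so the exports overlap.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(export, tables))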

Comparing HBase to Cassandra

The Instaclustr team performs a comparison:

Apache HBase® and Apache Cassandra® are both open source NoSQL databases well-equipped to handle incredible amounts of data, but that’s where the similarities end.

In this blog, discover the architectures powering these technologies, when and how to use them, and which option may prove to be the better choice for your operations.  

Click through for their overview of the two systems and recommendations on when to use which.
