Press "Enter" to skip to content

Category: Hadoop

sparklyr 1.5 Released

Yitao Li announces version 1.5 of sparklyr:

A large fraction of pull requests that went into the sparklyr 1.5 release were focused on making Spark dataframes work with various dplyr verbs in the same way that R dataframes do. The full list of dplyr-related bugs and feature requests that were resolved in sparklyr 1.5 can be found here.

In this section, we will showcase three new dplyr functionalities that were shipped with sparklyr 1.5.

Read on to learn more about this update. H/T R-Bloggers

Comments closed

Running Spark on Azure Kubernetes Service

Tsuyoshi Matsuzaki walks us through running Apache Spark on Azure Kubernetes Service:

Apache Spark officially includes Kubernetes support, so you can run a Spark job on your own Kubernetes cluster. (See here for the official documentation. Note that the Kubernetes scheduler is currently experimental.)
In Microsoft Azure especially, you can easily run Spark on cloud-managed Kubernetes, Azure Kubernetes Service (AKS).

In this post, I’ll show you a step-by-step tutorial for running Apache Spark on AKS. In this tutorial, artifacts such as source code, data, and container images are all protected by Azure credentials (keys).

Although managed services for Apache Spark, such as Azure Databricks, Azure Synapse Analytics, and Azure HDInsight, are the best place to run Spark workloads, you get a lot of flexibility by running them on managed Kubernetes (AKS): spot VM support, the ability to start and stop the cluster, confidential computing (Intel SGX) support, and so on.

Read on to see how. Of these options, though, I’d probably choose Azure Databricks or Azure Synapse Analytics well before the others.
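Still, if you want to kick the tires, here is a minimal sketch of pointing a PySpark session at an AKS cluster. Everything in angle brackets (the API server address, the container image in ACR) is a placeholder for your own environment; the post itself goes through the spark-submit route in full detail.

    # Minimal sketch: ask an AKS cluster to host the Spark executors.
    # The API server address, image, namespace, and service account are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://<your-aks-api-server>:443")
        .appName("spark-on-aks-smoke-test")
        .config("spark.kubernetes.container.image", "<registry>.azurecr.io/spark-py:v3.0.1")
        .config("spark.kubernetes.namespace", "spark")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .config("spark.executor.instances", "2")
        .getOrCreate()
    )

    # A trivial job to confirm executor pods come up on AKS.
    print(spark.range(1_000_000).selectExpr("sum(id) as total").collect())
    spark.stop()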

Comments closed

Using Koalas with Azure Databricks

Tomaz Kastrun continues a series on Azure Databricks:

So far we have looked into SQL, R, and Python; this post will be about the Koalas package, an implementation of the pandas DataFrame API on Apache Spark. Data engineers and data scientists love pandas, since it makes data preparation easier, faster, and more productive, and Koalas is a direct “response” meant to make writing code on Spark easier and more familiar. Also see the official documentation for a full description of the package.

Click through for a quick demo.
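To give a flavor of what Koalas looks like, here is a small sketch you could paste into a Databricks notebook (or anywhere with the koalas package pip-installed): pandas-style syntax, Spark execution. The data is made up for illustration.

    # Pandas-style syntax, executed on Spark via Koalas.
    import pandas as pd
    import databricks.koalas as ks

    pdf = pd.DataFrame({"city": ["Ljubljana", "Maribor", "Ljubljana"],
                        "sales": [100, 80, 120]})

    kdf = ks.from_pandas(pdf)           # distributed DataFrame backed by Spark
    print(kdf.groupby("city").sum())    # familiar pandas-style aggregation

    sdf = kdf.to_spark()                # drop down to a plain Spark DataFrame when needed
    sdf.show()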

Comments closed

Running Kafka on Windows (via WSL2)

Jim Galasyn shows how you can try out Apache Kafka on Windows:

Is Windows your favorite development environment? Do you want to run Apache Kafka® on Windows? Thanks to the Windows Subsystem for Linux 2 (WSL 2), now you can, and with fewer tears than in the past. Windows still isn’t the recommended platform for running Kafka with production workloads, but for trying out Kafka, it works just fine. Let’s take a look at how it’s done.

You can also get Kafka to run natively on Windows, though there are bugs around file handling, to the point where if you restart your machine while the Kafka service is running, data in partitions may become permanently inaccessible and force you to delete it before you can start Kafka again. So yeah, it’s better to use WSL or Docker containers for trying out Kafka on Windows machines.
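Once the broker is running inside WSL 2, you can talk to it from the Windows side like any other localhost service. Here is a tiny smoke test using the kafka-python package; the topic name is a placeholder and I’m assuming the broker listens on the default localhost:9092.

    # pip install kafka-python
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("quickstart", b"hello from Windows")   # "quickstart" is a placeholder topic
    producer.flush()

    consumer = KafkaConsumer(
        "quickstart",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,        # stop polling after 5 seconds of silence
    )
    for message in consumer:
        print(message.value.decode("utf-8"))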

Comments closed

Moving Away from the Lambda Architecture

Xiang Zhang and Jingyu Zhu talk about migrating a project away from the Lambda architecture:

The Lambda architecture has become a popular architectural style that promises both speed and accuracy in data processing by using a hybrid approach of both batch processing and stream processing methods. But it also has some drawbacks, such as complexity and additional development/operational overheads. One of our features for Premium members on LinkedIn, Who Viewed Your Profile (WVYP), relied on a Lambda architecture for some time. The backend system supporting this feature had gone through a few architectural iterations in the past years: it started as a Kafka client processing a single Kafka topic, and eventually evolved to a Lambda architecture with more complicated processing logic. However, in an effort to pursue faster product iteration and lower operational overheads, we recently underwent a transition to make it Lambda-less. In this blog post, we’ll share some of the lessons learned in operating this system in the Lambda architecture, the decisions made in transitioning to Lambda-less, and the shifts necessary to undergo this transition.

When Lambda was first proposed back in 2015, it was intended as a compromise architecture trying to solve several important problems with the tools available in 2015 (well, 2013 and 2014—it was in a book, after all). I could definitely see the architecture fall into disuse within the next decade, not because it was at all bad, but because the world around it changed to the point that there is a better compromise available.

Comments closed

Apache Flink 1.12.0 Released

Marta Paes and Aljoscha Krettek announce a new release of Apache Flink:

– The community has added support for efficient batch execution in the DataStream API. This is the next major milestone towards achieving a truly unified runtime for both batch and stream processing.

– Kubernetes-based High Availability (HA) was implemented as an alternative to ZooKeeper for highly available production setups.

– The Kafka SQL connector has been extended to work in upsert mode, supported by the ability to handle connector metadata in SQL DDL. Temporal table joins can now also be fully expressed in SQL, no longer depending on the Table API.

– Support for the DataStream API in PyFlink expands its usage to more complex scenarios that require fine-grained control over state and time, and it’s now possible to deploy PyFlink jobs natively on Kubernetes.

Read on for more details on these as well as other changes.
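That last item is worth a quick illustration. Here is a minimal sketch of the PyFlink DataStream API, assuming a local installation of apache-flink 1.12; the exact API surface may differ slightly between releases.

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Build a small stream, transform it with a Python lambda, and print the results.
    numbers = env.from_collection([1, 2, 3, 4, 5], type_info=Types.INT())
    numbers.map(lambda x: x * x, output_type=Types.INT()).print()

    env.execute("pyflink datastream demo")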

Comments closed

Ignoring Bad Dates when Moving to Spark 3

Robert Blackburn shows us one way to handle bad dates when moving to Spark 3:

Moving from a Spark 2 to a Spark 3 runtime has a lot of benefits, including big performance improvements through adaptive query execution, dynamic partition pruning, and other optimizations. Some updates may require you to refactor your code. One of them is that Delta tables now use the Proleptic Gregorian calendar. Isn’t a calendar a calendar? Unfortunately, no. The Julian calendar has discrepancies with old dates: specifically, dates before 1582 and timestamps before 1900. Here we will dynamically update these dates for incoming source files.

If you would like to follow along in detail, I have a sample notebook that uses the community edition of Databricks. The DBC Archive file is here and the source file is here.

Fortunately, this change is unlikely to affect most of us, with perhaps the most common issue being that you used 0001-01-01 as a default date.
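For reference, Spark 3 also exposes configuration switches to rebase old dates when reading or writing Parquet, which is the blunt-instrument alternative to fixing the data itself. This is general Spark 3 behavior rather than the specific approach in Robert’s notebook, and the path and column below are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Rebase pre-1582 dates / pre-1900 timestamps instead of failing with an exception.
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")   # or "LEGACY"
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

    df = spark.read.parquet("/mnt/source/legacy_dates/")     # hypothetical path
    df.where("event_date < '1582-10-15'").show()             # hypothetical column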

Comments closed

Using Notebooks to Load Data into the Databricks File System

Tomaz Kastrun is putting together an Advent of Azure Databricks:

Yesterday we started working towards data import and how to use the drop zone to import data to DBFS. We also created our first notebook, and that is where I would like to start today: with a light introduction to notebooks.

Read on for a depiction of notebooks, as well as an example which loads data into the Databricks File System (DBFS).
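The DBFS round trip from a notebook cell looks something like this; dbutils and spark come for free inside a Databricks notebook, and the file name is a placeholder for whatever landed in your drop zone.

    # List what the drop zone uploaded, load it, and persist it back to DBFS.
    display(dbutils.fs.ls("/FileStore/tables"))

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("dbfs:/FileStore/tables/sample_data.csv"))   # placeholder file name

    df.write.mode("overwrite").parquet("dbfs:/tmp/sample_data_parquet")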

Comments closed

Joining Data Streams in Flink

Kundan Kumarr crosses the streams:

Apache Flink offers a rich set of APIs and operators which make Flink application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window.

Click through for an example of the fluent API approach. It’s not as nice as proper SQL, but it does the job.
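If the fluent API looks opaque, here is a plain-Python sketch (not Flink code) of what a tumbling window join actually does: two events pair up when they share a key and fall into the same fixed-size window.

    from collections import defaultdict

    def tumbling_window_join(left, right, window_size):
        # Bucket each stream's (key, value, timestamp) events by window, then by key.
        buckets = defaultdict(lambda: (defaultdict(list), defaultdict(list)))
        for key, value, ts in left:
            buckets[ts // window_size][0][key].append(value)
        for key, value, ts in right:
            buckets[ts // window_size][1][key].append(value)
        # Emit the cross product of matching keys within each window.
        for window, (l, r) in sorted(buckets.items()):
            for key in l.keys() & r.keys():
                for lv in l[key]:
                    for rv in r[key]:
                        yield window, key, lv, rv

    clicks = [("user1", "click", 3), ("user2", "click", 12)]
    views  = [("user1", "view", 5),  ("user2", "view", 28)]
    print(list(tumbling_window_join(clicks, views, window_size=10)))
    # [(0, 'user1', 'click', 'view')] -- user2's events land in different windows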

Comments closed