Press "Enter" to skip to content

Category: Hadoop

RDDs, DataFrames, and Datasets in Spark

Brad Llewellyn walks us through the three key data structures in Apache Spark:

We see that creating an RDD can be done with one easy function.  In this snippet, sc represents the default SparkContext.  This is extremely important, but is better left for a later post.  SparkContext offers the .textFile() function which creates an RDD from a text file, parsing each line into its own element in the RDD.  These lines happen to represent CSV records.  However, there are many other common examples that use lines of free text.

It should also be noted that we can use the .collect() and .take() functions to view the contents of an RDD.  The difference between .collect() and .take() is that .take() allows us to specify the number of elements we want to retrieve, whereas .collect() returns the entire RDD.
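
To make the quoted functions concrete, here is a minimal Scala sketch; the file path and the local master setting are assumptions for illustration, and in a Spark shell or notebook sc would already exist:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a SparkContext by hand; in spark-shell or a notebook, sc is provided for you.
val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
val sc = new SparkContext(conf)

// .textFile() parses each line of the file into its own element of the RDD.
val lines = sc.textFile("/tmp/orders.csv")  // hypothetical path

// .take(n) retrieves only the first n elements; .collect() returns the entire RDD to the driver.
lines.take(5).foreach(println)
val allLines = lines.collect()
println(s"Total lines: ${allLines.length}")
```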

My tendencies are probably skewed pretty heavily, but I live in DataFrames and almost never use raw RDDs anymore.


Flink’s State Processor API

Seth Wiesman and Fabian Hueske show off Apache Flink’s State Processor API:

The State Processor API that comes with Flink 1.9 is a true game-changer in how you can work with application state! In a nutshell, it extends the DataSet API with Input and OutputFormats to read and write savepoint or checkpoint data. Due to the interoperability of DataSet and Table API, you can even use relational Table API or SQL queries to analyze and process state data.

For example, you can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly. Or you can read a batch of data from any store, preprocess it, and write the result to a savepoint that you use to bootstrap the state of a streaming application. It’s also possible to fix inconsistent state entries now. Finally, the State Processor API opens up many ways to evolve a stateful application that were previously blocked by parameter and design choices that could not be changed without losing all the state of the application after it was started. For example, you can now arbitrarily modify the data types of states, adjust the maximum parallelism of operators, split or merge operator state, re-assign operator UIDs, and so on.
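
As a rough sketch of what that looks like in practice (the savepoint path, operator UID, and state name below are invented for illustration, so treat this as an approximation of the Flink 1.9 API rather than copy-paste code):

```scala
import org.apache.flink.api.common.typeinfo.Types
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.state.api.Savepoint

// Batch environment for the DataSet program that reads the savepoint.
val bEnv = ExecutionEnvironment.getExecutionEnvironment

// Load an existing savepoint; the path and state backend here are assumptions.
val savepoint = Savepoint.load(bEnv, "hdfs://namenode/flink/savepoints/savepoint-abc123", new MemoryStateBackend())

// Read a list state registered under an operator UID and state name (both hypothetical).
val counts = savepoint.readListState("my-operator-uid", "counts", Types.INT)

// From here it is a normal DataSet, so you can analyze or transform it like any batch data.
counts.print()
```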

Read on to learn more about how this works.


Derivative Event Sourcing

Anna McDonald explains the concept of derivative event sourcing:

If you happen to be the proud owner of a single order service, then you are all set to begin.

But what if you have more than one order service?

Something that tends to happen at companies that have been around for more than a sprint is the accumulation of technical debt. Sometimes that debt takes the form of duplicate applications. Mergers happen and you adopt other applications that, for reasons beyond your control, cannot be retired or rewritten right away. In other words, sometimes you end up with more than one order service—enter derivative event sourcing!

This is a nice article for real-life scenarios where you don’t get to build nice, well-designed services from scratch.


LSTM in Databricks

Vedant Jain shows us an example of solving a multivariate time series forecasting problem using LSTM networks:

LSTM is a type of Recurrent Neural Network (RNN) that allows the network to retain long-term dependencies at a given time from many timesteps before. RNNs were designed to that effect using a simple feedback approach for neurons where the output sequence of data serves as one of the inputs. However, long term dependencies can make the network untrainable due to the vanishing gradient problem. LSTM is designed precisely to solve that problem.

Sometimes accurate time series predictions depend on a combination of both old and recent data. We have to efficiently learn even what to pay attention to, accepting that there may be a long history of data to learn from. LSTMs combine simple DNN architectures with clever mechanisms to learn what parts of history to ‘remember’ and what to ‘forget’ over long periods. The ability of LSTMs to learn patterns in data over long sequences makes them suitable for time series forecasting.

This is a nice overview and as a bonus, there’s a notebook as well where you can try it on your own.


Databricks versus Mapping Data Flows

Helge Rege Gardsvoll contrasts Azure Databricks, Azure Data Factory Mapping Data Flows, and SQL Server Integration Services:

Mapping Data Flows
One of the many data flow offerings from Microsoft these days, and the first to provide data transformation capabilities within Data Factory. This is not a U-SQL script or Databricks notebook that is orchestrated from Data Factory, but a tool integrated into Data Factory itself. This means that you can reuse (many of) the datasets you have defined in Data Factory, which you cannot do in Databricks.

Mapping Data Flows runs on top of Databricks, but the cluster is handled for you and you don’t have to write any of that Scala code yourself.

Read on for the full comparison.


Develop BDC PySpark Jobs in Visual Studio Code

Jenny Jiang announces a new capability in Visual Studio Code:

With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run the current line, run selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform notebook-like queries including Run Cell, Run Above, or Run Below. You can also enjoy a notebook-like interactive experience that includes your source code and markdown comments along with the running results and output. You can remove the unneeded sections, enter comments, or type additional code in the interactive results window. Moreover, you can visualize your results in a graphic format through matplotlib, just like in a Jupyter notebook. The integration with SQL Server 2019 Big Data Clusters empowers you to quickly submit a PySpark batch job to the big data cluster and monitor job progress.

This is rather useful for developers, though I greatly prefer the Azure Data Studio notebook interface.


Tuning YARN

Dmitry Tolpeko helps us tune YARN settings:

Sometimes it may take a few iterations to find the proper container size, but usually it helps and the query succeeds.

But what if you set the container size to 4096 MB or 8192 MB when the query could complete successfully even with 2048 MB?

Read on to learn more.


PolyBase and Dockerized Hadoop

I have a solution to a problem which vexed me for quite some time:

Quite some time ago, I posted about PolyBase and the Hortonworks Data Platform 2.5 (and later) sandbox.

The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.

Click through for the solution.


Null Checks in Spark DataFrames

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”
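
As a small illustration of the idea (not the author's exact code), here is one way a null check might look over a DataFrame in Scala; the column name and input path are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("NullChecks").getOrCreate()
val df = spark.read.option("header", "true").csv("/tmp/input.csv")  // hypothetical source

// Count the rows where the column is null -- a simple validation metric.
val nullCount = df.filter(col("customer_id").isNull).count()

// Or split the data into valid and invalid subsets for downstream handling.
val validRows = df.filter(col("customer_id").isNotNull)
val invalidRows = df.filter(col("customer_id").isNull)

println(s"Rows failing the null check: $nullCount")
```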

Click through for those techniques.


Paired RDDs in Spark

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, I am assuming that you have a fair idea about what Spark is and the basics of RDDs. A paired RDD is one kind of RDD. These RDDs contain key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.
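
A minimal sketch of those two operations, assuming an existing SparkContext named sc and some made-up data:

```scala
// Pair RDD of (productId, saleAmount); reduceByKey() aggregates values per key.
val sales = sc.parallelize(Seq(("p1", 10.0), ("p2", 5.0), ("p1", 2.5)))
val totalsByProduct = sales.reduceByKey(_ + _)   // ("p1", 12.5), ("p2", 5.0)

// A second pair RDD keyed the same way; join() merges the two RDDs by key.
val names = sc.parallelize(Seq(("p1", "Widget"), ("p2", "Gadget")))
val joined = totalsByProduct.join(names)         // ("p1", (12.5, "Widget")), ("p2", (5.0, "Gadget"))

joined.collect().foreach(println)
```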

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.
