Press "Enter" to skip to content

Category: Hadoop

Spark DataFrames

Tomaz Kastrun continues a series on working with Apache Spark. Part 10 looks at the DataFrame construct:

We have looked in datasets and seen that a dataset is distributed collection of data. A dataset can be constructer from JVM object and later manipulated with transformation operations (e.g.: filter(), Map(),…). API for these datasets are also available in Scala and in Java. But in both cases of Python and R, you can also access the columns or rows from datasets.

On the other hand, dataframe is organised dataset with named columns. It offers much better optimizations and computations and still resembles a typical table (as we know it from database world). Dataframes can be constructed from arrays or from matrices from variety of files, SQL tables, and datasets (RDDs). Dataframe API is available in all flavours: Java, Scala, R and Python and hence it’s popularity.

Part 11 looks at external R and Python packages and DataFrame support:

When you install Spark, the extension of not only languages but also other packages, systems is huge. For example with R, not only that you can harvest the capabilities of distributed and parallel computations, you can also extend the use of R language.

Part 12 gets into Spark SQL:

Spark SQL is a one of the Spark modules for structured data processing and analysing. Spark provides Spark SQL and also API for execution of SQL queries. Spark SQL can read data from Hive instance, but also from datasets and dataframe. The communication between Spark SQL and execution engine will always result in a dataset or datafrane.

These formats are interchangeable. So interacting with SQL against result from a different API is possible, respectively. Plugging in the Java JDBD or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

With API unification, user can access Spark SQL using Scala spark-shell, using Python pyspark or using R sparkR shell.

DataFrames are so popular that they’ve become the de facto standard for working with data in Spark, and .NET languages only work with DataFrames, not with the raw RDDs.

Comments closed

Learning about RDDs in Spark

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For R language, sparklyr package is availble and for Python pyspark is availble.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). RDD is a fault-tolerant collection of files that can be used in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in driver program
– referencing a datasets in external storage (HDFS, blob storage, shared filesystem, Hadoop InputFormat,…)

In a simple way, Spark RDD has two opeartions:
– transformations – creating a new RDD dataset on top of already existing one with the last transformation
– actions – to the action, and return a value to the driver program after running a computation on the dataset.

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDD; transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but now show or return anything. We can say, that transformations are lazy because of updating existing RDD, these operations create another RDD. Actions on the other hand trigger the computations on RDD and show (return) the result of transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.

Comments closed

Deploying dbt on Databricks

Dave Eyler, et al, have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.

dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.

Comments closed

Using Scala at Databricks

Li Haoyi gives us a peek behind the curtain:

With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes around managing our large Scala codebase. From this post, you’ll learn about everything big and small that goes into making Scala at Databricks work, a useful case study for anyone supporting the use of Scala in a growing organization.

It’s always interesting to see how the largest companies handle certain classes of problems. From this post, we can get an idea of the high-level requirements and usage, making it worth the read.

Comments closed

Getting Around in Spark

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.

We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:

/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Installing Apache Spark

Tomaz Kastrun continues a series on Apache Spark:

Installing Apache Spark on Windows computer will require preinstalled Java JDK (Java Development Kit). Java 8 or later version, with current version 17. On Oracle website, download the Java and install it on your system. Easiest way is to download the x64 MSI Installer. Install the file and follow the instructions. Installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”.

Read on for instructions for both Windows and MacOS. You can also create a container running Spark, which is another helpful method.

Comments closed

ElasticMapReduce Serverless

Damon Cortesi, et al, announce serverless EMR is now in preview:

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

If you’re already using EMR for ephemeral work—that is, using a Spark cluster to perform data transformations and then shutting it down—this makes a lot of sense as long as there’s not a major difference in cost.

Comments closed

A Primer on Apache Spark

Tomaz Kastrun has started a new series:

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally it was developed at the Berkeley’s AMPLab, and later donated to the Apache Software Foundation, which has maintained it since.

Click through to learn more about the product.

Comments closed

Managing Spark on Kubernetes instead of YARN

Rohit Choudhary argues that it’s a good idea to move from YARN to Kubernetes for your Spark clusters:

When it comes to data operations, Spark provides a tremendous advantage as a resource for data operations because it aligns with the things that make data ops valuable. It is optimized for machine learning and AI, which are used for batch processing (in real-time and at scale), and it is adept at operating within different types of environments.

Spark doesn’t completely manage these clusters of machines but instead uses a cluster manager (known as a scheduler). Most companies have traditionally used the Java Virtual Machine (JVM)-based Hadoop YARN to manage their clusters. But with the dramatic rise of Kubernetes and cloud-native computing, many organizations are moving away from YARN to Kubernetes to manage their Spark clusters. Spark on Kubernetes is even now generally available since the Apache Spark 3.1 release in March 2021.

I see some of the benefits there but am not totally sold, especially given the complexity of Kubernetes and its own lack of built-in security measures.

Comments closed