Spark – Page 22 – Curated SQL

Learning about RDDs in Spark

Published 2021-12-13 by Kevin Feasel

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For R language, sparklyr package is availble and for Python pyspark is availble.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). RDD is a fault-tolerant collection of files that can be used in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in driver program
– referencing a datasets in external storage (HDFS, blob storage, shared filesystem, Hadoop InputFormat,…)
In a simple way, Spark RDD has two opeartions:
– transformations – creating a new RDD dataset on top of already existing one with the last transformation
– actions – to the action, and return a value to the driver program after running a computation on the dataset.

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDD; transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but now show or return anything. We can say, that transformations are lazy because of updating existing RDD, these operations create another RDD. Actions on the other hand trigger the computations on RDD and show (return) the result of transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.

Comments closed

Deploying dbt on Databricks

Published 2021-12-09 by Kevin Feasel

Dave Eyler, et al, have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.
dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.

Comments closed

IDE Options for Spark

Published 2021-12-07 by Kevin Feasel

Tomaz Kastrun shares some thoughts on IDEs for writing Spark code:

Let’s look into the IDE that can be used to run Spark.
Remember that Spark can be used with languages: Scala, Java, R, Python and each give you different IDE and different installations.

To Tomaz’s list, I’d add IntelliJ IDEA for Scala programming. I was never much a fan of Eclipse, but there are people who swear by it.

Comments closed

Using Scala at Databricks

Published 2021-12-07 by Kevin Feasel

Li Haoyi gives us a peek behind the curtain:

With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes around managing our large Scala codebase. From this post, you’ll learn about everything big and small that goes into making Scala at Databricks work, a useful case study for anyone supporting the use of Scala in a growing organization.

It’s always interesting to see how the largest companies handle certain classes of problems. From this post, we can get an idea of the high-level requirements and usage, making it worth the read.

Comments closed

Getting Around in Spark

Published 2021-12-06 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.
We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:
/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Installing Apache Spark

Published 2021-12-03 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark:

Installing Apache Spark on Windows computer will require preinstalled Java JDK (Java Development Kit). Java 8 or later version, with current version 17. On Oracle website, download the Java and install it on your system. Easiest way is to download the x64 MSI Installer. Install the file and follow the instructions. Installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”.

Read on for instructions for both Windows and MacOS. You can also create a container running Spark, which is another helpful method.

Comments closed

ElasticMapReduce Serverless

Published 2021-12-02 by Kevin Feasel

Damon Cortesi, et al, announce serverless EMR is now in preview:

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.
In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

If you’re already using EMR for ephemeral work—that is, using a Spark cluster to perform data transformations and then shutting it down—this makes a lot of sense as long as there’s not a major difference in cost.

Comments closed

A Primer on Apache Spark

Published 2021-12-02 by Kevin Feasel

Tomaz Kastrun has started a new series:

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally it was developed at the Berkeley’s AMPLab, and later donated to the Apache Software Foundation, which has maintained it since.

Click through to learn more about the product.

Comments closed

Managing Spark on Kubernetes instead of YARN

Published 2021-11-29 by Kevin Feasel

Rohit Choudhary argues that it’s a good idea to move from YARN to Kubernetes for your Spark clusters:

When it comes to data operations, Spark provides a tremendous advantage as a resource for data operations because it aligns with the things that make data ops valuable. It is optimized for machine learning and AI, which are used for batch processing (in real-time and at scale), and it is adept at operating within different types of environments.
Spark doesn’t completely manage these clusters of machines but instead uses a cluster manager (known as a scheduler). Most companies have traditionally used the Java Virtual Machine (JVM)-based Hadoop YARN to manage their clusters. But with the dramatic rise of Kubernetes and cloud-native computing, many organizations are moving away from YARN to Kubernetes to manage their Spark clusters. Spark on Kubernetes is even now generally available since the Apache Spark 3.1 release in March 2021.

I see some of the benefits there but am not totally sold, especially given the complexity of Kubernetes and its own lack of built-in security measures.

Comments closed

MMLSpark Is Now SynapseML

Published 2021-11-24 by Kevin Feasel

Mark Hamilton has an announcement:

Today, we’re excited to announce the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. Building production-ready distributed ML pipelines can be difficult, even for the most seasoned developer. Composing tools from different ecosystems often requires considerable “glue” code, and many frameworks aren’t designed with thousand-machine elastic clusters in mind. SynapseML resolves this challenge by unifying several existing ML frameworks and new Microsoft algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.

Read on to learn more about the library.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Spark