Spark – Page 29 – Curated SQL

sparklyr 1.5 Released

Published 2020-12-16 by Kevin Feasel

Yitao Li announces version 1.5 of sparklyr:

A large fraction of pull requests that went into the sparklyr 1.5 release were focused on making Spark dataframes work with various dplyr verbs in the same way that R dataframes do. The full list of dplyr-related bugs and feature requests that were resolved in sparklyr 1.5 can be found in here.
In this section, we will showcase three new dplyr functionalities that were shipped with sparklyr 1.5.

Read on to learn more about this update. H/T R-Bloggers

Comments closed

Running Spark on Azure Kubernetes Service

Published 2020-12-16 by Kevin Feasel

Tsuyoshi Matsuzaki walks us through running Apache Spark on Azure Kubernetes Service:

Apache Spark officially includes Kubernetes support, and thereby you can run a Spark job on your own Kubernetes cluster. (See here for official document. Note that Kubernetes scheduler is currently experimental.)
Especially in Microsoft Azure, you can easily run Spark on cloud-managed Kubernetes, Azure Kubernetes Service (AKS).
In this post, I’ll show you step-by-step tutorial for running Apache Spark on AKS. In this tutorial, artifacts, such as, source code, data, and container images are all protected by Azure credentials (keys).
Although managed services for Apache Spark, such as, Azure Databricks, Azure Synapse Analytics, and Azure HDInsight, is the best place to run Spark workloads, you will get much flexibility by running workloads on managed Kubernetes (AKS) – such as, spot VM support, start/stop cluster, confidential computing (Intel SGX) support, so on and so forth.

Read on to see how. Though of these options, I’d probably choose Azure Databricks or Azure Synapse Analytics well before the others.

Comments closed

Using Koalas with Azure Databricks

Published 2020-12-14 by Kevin Feasel

Tomaz Kastrun continues a series on Azure Databricks:

So far, we looked into SQL, R and Python and this post will be about Python Koalas package. A special implementation of pandas DataFrame API on Apache Spark. Data Engineers and data scientist love Python pandas, since it makes data preparation with pandas easier, faster and more productive. And Koalas is a direct “response” to make writing and coding on Spark, easier and more familiar. Also follow the official documentation with full description of the package.

Click through for a quick demo.

Comments closed

Connecting to Blob Storage from Azure Databricks

Published 2020-12-09 by Kevin Feasel

Tomaz Kastrun continues opening doors in the advent calendar:

Yesterday we introduced the Databricks CLI and how to upload the file from “anywhere” to Databricks. Today we will look how to use Azure Blob Storage for storing files and accessing the data using Azure Databricks notebooks.

Click through to see how.

Comments closed

Ignoring Bad Dates when Moving to Spark 3

Published 2020-12-07 by Kevin Feasel

Robert Blackburn shows us one way to handle bad dates when moving to Spark 3:

Moving from a Spark 2 to a Spark 3 runtime has a lot of benefits including big performance improvements through adaptive query execution, dynamic partition pruning, and other optimizations. Some updates may require you to refactor your code. One of them is Delta tables now use the Proleptic Gregorian Calendar. Isn’t a calendar a calendar? Unfortunately, no. The Julian calendar has discrepancies with old dates. Specifically dates before 1582 and timestamps before 1900. Here we will dynamically update these dates for incoming source files.
If you would like to follow along in detail, I have a sample notebook that uses the community edition of Databricks. The DBC Archive file is here and the source file is here.

Fortunately, this change is unlikely to affect most of us, with perhaps the most common issue being that you used 0001-01-01 as a default date.

Comments closed

Using Notebooks to Load Data into the Databricks File System

Published 2020-12-07 by Kevin Feasel

Tomaz Kastrun is putting together an Advent of Azure Databricks:

Yesterday we started working towards data import and how to use drop zone to import data to DBFS. We have also created our first Notebook and this is where I would like to start today. With a light introduction to notebooks.

Read on for a depiction of notebooks, as well as an example which loads data into the Databricks File System (DBFS).

Comments closed

Joining Data Streams in Flink

Published 2020-12-04 by Kevin Feasel

Kundan Kumarr crosses the streams:

Apache Flink offers rich sources of API and operators which makes Flink application developers productive in terms of dealing with the multiple data streams. Flink provides many multi streams operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window.

Click through for an example of the fluent API approach. It’s not as nice as proper SQL, but it does the job.

Comments closed

Spark Starter Guide: Data Standardization

Published 2020-12-04 by Kevin Feasel

Ladon Robinson continues the Spark Starter Guide:

Standardization is the practice of analyzing columns of data and identifying synonyms or like names for the same item. Similar to how a cat can also be identified as a kitty, kitty cat, kitten or feline, we might want to standardize all of those entries into simply “cat” so our data is less messy and more organized. This can make future processing of the data more streamlined and less complicated. It can also reduce skew, which we address in Addressing Data Cardinality and Skew.
We will learn how to standardize data in the following exercises.

Check it out. I’m excited to see the Spark Starter Guide get fleshed out and written.

Comments closed

Creating a Spark DataFrame Ex Nihilo

Published 2020-12-03 by Kevin Feasel

Rahul Agarwal shows how you can gin up your own Spark DataFrame:

In broad terms, a DataFrame(DF) is a distributed, table-like structure with rows and columns and has a well-defined schema. DataFrames can be constructed from a wide variety of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Click through for an example in Scala.

Comments closed

Identifying Straggler Tasks in Spark Applications

Published 2020-11-25 by Kevin Feasel

Ajay Gupta clues us in on a process:

What Is a Straggler in a Spark Application?
A straggler refers to a very very slow executing Task belonging to a particular stage of a Spark application (Every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage). A straggler Task takes an exceptionally high time for completion as compared to the median or average time taken by other tasks belonging to the same stage. There could be multiple stragglers in a Spark Job being present either in the same stage or across multiple stages.

Read on to understand the consequences and causes of these straggler tasks, as well as what you can do about them.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark