Hadoop – Page 39 – Curated SQL

Yesterday we started working towards data import and how to use drop zone to import data to DBFS. We have also created our first Notebook and this is where I would like to start today. With a light introduction to notebooks.

Read on for a depiction of notebooks, as well as an example which loads data into the Databricks File System (DBFS).

Comments closed

Joining Data Streams in Flink

Published 2020-12-04 by Kevin Feasel

Kundan Kumarr crosses the streams:

Apache Flink offers rich sources of API and operators which makes Flink application developers productive in terms of dealing with the multiple data streams. Flink provides many multi streams operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window.

Click through for an example of the fluent API approach. It’s not as nice as proper SQL, but it does the job.

Comments closed

Spark Starter Guide: Data Standardization

Published 2020-12-04 by Kevin Feasel

Ladon Robinson continues the Spark Starter Guide:

Standardization is the practice of analyzing columns of data and identifying synonyms or like names for the same item. Similar to how a cat can also be identified as a kitty, kitty cat, kitten or feline, we might want to standardize all of those entries into simply “cat” so our data is less messy and more organized. This can make future processing of the data more streamlined and less complicated. It can also reduce skew, which we address in Addressing Data Cardinality and Skew.
We will learn how to standardize data in the following exercises.

Check it out. I’m excited to see the Spark Starter Guide get fleshed out and written.

Comments closed

Creating a Spark DataFrame Ex Nihilo

Published 2020-12-03 by Kevin Feasel

Rahul Agarwal shows how you can gin up your own Spark DataFrame:

In broad terms, a DataFrame(DF) is a distributed, table-like structure with rows and columns and has a well-defined schema. DataFrames can be constructed from a wide variety of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Click through for an example in Scala.

Comments closed

Identifying Straggler Tasks in Spark Applications

Published 2020-11-25 by Kevin Feasel

Ajay Gupta clues us in on a process:

What Is a Straggler in a Spark Application?
A straggler refers to a very very slow executing Task belonging to a particular stage of a Spark application (Every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage). A straggler Task takes an exceptionally high time for completion as compared to the median or average time taken by other tasks belonging to the same stage. There could be multiple stragglers in a Spark Job being present either in the same stage or across multiple stages.

Read on to understand the consequences and causes of these straggler tasks, as well as what you can do about them.

Comments closed

Scaling ksqlDB, with Animations

Published 2020-11-23 by Kevin Feasel

Michael Drogalis walks us through scaling models with ksqlDB:

Software engineering memes are in vogue, and nothing is more fashionable than joking about how complicated distributed systems can be. Despite the ribbing, many people adopt them. Why? Distributed systems give us two things their single node counterparts cannot: scale and fault tolerance.
ksqlDB, the event streaming database, is built with a client/server architecture. You can run it with a single server, or you can cluster many servers together. Part 1 and part 2 in this series explained how a single server executes stateless and stateful operations. This post is about how these work when ksqlDB is deployed with many servers, and more importantly how it linearly scales the work it is performing—even in the presence of faults.
If you like, you can follow along by executing the example code yourself. ksqlDB’s quickstart makes it easy to get up and running.

Click through for well-animated examples.

Comments closed

Q&A about the Lakehouse

Published 2020-11-20 by Kevin Feasel

Terry McCann posts Q&A from Simon Whiteley’s session on Lakehouse models in Spark 3.0:

“WHILE ALL THE HADOOP PROVIDERS PROMOTED THE DATALAKE PARADIGM BACK THEN, HOW THE INDUSTRY AND THE OTHER DATA LAKE PROVIDERS ARE SHIFTING TO/CONSIDERING THE LAKE HOUSE PARADIGM?“
It’s a direction that most providers are heading in, albeit under the “unified analytics” or “modern warehouse” name rather than the “lakehouse”. But most big relational engines are moving to bring in spark/big data capabilities, other lake providers are looking to expand their SQL coverage. It’s a bit of a race to who gets to the “can do both sides as well as a specialist tool” point first. Will we see other tools championing it as a “lakehouse”, or is that term now tied too closely as a “vendor-specific” term coming from Databricks? We’ll see…

Click through for some good questions and thoughtful answers.

Comments closed

The Evolving Lakehouse

Published 2020-11-13 by Kevin Feasel

Simon Whiteley looks at the current status of the Lakehouse model:

We have discussed in the past this idea of the lakehouse, the aspirational target of many analytics platforms these days of combining the huge power and potential of data lakes with the rigour, reliability and concurrency of a data warehouse. It’s an interesting concept but has, in the past, been firmly an aspiration.
In the world without lakehouses, we often see the “Modern Data Warehouse”, this two-phased approach to providing a holistic platform – we load our early data into a lake where we shape it and massage it into an understandable state. It is here we perform data science, exploratory data analysis, early sight analytics prototyping and various other functions that don’t quite fit into a data warehouse… but then we load our data into a relational store for serving to the business. This is where we can meet their demands for a rich SQL environment, auditable data models and rigorous change procedures. Essentially, we store data twice so that we can achieve the best of both worlds.

Definitely read Simon’s take on it. My take is that the Lakehouse concept will start to be useful to specific companies in about 2-3 years, as I don’t think the performance is there today.

Comments closed

Monitoring Kafka Metrics with Prometheus and Grafana

Published 2020-11-12 by Kevin Feasel

Murat Derman shows how you can use Prometheus and Grafana to track vital measures on an Apache Kafka cluster:

You can add scrape_interval parameter in your configuration by default it is every 1 minute scrape_interval: 5s
Prometheus has its own query language called promql. You can learn more about this language from this here https://prometheus.io/docs/prometheus/latest/querying/basics/
There are lot of metrics you can define for Kafka. I will mention a few of them in this article

Read on for a breakdown of some of these measures.

Comments closed

Automating Hadoop Workflows with Spark and Oozie

Published 2020-11-06 by Kevin Feasel

Prashanth Jayaram walks us through automating a sample data transfer with tools like Sqoop, Spark, and Oozie:

In the process of building a data product one would end-up applying many resource-intensive analytical operations on a medium to large data-set in an efficient way. Apache Spark is the bet in this scenario to perform faster job execution by caching data in memory and enabling parallelism in a distributed data environments.
Components involved in Spark implementation:
1. Initialize spark session using scala program
2. Ingest data from data lake through hive queries
3. Apply business logic using scala constructs or hive queries
4. Load data into HDFS or Hive targets
5. Execute spark programs through spark submit

Read on for a sample flow.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Category: Hadoop

Using Notebooks to Load Data into the Databricks File System

Joining Data Streams in Flink

Spark Starter Guide: Data Standardization

Creating a Spark DataFrame Ex Nihilo

Identifying Straggler Tasks in Spark Applications

Scaling ksqlDB, with Animations

Q&A about the Lakehouse

The Evolving Lakehouse

Monitoring Kafka Metrics with Prometheus and Grafana

Automating Hadoop Workflows with Spark and Oozie