

Wrapping up the Azure Databricks Advent

Tomaz Kastrun laughs at 24-day advent calendars:

In the last two days we have focused on understanding Apache Spark through performance tuning and through troubleshooting. Both require some deeper understanding of Spark and Azure Databricks, but both also give great insight to anyone who needs to improve performance and work with Spark.

Today, I would like to list some additional learning materials, documentation, and other resources for further exploration of Azure Databricks.

Click through for links to additional resources on Apache Spark and Databricks, as well as the other 30 entries in the series.


Deleting Messages and Topics in Kafka

The Hadoop in Real World team has a pair of related posts. The first is on how to remove messages in a Kafka topic:

The easiest way to purge or delete messages in a Kafka topic is by setting retention.ms to a low value. The retention.ms configuration controls how long messages should be kept in a topic. Once the age of a message in the topic hits the retention time, the message will be removed from the topic.

Note that the steps below delete or purge messages in your topic. Use caution when executing them.

Because Kafka is an immutable log rather than “final” storage, the ideal scenario has you never deleting data. But sometimes you just run low on disk space. You can also set a maximum retention size (retention.bytes) as another option. Note that neither approach lets you delete a single message; that’s not a good thing to do with a log. Instead, you offset or cancel out the message and submit a new one.
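If you’d rather do this from code than from the shell, here’s a minimal sketch of the retention.ms approach using Kafka’s AdminClient from Scala. The broker address and topic name are placeholders, and the five-minute pause is a crude stand-in for waiting on the broker’s log cleanup interval:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object PurgeTopicMessages {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker
    val admin = AdminClient.create(props)
    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic") // placeholder topic

    // Step 1: drop retention to one second so the broker's log cleanup
    // deletes segments whose messages are older than that.
    val purge = new AlterConfigOp(new ConfigEntry("retention.ms", "1000"), AlterConfigOp.OpType.SET)
    val purgeCfg: java.util.Map[ConfigResource, java.util.Collection[AlterConfigOp]] =
      Collections.singletonMap(topic, Collections.singletonList(purge))
    admin.incrementalAlterConfigs(purgeCfg).all().get()

    // Step 2: wait for cleanup to actually run (the check interval is a
    // broker setting, five minutes by default), then remove the override so
    // the topic falls back to the broker-level default retention.
    Thread.sleep(5 * 60 * 1000)
    val restore = new AlterConfigOp(new ConfigEntry("retention.ms", ""), AlterConfigOp.OpType.DELETE)
    val restoreCfg: java.util.Map[ConfigResource, java.util.Collection[AlterConfigOp]] =
      Collections.singletonMap(topic, Collections.singletonList(restore))
    admin.incrementalAlterConfigs(restoreCfg).all().get()

    admin.close()
  }
}
```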

The second post covers deletion of a Kafka topic:

In this post we will see how to delete a Kafka topic and get the details of the topic before deleting it.
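The same pair of operations, describing a topic and then deleting it, is available through the AdminClient as well. A minimal sketch, again with placeholder broker and topic names:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object DeleteTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker
    val admin = AdminClient.create(props)

    // Grab the details of the topic before deleting it
    val description = admin.describeTopics(Collections.singletonList("my-topic"))
      .all().get().get("my-topic")
    println(s"Partitions: ${description.partitions().size()}")

    // Delete the topic; the broker must be running with delete.topic.enable=true
    admin.deleteTopics(Collections.singletonList("my-topic")).all().get()

    admin.close()
  }
}
```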


Apache Spark Performance Tuning

Tomaz Kastrun provides a few hints for performance tuning Apache Spark code:

DataFrames versus Datasets versus SQL versus RDDs is another choice, yet it is a fairly easy one. DataFrames, Datasets, and SQL objects are all equal in performance and stability (at least from Spark 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom functions (UDFs), there will be some performance degradation with both R and Python, so switching to Scala or Java might be an optimisation.

Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.
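To see why UDFs specifically can hurt, here’s a minimal sketch (the column name and the trivial transformation are just for illustration). The built-in expression stays inside the Catalyst optimizer, while the UDF is an opaque function Spark cannot rewrite, and in Python or R it also pays per-row serialization costs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()

val df = spark.range(1000000L).toDF("id")

// Built-in expression: Catalyst can optimize it and generate code for it
val builtIn = df.withColumn("doubled", col("id") * 2)

// UDF: a black box to the optimizer, so Spark cannot push down or rewrite it
val doubleIt = udf((x: Long) => x * 2)
val viaUdf = df.withColumn("doubled", doubleIt(col("id")))

builtIn.explain() // compare the two physical plans
viaUdf.explain()
```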


Using PowerShell to Automate Azure Databricks Processes

Tomaz Kastrun continues a series on Databricks:

Yesterday we looked into bringing the capabilities of Databricks closer to your client machine, making coding, data wrangling, and data science a little bit more convenient.

Today we will look into deploying a Databricks workspace using PowerShell.

By the way, if PowerShell automation of Databricks tasks is of interest to you, also check out Gerhard Brueckl’s extension module for much more along those lines.

Also, I give Tomaz a lot of credit: most Advent calendars stop at 24 days but Tomaz laughs off such limitations.


Spark Streaming in a Databricks Notebook

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming can analyse not only batches of data but also streams of data in near real-time. It enables powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system, meaning that, thanks to the lineage of operations, Spark will always remember where you stopped and, in case of a worker error, another worker can always recreate all the data transformations from the partitioned RDD (assuming that all the RDD transformations are deterministic).

Click through for the demo.
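Before you do, here’s a flavor of the API: a minimal Structured Streaming sketch using the built-in rate source, which generates test rows. The checkpoint path is a placeholder, and it is what lets Spark recover a failed query from where it stopped:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

// The built-in "rate" source emits (timestamp, value) rows; handy for demos
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// A running count over everything seen so far
val counts = stream.groupBy().count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/rate-demo") // placeholder path; enables recovery
  .start()

query.awaitTermination()
```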


Using Scala in a Databricks Notebook

Tomaz Kastrun takes a look at the original Spark language:

Let us start with the Databricks datasets, which are available within every workspace and are there mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example, the Iris dataset is available with the base R engine and with the Seaborn Python package. The same goes for Databricks: sample datasets can be found in the /databricks-datasets folder.

Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.
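As a small taste, a Scala cell in a Databricks notebook can browse and load those samples like this. The CSV path is one commonly shipped sample; check the folder listing for what your workspace actually has:

```scala
// In a Databricks notebook, spark, dbutils, and display are predefined
display(dbutils.fs.ls("/databricks-datasets"))

// Load one of the sample CSVs into a DataFrame
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/samples/population-vs-price/data_geo.csv")

display(df.limit(5))
```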


Apache Spark Basics in Azure Synapse Analytics

Euan Garden shows off some Apache Spark functionality in Azure Synapse Analytics:

Apache Spark has been a long-time favorite tool amongst data engineers and data scientists; it is well known for handling large-scale data processing and complex machine learning workloads.

Azure Synapse Analytics offers a fully managed and integrated Apache Spark experience. By leveraging Apache Spark in Azure Synapse, you can benefit from integrated security, fully managed provisioning, and tight coupling to other Azure services, such as SQL databases (dedicated and serverless), Azure Key Vault, ADLS Gen2, and Azure Blob Storage, as well as fast-starting, high-performance compute instances.

Click through for the demo.
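As a tiny example of that tight coupling, a Synapse Spark notebook can read directly from ADLS Gen2 with an abfss:// URI. The storage account, container, and path below are placeholders:

```scala
// Read Parquet data straight out of an ADLS Gen2 container; this assumes the
// Synapse workspace identity (or your AAD login) has access to the account
val df = spark.read.parquet(
  "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/raw/sales.parquet")

df.printSchema()
df.show(10)
```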


TF-IDF in .NET for Spark, Updated

Ed Elliott has been busy:

Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark.

In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post; the first version was written before version 1 of .NET for Apache Spark, when a vital piece of the implementation was missing, which meant that although we could build the model in .NET, we couldn’t actually use it. The necessary functionality is now available, so I am updating the post. To see the previous version, go to: https://the.agilesql.club/2020/07/tf-idf-in-.net-for-apache-spark-using-spark-ml/

Read on for more information, as well as a call to action.
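For context, the Spark ML pipeline that the .NET bindings wrap looks roughly like this in native Scala. The toy documents and the feature vector size are chosen just for illustration:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()
import spark.implicits._

val docs = Seq("the quick brown fox", "jumped over the lazy dog").toDF("text")

// Tokenize, hash terms into a fixed-size vector, then weight by IDF
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1024)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val words = tokenizer.transform(docs)
val featurized = hashingTF.transform(words)
val model = idf.fit(featurized)

model.transform(featurized).select("features").show(truncate = false)
```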
