Press "Enter" to skip to content

Category: Hadoop

Adding Libraries in Databricks

Arun Sirpal has some third-party libraries to add:

It is a really common requirement to add specific libraries to Databricks. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.

Libraries can be added in three scopes: workspace, notebook-scoped, and cluster. I want to show you how easy it is to add (and search for) a library at the cluster scope, so that all notebooks attached to the cluster can leverage the library.
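As a quick aside, for a notebook-scoped library the install can happen straight from a notebook cell on a recent Databricks runtime (the package name below is just an example), while the cluster-scoped route Arun demonstrates goes through the cluster's Libraries tab:

# notebook-scoped install: only notebooks running this command see the library
%pip install great-expectations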

I’m hoping that loading libraries in Azure Synapse Analytics will, at some point, be this convenient.

Comments closed

Dotnet-Spark UDFs and Missing Shared State

Ed Elliott uncovers a mystery:

To understand this we need to take a look at how we can create a UDF in .NET that is called by the Java VM Apache Spark code because that is, logically, what happens. In our application we call into Apache Spark and ask it to do things like read from a file, run some transformation, and write files back out again. With UDFs, we ask Spark to run a UDF and Spark comes back to our UDF, passing it some data and asking it to execute, but the Java VM does not understand how to execute .NET code.

Read the whole thing.

Comments closed

Preparing for the Kafka-Zookeeper Breakup

Yeva Byzek prepares us:

As described in the blog post Apache Kafka® Needs No Keeper: Removing the Apache ZooKeeper Dependency, when KIP-500 lands next year, Apache Kafka will replace its usage of Apache ZooKeeper with its own built-in consensus layer. This means that you’ll be able to remove ZooKeeper from your Apache Kafka deployments so that the only thing you need to run Kafka is…Kafka itself.

Kafka’s new architecture provides three distinct benefits. First, it simplifies the architecture by consolidating metadata in Kafka itself, rather than splitting it between Kafka and ZooKeeper. This improves stability, simplifies the software, and makes it easier to monitor, administer, and support Kafka. Second, it improves control plane performance, enabling clusters to scale to millions of partitions. Finally, it allows Kafka to have a single security model for the whole system, rather than having one for Kafka and one for ZooKeeper. Together, these three benefits greatly simplify overall infrastructure design and operational workflows.

Read on to see where this story is at and what kinds of changes you’ll have to make to code.

Comments closed

Using Key Vault in Azure Databricks

Arun Sirpal shows us how easy it is to tie Azure Key Vault into Azure Databricks:

The key vault should always be a core component of your Azure design because we can store keys, secrets, and certificates there, thus abstracting / hiding the true connection string from your files. When working with Databricks to mount storage, to ingest your data and query it, ideally you should be leveraging this to create secrets and secret scopes.
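As a minimal sketch of the payoff (the scope, key, and storage account names below are made up), once a Key Vault-backed secret scope exists, a notebook can pull the secret at mount time instead of embedding it in code:

# read a secret from a Key Vault-backed secret scope (all names here are illustrative)
storage_key = dbutils.secrets.get(scope="my-keyvault-scope", key="storage-account-key")

# use the secret when mounting blob storage; the value is redacted if you try to display it
dbutils.fs.mount(
  source="wasbs://data@mystorageacct.blob.core.windows.net",
  mount_point="/mnt/data",
  extra_configs={"fs.azure.account.key.mystorageacct.blob.core.windows.net": storage_key}
)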

Click through for a demo.

Comments closed

Outlier Identification Using Spark 3.0

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use the SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for the median we can provide ‘array(0.5)’ because we want the 50% value, i.e. the median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers a value needs to be a certain number of MADs away. This number is also referred to as the threshold and defaults to 3.
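A minimal PySpark sketch of that recipe (the column name and sample data are made up; the threshold of 3 follows the quote):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (2.5,), (3.0,), (100.0,)], ["value"])

# median of the column via the percentile SQL expression
median = df.select(F.expr("percentile(value, array(0.5))")[0].alias("m")).first()["m"]

# absolute deviation of each value from the median, then the median of those deviations (MAD)
devs = df.withColumn("abs_dev", F.abs(F.col("value") - F.lit(median)))
mad = devs.select(F.expr("percentile(abs_dev, array(0.5))")[0].alias("m")).first()["m"]

# flag anything more than 3 MADs from the median as an outlier
outliers = devs.filter(F.col("abs_dev") > 3 * mad)
outliers.show()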

Read on for three measures and their implementations in PySpark.

Comments closed

Features and Improvements in Spark 3.0

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With the Spark 3.0 release (in June 2020) there are some major improvements over previous releases. Some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning, and other performance optimizations and enhancements.

Below I’ve listed out these new features and enhancements all together on one page for better understanding and future reference.
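For quick reference, the two headline optimizer features are toggled by configuration (a sketch; in Spark 3.0, AQE ships disabled by default while dynamic partition pruning is enabled by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution: re-optimizes the remaining plan using runtime statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Dynamic Partition Pruning: skips fact-table partitions based on the dimension-side filter
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")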

Click through for the summary.

Comments closed

Intrusion Detection Using ksqlDB

Maxime Ribera and Geraud Duge de Beronville take us through an interesting tutorial:

Apache Kafka® is a distributed real-time processing platform that allows for the ingestion of huge volumes of data. ksqlDB is part of the Kafka ecosystem and offers a SQL-like language to query and process large-scale, real-time data. This blog post demonstrates how to quickly process network activity for intrusion detection using both Kafka and ksqlDB.

For testing purposes (and to avoid being banned from the enterprise network), a virtualized environment through Vagrant is used.

Click through for the scenario.

Comments closed

Adaptive Query Execution in Databricks

MaryAnn Xue and Allison Wang explain how Adaptive Query Execution works with Databricks:

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.

AQE offers a trouble-free solution here by switching to the faster broadcast hash join during execution time.
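As a rough illustration of that runtime switch (the data here is synthetic), when the filtered side of a join turns out to be much smaller than the planner estimated, AQE can replace a planned sort-merge join with a broadcast hash join:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")

fact = spark.range(1000000).withColumnRenamed("id", "customer_id")
dim = (spark.range(1000000).withColumnRenamed("id", "customer_id")
       .withColumn("country", F.when(F.col("customer_id") % 1000 == 0, "IS").otherwise("US")))

# plan-time statistics may overestimate the filtered dimension side;
# after the filter actually runs, AQE sees its true size and can broadcast it instead
joined = fact.join(dim.filter(F.col("country") == "IS"), "customer_id")
joined.explain()  # the adaptive plan can end up as a broadcast hash join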

This is pretty similar to Adaptive Query Processing in SQL Server.

Comments closed

The Spark Starter Guide

Landon Robinson has some good news for us:

If you visit hadoopsters.com/spark or thesparkguide.com, you’ll see something new and exciting from us. It’s official: we’ve written and are publishing a comprehensive guide to Apache Spark.

This guide will be completely online and completely free. A book’s worth of content, containing exercises in Python and Scala to teach you Spark, at your fingertips. Again, free.

Landon has posted chapter 1, section 1 already:

This section introduces the concept of data pipelines – how data is processed from one form into another. It’s also the generic term used to describe how data moves from one location or form, and is consumed, altered, transformed, and delivered to another location or form.

You’ll be introduced to Spark functions like join, filter, and aggregate to process data in a variety of forms. You’ll learn it all through interactive Spark exercises in Scala and Python.
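As a taste of what those look like together in PySpark (a toy sketch, not taken from the guide itself):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 20.0), (2, "books", 35.0), (3, "games", 60.0)],
    ["customer_id", "category", "amount"])
customers = spark.createDataFrame([(1, "Ana"), (2, "Ben"), (3, "Caz")], ["customer_id", "name"])

# join the two DataFrames, filter out small orders, then aggregate per category
result = (orders.join(customers, "customer_id")
                .filter(F.col("amount") > 25)
                .groupBy("category")
                .agg(F.sum("amount").alias("total")))
result.show()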

This is very early in the process but I’m excited.

Comments closed