Press "Enter" to skip to content

Category: Spark

Spark Access Control in Qubole

Achuth Rajagopal and Shridhar Ramachandran show off the Spark Data Access Control Framework on Qubole’s platform:

With these requirements in mind, we decided to implement Hive Authorization as our first Policy Manager. Hive Authorization policies are stored in the Qubole Metastore which acts as a shared central component and stores metadata related to Hive Resources like Hive Tables. We enhanced Spark to honor the policies stored in the Qubole Metastore while accessing Hive Tables or for adding and modifying those policies.

In summary, we implemented a SQL standard access control layer identical to what is present in Apache Hive or Presto today. The following sections detail the architecture and provide an example that illustrates how it works.
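
To make that concrete, here is a hypothetical sketch of the SQL-standard statements such a layer accepts, issued through Spark SQL. The database, table, and role names are invented, and stock open-source Spark will not parse or enforce these on its own; handling them is exactly what the enhanced Spark described in the post adds.

```python
# Hypothetical illustration only: Hive-style SQL-standard authorization commands
# issued through Spark SQL. Names are invented, and these statements only work
# against a policy-aware deployment like the one described in the post.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# An admin grants read access on a Hive table to an analyst role...
spark.sql("GRANT SELECT ON TABLE sales.orders TO ROLE analysts")

# ...so members of that role can now read the table, while writes remain denied.
spark.sql("SELECT COUNT(*) FROM sales.orders").show()

# Revoking follows the same pattern.
spark.sql("REVOKE SELECT ON TABLE sales.orders FROM ROLE analysts")
```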

Click through to learn more.

Comments closed

MLflow 1.1 Released

Max Allen, et al, announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.

Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
– Pandas DataFrame-based search API (see the sketch after this list)
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination
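
As a quick taste of two of those items, here is a minimal sketch of Keras autologging plus the pandas DataFrame search API. The experiment name, metric, and filter below are made up for illustration.

```python
# Minimal sketch: MLflow 1.1 autologging and DataFrame-based run search.
# Experiment and metric names are invented; assumes Keras is installed.
import mlflow
import mlflow.keras

mlflow.set_experiment("demo-autolog")
mlflow.keras.autolog()   # parameters, metrics, and the model get logged automatically

# ... build and fit a Keras model here; autologging captures the run ...

# Search runs and get the results back as a pandas DataFrame.
runs = mlflow.search_runs(filter_string="metrics.val_loss < 0.5")
print(runs[["run_id", "metrics.val_loss"]].head())
```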

Looks like they’re putting in a lot of work on this.

Comments closed

Databricks Runtime 5.5

Bilal Aslam and Yifan Cao announce Databricks Runtime 5.5:

Secrets API in R notebooks
The Databricks Secrets API [Azure|AWS] lets you inject secrets into notebooks without hardcoding them. As of Databricks Runtime 5.5, this API is available in R notebooks in addition to existing support for Python and Scala notebooks. You can use the dbutils.secrets.get function to obtain secrets. Secrets are redacted before printing to a notebook cell.
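
For reference, this is roughly what the call looks like from a Python notebook today; per the announcement, the same dbutils.secrets.get call is what becomes available from R. The scope and key names below are placeholders.

```python
# Inside a Databricks notebook (dbutils is supplied by the runtime).
# The scope and key names are placeholders.
jdbc_password = dbutils.secrets.get(scope="prod-scope", key="jdbc-password")

# The secret is usable in code, but Databricks redacts it when printed to a cell.
print(jdbc_password)   # displays [REDACTED]
```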

There are some good updates in this release. Read on for the full list.

Comments closed

Notebooks in Azure Databricks

Brad Llewellyn takes us through Azure Databricks notebooks:

Azure Databricks Notebooks support four programming languages: Python, Scala, SQL and R.  However, selecting a language in this drop-down doesn’t limit us to only using that language.  Instead, it sets the default language of the notebook.  Every code block in the notebook is run independently and we can manually specify the language for each code block.

Before we get to the actual coding, we need to attach our new notebook to an existing cluster.  As we said, notebooks are nothing more than an interface for interactive code.  The processing is all done on the underlying cluster.
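
The per-cell language switch Brad mentions is done with a magic command on the first line of a cell. A small sketch, assuming a notebook whose default language is Python and a made-up temp view name (the %sql and %scala lines are shown as comments here, but each would be the first line of its own cell):

```python
# Cell 1: default language (Python); the spark session is provided by the notebook.
df = spark.range(10)
df.createOrReplaceTempView("numbers")

# Cell 2: override this cell's language with a magic command on its first line.
# %sql
# SELECT COUNT(*) AS n FROM numbers

# Cell 3: the same trick works for Scala (or R) cells.
# %scala
# val n = spark.table("numbers").count()
```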

Read on to learn how heavily Databricks leans on the notebook metaphor in how you interact with it.

Comments closed

How .NET Code Talks to Spark

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark via the Java VM:

4. The spark-dotnet Java driver listens on a TCP port
The spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code; the dotnet code doesn’t run in the Java VM but in a separate process, communicating with the Java VM over that TCP port. The year is 2019; we serialize and deserialize data all the time and don’t even know it. Hell, Notepad probably even does it.
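
To illustrate the general pattern (this is not the spark-dotnet wire protocol, just a toy in which a thread stands in for the JVM side and the two “processes” exchange JSON-serialized messages over a local TCP socket):

```python
import json
import socket
import threading

# Stand-in for the JVM side: bind a local TCP socket and answer one request.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

def jvm_side():
    conn, _ = server.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode("utf-8"))
        reply = {"echo": request["method"], "status": "ok"}
        conn.sendall(json.dumps(reply).encode("utf-8"))

threading.Thread(target=jvm_side).start()

# Stand-in for the dotnet side: serialize a "call", send it, deserialize the reply.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(json.dumps({"method": "createDataFrame"}).encode("utf-8"))
    print(json.loads(client.recv(4096).decode("utf-8")))

server.close()
```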

It’s serialization & deserialization as well as TCP sockets all the way down.

Comments closed

Spark and dotnet in a Single Container

Ed Elliott shows how you can combine Spark and .NET Core in a single Docker container:

This is quite new syntax in Docker and you need at least Docker 17.05 (client and daemon). After an image’s “FROM blah” you can specify a name (“core” in this case); then later you can copy from the first image to the second using “--from=” on the “COPY” command.

In this dockerfile I have added Spark 2.4.3 and the default environment variables we need to get Spark running. If you grab this dockerfile and run “docker build -t dotnet-spark .” you should get an image you can then run, which includes the dependencies for dotnet as well as Spark.

Ed includes all of the scripts needed to test this out, too.

Comments closed

Feeding IoT Data into Delta Lake

Saeed Barghi shows how you can stream sensor data from Azure IoT Hub into Databricks Delta Lake:

IoT devices produce a lot of data very fast. Capturing data from all those devices, which could number in the millions, and managing that data is the very first step in building a successful and effective IoT platform.

Like any other data solution, an IoT data platform could be built on-premises or in the cloud. I’m a huge fan of cloud-based solutions, especially PaaS offerings. After doing a little bit of research, I decided to go with Azure since it has the most comprehensive and easy-to-use set of service offerings when it comes to IoT, and they are reasonably priced. In this post, I am going to show how to build the architecture displayed in the diagram below: connect your devices to Azure IoT Hub and then ingest records into Databricks Delta Lake as they stream in, using Spark Streaming.
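
For a sense of what the streaming leg looks like, here is a rough PySpark sketch, assuming the azure-eventhubs-spark connector is attached to the cluster; the connection string, message schema, and paths are all placeholders.

```python
# Rough sketch: read from IoT Hub's Event Hubs-compatible endpoint, parse the
# JSON payload, and append to a Delta table. All names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

eh_conf = {
    # Placeholder; some connector versions expect this string to be encrypted
    # via EventHubsUtils.encrypt before being passed in.
    "eventhubs.connectionString": "<iot-hub-event-hubs-compatible-connection-string>"
}

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("enqueuedTime", TimestampType()),
])

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

readings = (raw
    .select(from_json(col("body").cast("string"), schema).alias("r"))
    .select("r.*"))

(readings.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/iot/_checkpoints/readings")
    .start("/delta/iot/readings"))
```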

Click through for the instructions.

Comments closed

Copying Cassandra Data to HDFS

Landon Robinson shows how you can use Spark to extract data from Cassandra and move it into HDFS:

Cassandra is a great open-source solution for accessing data at web scale, thanks in no small part to its low-latency performance. And if you’re a power user of Cassandra, there’s a high probability you’ll want to analyze the data it contains to create reports, apply machine learning, or just do some good old fashioned digging.

However, Cassandra can prove difficult to use as an analytical warehouse, especially if you’re using it to serve data in production around the clock. But one approach you can take is quite simple: copy the data to Hadoop (HDFS).
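
The core of that copy is only a few lines of Spark. A minimal PySpark sketch, assuming the spark-cassandra-connector package is on the cluster and with made-up keyspace, table, and path names:

```python
# Minimal sketch: read a Cassandra table with the spark-cassandra-connector and
# land it in HDFS as Parquet. Host, keyspace, table, and path are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("cassandra-to-hdfs")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate())

# Read the Cassandra table as a DataFrame...
df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")
    .load())

# ...and write it to HDFS as Parquet for downstream analysis.
df.write.mode("overwrite").parquet("hdfs:///data/cassandra_copies/events")
```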

Read on to learn how.

Comments closed

Scala 2.13 Changes

Anmol Sarna takes us through what’s new in Scala 2.13:

Last, but not least, the team has invested heavily in compiler speedups during the 2.13 cycle, which resulted in some major changes with respect to the compiler.

Compiler performance in 2.13 is 5-10% better compared to 2.12, thanks mainly to the new collections.

There are a lot of changes in this version. I wonder how long before Spark supports it fully.

Comments closed