Press "Enter" to skip to content

Category: Hadoop

Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns from data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and delta lake.

Comments closed

Creating Identity Columns in Databricks

Franco Patano generates some identity integers:

Identity columns solve the issues mentioned above and provide a simple, performant solution for generating surrogate keys. Delta Lake is the first data lake protocol to enable identity columns for surrogate key generation.

Delta Lake now supports creating IDENTITY columns that can automatically generate unique, auto-incrementing ID numbers when new rows are loaded. While these ID numbers may not be consecutive, Delta makes the best effort to keep the gap as small as possible. You can use this feature to create surrogate keys for your data warehousing workloads easily.

This is a bit light on explanation, unfortunately. With distributed systems, generating identities is historically tricky (especially with several independent nodes generating values) so I’d be curious to see how it works: do they allocate blocks of IDs to worker nodes or do something else? And are the IDs guaranteed to be monotonically increasing? Or is there some other service which “labels” the data upon insert and provides those IDs?

Comments closed

Serverless Compute for Databricks SQL

Nikhil Jethava and Shankar Sivadasan make an announcement:

We are excited to announce the preview of Serverless compute for Databricks SQL (DBSQL) on Azure Databricks. DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless compute for DBSQL helps address challenges customers face with cluster startup time, capacity management, and infrastructure costs:

Click through for more details and a short video. Azure Synapse Analytics and Databricks are definitely going head-to-head in the modern data warehousing space and I’m fine with that—hopefully it makes both products better as a result.

Comments closed

Security Practices for Delta Sharing

Andrew Weaver, et al, share some advice:

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.

Setting the appropriate token lifetime is critically important for regulatory, compliance and reputational standpoint. Having a token that never expires is a huge risk; therefore, it is recommended using short-lived tokens as best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

Click through for eight such tips.

Comments closed

Azure Databricks Initialization Scripts

Alex Crampton explains how initialization scripts work in Azure Databricks:

This blog will demonstrate the use of cluster-scoped initialisation scripts for Azure Databricks. An example will run through how to configure an initialisation script to install libraries on to a cluster that are not included in the Azure Databricks runtime environment. It will cover how to do this firstly using the Databricks UI, followed by how to include it in your CI/CD solutions.

Read on for some examples.

Comments closed

Z-Ordering with Apache Impala

Zoltan Borok-Nagy and Norbert Luksa show off a performance improvement in Apache Impala:

So we’ll have great search capabilities against the partition columns plus one data column (which drives the ordering in the data files). With our sample schema above, this means we could specify a SORT BY “platform” to enable fast analysis of all Android or iOS users. But what if we wanted to understand how well version 5.16 of our app is doing across platforms and countries?

Can we do more? It turns out that we can. There are exotic orderings out there that can also sort data by multiple columns. In this post, we will describe how Z-order allows ordering of multidimensional data (multiple columns) with the help of a space-filling curve. This ordering enables us to efficiently search against more columns. More on that later.

It looks like a really good technique for nearly-static data, sort of like you’d see with a data warehouse which refreshes once a day.

Comments closed

Databricks Extension for VSCode at 1.0

Gerhard Brueckl shares the good news:

As you probably know from my previous posts, my colleagues at paiqo.com and I are constantly working to improve our VSCode extension for Databricks. Almost every month we silently release a new version to the VSCode gallery so you get the latest features. However, as this is a special release, I am also writing a dedicated blog post for it

There’s a lot of cool stuff in here, so check it out.

Comments closed

Feeding Synapse Spark Info to On-Prem Kafka Clusters

Bhadreshkumar Shiyal finds a solution:

Microsoft’s official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint – Azure Data Fac….

Although based upon the article’s solution, to meet our requirements we needed to substitute On-Prem Apache Kafka for On-Prem SQL Server and instead of an Azure Data Factory inside a Managed Vnet, we used a Synapse Workspace inside a Managed Vnet. The “Forwarding Vnet” concept explained in the above tutorial remains as-is in our approach.

As soon as you turn on Data Exfiltration Protection (DEP), the lockdown is real. Click through to see what the process of exfiltrating data through an approved mechanism looks like.

Comments closed

“Warming Up” Databricks Clusters

Ust Oldfield needs that cluster to be up:

Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 mins through to almost 10 mins. For some workloads and users, this waiting time can be frustrating if not unacceptable.
For this use case, we had streaming clusters that needed to be available for when streams started at 07:00 and to be turned off when streams stopped being sent at 21:00. Similarly, there was also need from business users for their SQL Warehouse clusters to be available for when business started trading so that their BI reports didn’t timeout waiting for the clusters to start.

Read on to see one way to solve this problem without having a cluster run 24/7.

Comments closed

Python UDFs in Databricks SQL

Martin Grund, et al, announce a new preview feature in Databricks:\

To define the Python UDF, all you have to do is a CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as PYTHON, and provides the function body between $$.

The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation’s final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.

I think my biggest concern here would be performance, though I say that without having used the feature.

Comments closed