Hadoop – Page 50 – Curated SQL

Azure Active Directory and the DatabricksPS Library

Published 2020-06-10 by Kevin Feasel

Gerhard Brueckl has updated the DatabricksPS library:

Databricks recently announced that it is now also supporting Azure Active Directory Authentication for the REST API which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With Databricks REST API finally supporting Azure Active Directory Authentication of regular users and service principals, this last manual step is finally also gone!

If you do use Databricks and haven’t tried out DatabricksPS, I highly recommend it. I think it’s a much nicer experience than hitting the REST API directly, particularly because it deals with continuation tokens and making multiple calls to get your results.

Comments closed

Using UDFs in Spark without Registration

Published 2020-06-05 by Kevin Feasel

Sourabh Mehta shows how we can immediately call a user-defined function in Spark without registering it first:

Here, we will demonstrate the use of UDF via a small example.
Use Case: We need to change the value of an existing column of DF/DS to add some prefix or suffix to the existing value in a new column.

I’m actually not sure what benefit you gain from not registering the UDF, but there probably is one.

Comments closed

Building a Stream Processing App with ksql

Published 2020-06-05 by Kevin Feasel

The Hadoop in Real World team walks us through event streaming with ksql:

ksqlDB is an event streaming database that enables creating powerful stream processing applications on top of Apache Kafka by using the familiar SQL syntax, which is referred to as KSQL. This is a powerful concept that abstracts away much of the complexity of stream processing from the user. Business users or analysts with SQL background can query the complex data structures passing through kafka and get real-time insights. In this article, we are going to see how to set up ksqlDB and also look at important concepts in ksql and its usage.

Event streaming has become a lot easier over the past couple of years, as Kafka, Spark, and Flink have all matured.

Comments closed

Running Databricks ETL Code from Azure Data Factory

Published 2020-06-05 by Kevin Feasel

Ed Elliott shares several approaches for executing Databricks code from Azure Data Factory:

.Net
No post I write about Spark is going to miss off .Net, if you are writing C# or F# then you can’t use a notebook anyway. If you do this then you will also include the .Net driver JAR file (weird statement to make) and call that.

Read on for all of the techniques.

Comments closed

Dynamic Partition Pruning in Apache Spark 3.0

Published 2020-06-02 by Kevin Feasel

Anjali Sharma walks us through a nice improvement in Spark SQL coming with Apache Spark 3.0:

Partition pruning in Spark is a performance optimization that limits the number of files and partitions that Spark reads when querying. After partitioning the data, queries that match certain partition filter criteria improve performance by allowing Spark to only read a subset of the directories and files. When partition filters are present, the catalyst optimizer pushes down the partition filters. The scan reads only the directories that match the partition filters, thus reducing disk I/O.
However, in reality data engineers don’t just execute a single query, or single filter in their queries, and the common case is that they actually have dimensional tables, small tables that they need to join with a larger fact table. So in this case, we can no longer apply static partition pruning because the filter is on one side of the join, and the table that is more appealing and more attractive to prune is on the other side of the join. So, we have a problem now.

And that’s where dynamic partition pruning comes into play.

Comments closed

How Writing to Parquet Works

Published 2020-06-01 by Kevin Feasel

Dmitry Tolpeko walks us through the algorithm for writing to Parquet format:

After writing first 100 values for a column (for 100 rows), the Parquet writer checks if this 100-values column content exceeds the specified page size (default is 1 MB).
If the raw data size for the column does not exceed the page size threshold then the next page size check is constantly adjusted based on the actual column size, so it neither checked after every column value nor after every 100 values. Thus the page size is not the strict limit.
If the raw data size size exceeds the page size, the column content is compressed (if a compression is specified for the Parquet file), and flushed into the Page store for the column.

This is a nice explanation of the process.

Comments closed

Adaptive Query Execution with Spark SQL

Published 2020-06-01 by Kevin Feasel

Wenchen Fan, Herman von Hoevell, and MaryAnn Xue announce Adaptive Query Execution for Apache Spark 3.0:

Over the years, there’s been an extensive and continuous effort to improve Spark SQL’s query optimizer and planner in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark choose better plans. Examples of these cost-based optimization techniques include choosing the right join type (broadcast hash join vs. sort merge join), selecting the correct build side in a hash-join, or adjusting the join order in a multi-way join. However, outdated statistics and imperfect cardinality estimates can lead to suboptimal query plans. Adaptive Query Execution, new in the upcoming Apache SparkTM 3.0 release and available in the Databricks Runtime 7.0 beta, now looks to tackle such issues by reoptimizing and adjusting query plans based on runtime statistics collected in the process of query execution.

One of the biggest advantages of SQL as a fourth-generation language is that the database engine (whether that be SQL Server, Oracle, or Spark) gets the opportunity to write and re-write the set of operations needed to solve a query to try to find the best path which returns the same result set. These optimizations aren’t perfect, as any query tuner can tell you, but they can go a long way.

Comments closed

Pandas UDFs and Python Type Hints in Spark 3.0

Published 2020-05-29 by Kevin Feasel

Hyukjin Kwon announces some updates forthcoming in Apache Spark 3.0:

The Pandas UDFs work with Pandas APIs inside the function and Apache Arrow for exchanging data. It allows vectorized operations that can increase performance up to 100x, compared to row-at-a-time Python UDFs.
The example below shows a Pandas UDF to simply add one to each value, in which it is defined with the function called pandas_plus_one decorated by pandas_udf with the Pandas UDF type specified as PandasUDFType.SCALAR.

Click through for explanations and demos for each.

Comments closed

Securing a Kafka Deployment

Published 2020-05-29 by Kevin Feasel

Rick Spurgeon and Nikoleta Verbeck give us some tips on securing Apache Kafka:

Confluent Platform provides Role-Based Access Control (RBAC), which addresses the gaps listed above. RBAC and ACLs perform similar functions but approach the problem differently and can be used independently or cooperatively. RBAC is powered by Confluent’s Metadata Service (MDS), which integrates with LDAP and acts as the central authority for authorization and authentication data. RBAC leverages role bindings to determine which users and groups can access specific resources and what actions can be performed within those resources (roles). RBAC is empowered on the Kafka cluster by way of Confluent Server. Confluent Server is a fully compatible Kafka broker which integrates commercial security features like RBAC.

The list of items is pretty straightforward with no major surprises.

Comments closed

Feeding Databricks Output to Azure SQL Database

Published 2020-05-29 by Kevin Feasel

Arun Sirpal takes us through the process of moving data from Databricks into Azure SQL Database:

Recently I got to a stage where I leveraged Databricks to the best of my ability to join couple of CSV files together, play around some aggregations and then output it back to a different mount point ( based on Azure Storage) as a parquet file, I decided that I actually wanted to move this data into Azure SQL DB, which you may want to do one day.

This isn’t just dropping files into Blob Storage and picking them up, but rather a direct integration.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop