Press "Enter" to skip to content

Category: Hadoop

Adding Libraries in Databricks

Arun Sirpal has some third-party libraries to add:

It is a really common requirement to add specific libraries to Databricks. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.

Libraries can be added in three scopes: workspace, notebook-scoped, and cluster. I want to show you how easy it is to add (and search for) a library at the cluster scope, so that all notebooks attached to the cluster can leverage the library.
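As a quick aside, for a notebook-scoped library the install can happen straight from a notebook cell on a recent Databricks runtime (the package name below is just an example), while the cluster-scoped route Arun demonstrates goes through the cluster's Libraries tab:

# notebook-scoped install: only notebooks running this command see the library
%pip install great-expectations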

I’m hoping that loading libraries in Azure Synapse Analytics will, at some point, be this convenient.

Comments closed

Dotnet-Spark UDFs and Missing Shared State

Ed Elliott uncovers a mystery:

To understand this we need to take a look at how we can create a UDF in .NET that is called by the Java VM Apache Spark code because that is, logically, what happens. In our application we call into Apache Spark and ask it to do things like read from a file, run some transformation, and write files back out again. With UDFs, we ask Spark to run a UDF and Spark comes back to our UDF, passing it some data and asking it to execute, but the Java VM does not understand how to execute .NET code.

Read the whole thing.

Comments closed

Preparing for the Kafka-Zookeeper Breakup

Yeva Byzek prepares us:

As described in the blog post Apache Kafka® Needs No Keeper: Removing the Apache ZooKeeper Dependency, when KIP-500 lands next year, Apache Kafka will replace its usage of Apache ZooKeeper with its own built-in consensus layer. This means that you’ll be able to remove ZooKeeper from your Apache Kafka deployments so that the only thing you need to run Kafka is…Kafka itself.

Kafka’s new architecture provides three distinct benefits. First, it simplifies the architecture by consolidating metadata in Kafka itself, rather than splitting it between Kafka and ZooKeeper. This improves stability, simplifies the software, and makes it easier to monitor, administer, and support Kafka. Second, it improves control plane performance, enabling clusters to scale to millions of partitions. Finally, it allows Kafka to have a single security model for the whole system, rather than having one for Kafka and one for ZooKeeper. Together, these three benefits greatly simplify overall infrastructure design and operational workflows.

Read on to see where this story is at and what kinds of changes you’ll have to make to code.

Comments closed

Using Key Vault in Azure Databricks

Arun Sirpal shows us how easy it is to tie Azure Key Vault into Azure Databricks:

The key vault should always be a core component of your Azure design because we can store keys, secrets, and certificates there, thus abstracting / hiding the true connection string from your files. When working with Databricks to mount storage, to ingest your data and query it, ideally you should be leveraging this to create secrets and secret scopes.
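As a minimal sketch of the payoff (the scope, key, and storage account names below are made up), once a Key Vault-backed secret scope exists, a notebook can pull the secret at mount time instead of embedding it in code:

# read a secret from a Key Vault-backed secret scope (all names here are illustrative)
storage_key = dbutils.secrets.get(scope="my-keyvault-scope", key="storage-account-key")

# use the secret when mounting blob storage; the value is redacted if you try to display it
dbutils.fs.mount(
  source="wasbs://data@mystorageacct.blob.core.windows.net",
  mount_point="/mnt/data",
  extra_configs={"fs.azure.account.key.mystorageacct.blob.core.windows.net": storage_key}
)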

Click through for a demo.

Comments closed

Outlier Identification Using Spark 3.0

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use the SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for the median we can provide ‘array(0.5)’ because we want the 50% value, i.e. the median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers a value needs to be a certain number of MADs away. This number is also referred to as the threshold and defaults to 3.
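A minimal PySpark sketch of that recipe (the column name and sample data are made up; the threshold of 3 follows the quote):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (2.5,), (3.0,), (100.0,)], ["value"])

# median of the column via the percentile SQL expression
median = df.select(F.expr("percentile(value, array(0.5))")[0].alias("m")).first()["m"]

# absolute deviation of each value from the median, then the median of those deviations (MAD)
devs = df.withColumn("abs_dev", F.abs(F.col("value") - F.lit(median)))
mad = devs.select(F.expr("percentile(abs_dev, array(0.5))")[0].alias("m")).first()["m"]

# flag anything more than 3 MADs from the median as an outlier
outliers = devs.filter(F.col("abs_dev") > 3 * mad)
outliers.show()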

Read on for three measures and their implementations in PySpark.

Comments closed

Features and Improvements in Spark 3.0

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With the Spark 3.0 release (in June 2020) there are some major improvements over previous releases. Some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning, and other performance optimizations and enhancements.

Below I’ve listed out these new features and enhancements all together on one page for better understanding and future reference.
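For quick reference, the two headline optimizer features are toggled by configuration (a sketch; in Spark 3.0, AQE ships disabled by default while dynamic partition pruning is enabled by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution: re-optimizes the remaining plan using runtime statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Dynamic Partition Pruning: skips fact-table partitions based on the dimension-side filter
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")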

Click through for the summary.

Comments closed

Intrusion Detection Using ksqlDB

Maxime Ribera and Geraud Duge de Beronville take us through an interesting tutorial:

Apache Kafka® is a distributed real-time processing platform that allows for the ingestion of huge volumes of data. ksqlDB is part of the Kafka ecosystem and offers a SQL-like language to query and process large-scale, real-time data. This blog post demonstrates how to quickly process network activity for intrusion detection using both Kafka and ksqlDB.

For testing purposes (and to avoid being banned from the enterprise network), a virtualized environment through Vagrant is used.

Click through for the scenario.

Comments closed

Adaptive Query Execution in Databricks

MaryAnn Xue and Allison Wang explain how Adaptive Query Execution works with Databricks:

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.

AQE offers a trouble-free solution here by switching to the faster broadcast hash join during execution time.
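As a rough illustration of that runtime switch (the data here is synthetic), when the filtered side of a join turns out to be much smaller than the planner estimated, AQE can replace a planned sort-merge join with a broadcast hash join:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")

fact = spark.range(1000000).withColumnRenamed("id", "customer_id")
dim = (spark.range(1000000).withColumnRenamed("id", "customer_id")
       .withColumn("country", F.when(F.col("customer_id") % 1000 == 0, "IS").otherwise("US")))

# plan-time statistics may overestimate the filtered dimension side;
# after the filter actually runs, AQE sees its true size and can broadcast it instead
joined = fact.join(dim.filter(F.col("country") == "IS"), "customer_id")
joined.explain()  # the adaptive plan can end up as a broadcast hash join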

This is pretty similar to Adaptive Query Processing in SQL Server.

Comments closed

The Spark Starter Guide

Landon Robinson has some good news for us:

If you visit hadoopsters.com/spark or thesparkguide.com, you’ll see something new and exciting from us. It’s official: we’ve written and are publishing a comprehensive guide to Apache Spark.

This guide will be completely online and completely free. A book’s worth of content, containing exercises in Python and Scala to teach you Spark, at your fingertips. Again, free.

Landon has posted chapter 1, section 1 already:

This section introduces the concept of data pipelines – how data is processed from one form into another. It’s also the generic term used to describe how data moves from one location or form, and is consumed, altered, transformed, and delivered to another location or form.

You’ll be introduced to Spark functions like join, filter, and aggregate to process data in a variety of forms. You’ll learn it all through interactive Spark exercises in Scala and Python.
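As a taste of what those look like together in PySpark (a toy sketch, not taken from the guide itself):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 20.0), (2, "books", 35.0), (3, "games", 60.0)],
    ["customer_id", "category", "amount"])
customers = spark.createDataFrame([(1, "Ana"), (2, "Ben"), (3, "Caz")], ["customer_id", "name"])

# join the two DataFrames, filter out small orders, then aggregate per category
result = (orders.join(customers, "customer_id")
                .filter(F.col("amount") > 25)
                .groupBy("category")
                .agg(F.sum("amount").alias("total")))
result.show()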

This is very early in the process but I’m excited.

Comments closed