Press "Enter" to skip to content

Category: Hadoop

Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow; for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility, as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers, including GitHub, GitLab, Bitbucket, Azure DevOps, and AWS CodeCommit.

Read on to see how it works.
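
To make that concrete, here is a minimal sketch, not taken from the announcement, of what defining a Git-backed job could look like against the Jobs API 2.1; the workspace URL, token, repository, cluster ID, and notebook path are all placeholder assumptions:

```python
# A hedged sketch of creating a job whose task source is a remote Git
# reference, via the Databricks Jobs API 2.1. All identifiers below are
# placeholders, not values from the announcement.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "git-backed-etl",
    # git_source points the job at a remote repository and branch instead
    # of a workspace copy of the code.
    "git_source": {
        "git_url": "https://github.com/<org>/<repo>",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "nightly_etl",
            "notebook_task": {
                # Path relative to the repository root; source=GIT ties
                # each run to a commit of the branch above.
                "notebook_path": "notebooks/nightly_etl",
                "source": "GIT",
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": ...}
```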


Improving Join Performance with Skewed Datasets in Spark

Ajay Gupta gets into the topic of join performance:

Performing Joins on Skewed Datasets: A dataset is considered to be skewed for a join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys, for example, when 80% of the records in the dataset contribute to only 20% of the join keys.

Implications of Skewed Datasets for Joins: Skewed datasets, if not handled appropriately, can lead to stragglers in the join stage (read the linked story to learn more about stragglers). This brings down the overall execution efficiency of the Spark job. Also, skewed datasets can cause memory overruns on certain executors, leading to the failure of the Spark job. Therefore, it is important to identify and address join-based stages where large skewed datasets are involved.

Read on for five techniques which may help you out.
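
One of the classic techniques, for a taste, is key salting. Below is a hedged PySpark sketch, with illustrative table and column names, that spreads a hot join key across more partitions by salting the skewed side and replicating the smaller side:

```python
# A sketch of a salted join in PySpark: the skewed side gets a random salt,
# the small side is replicated once per salt value, and the join key becomes
# (join_key, salt). Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
NUM_SALTS = 16  # tune to the observed skew

large_df = spark.table("events")      # skewed on join_key
small_df = spark.table("dim_lookup")  # small enough to replicate

# A random salt on the skewed side spreads a hot key over NUM_SALTS partitions.
salted_large = large_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each small-side row once per salt value.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_small = small_df.crossJoin(salts)

joined = salted_large.join(salted_small, ["join_key", "salt"]).drop("salt")
```

On Spark 3.x, adaptive query execution can handle many of these cases automatically once spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled are set.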


Delta Live Tables and Power BI Data Modeling

Tahir Fayyaz goes from Delta Lake to Power BI:

To get optimal performance from Power BI, it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality, as you need to consider the following:

– How to easily develop and manage the data model’s transformation code.

– How to run and scale data pipelines for the model as data volumes grow.

– How to keep all the Delta Lake tables updated as new data arrives.

– How to view the lineage for all tables as the model gets more complex.

– How to actively stop data quality issues that result in incorrect reports.

Read on for recommendations, a couple of architectural diagrams, and some sample code.
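
For a flavor of what such a pipeline looks like, here is a hedged Delta Live Tables sketch, not the article's code, with a dimension, a fact carrying a data quality expectation, and an aggregate table for Power BI; the source table names are assumptions:

```python
# A minimal Delta Live Tables sketch: dimension, fact, and aggregate tables
# feeding a Power BI star schema. Source table names are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Customer dimension for the Power BI model")
def dim_customer():
    return spark.table("raw.customers").select("customer_id", "name", "region")

@dlt.table(comment="Sales fact table")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
def fact_sales():
    return spark.table("raw.sales").select(
        "sale_id", "customer_id", "amount", "sale_date"
    )

@dlt.table(comment="Pre-aggregated sales for Power BI aggregation tables")
def agg_sales_by_region():
    return (
        dlt.read("fact_sales")
        .join(dlt.read("dim_customer"), "customer_id")
        .groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
    )
```

Because each table is declared in terms of the others, Delta Live Tables tracks the dependencies and lineage between them, which speaks directly to the bullets above.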


Automating Parallelism Decisions in Flink Batch Jobs

Lijie Wang and Zhu Zhu describe Apache Flink’s batch scheduler:

Deciding proper parallelism for operators is not easy work for many users. For batch jobs, too small a parallelism may result in long execution times and large failover regressions, while an unnecessarily large parallelism may result in resource waste and more overhead cost in task deployment and network shuffling.

To decide a proper parallelism, one needs to know how much data each operator needs to process. However, it can be hard to predict the data volume to be processed by a job because it can differ every day, and it can be harder or even impossible (due to complex operators or UDFs) to predict the data volume to be processed by each operator.

To solve this problem, we introduced the adaptive batch scheduler in Flink 1.15. The adaptive batch scheduler can automatically decide the parallelism of an operator according to the size of its consumed datasets.

Read on to see some of the benefits of using the adaptive batch scheduler, as well as some of the decision points it uses along the way.
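
As a rough sketch of enabling it from PyFlink, the configuration might look like the following; the key names follow the Flink 1.15 documentation as best I recall, so verify them against your version:

```python
# A hedged sketch of enabling Flink's adaptive batch scheduler from PyFlink.
# Config keys are assumed per the Flink 1.15 docs; verify for your version.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
conf = t_env.get_config().get_configuration()

# Use the adaptive batch scheduler instead of the default scheduler.
conf.set_string("jobmanager.scheduler", "AdaptiveBatch")
# Operators must leave parallelism unset (-1) for the scheduler to decide it.
conf.set_string("parallelism.default", "-1")
# Bounds and the target data volume per task that drive the decision.
conf.set_string("jobmanager.adaptive-batch-scheduler.min-parallelism", "1")
conf.set_string("jobmanager.adaptive-batch-scheduler.max-parallelism", "128")
conf.set_string("jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task", "1gb")
```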


Unity Catalog in Azure Databricks

Paul Roome, et al., announce data lineage for Unity Catalog:

We are excited to announce that data lineage for Unity Catalog, the unified governance solution for all data and AI assets on the lakehouse, is now available in preview.

This blog will discuss the importance of data lineage, some of the common use cases, our vision for better data transparency and data understanding with data lineage, and a sneak peek into some of the data provenance and governance features we’re building.

Click through to see what it currently supports. My question is whether this and Microsoft Purview will play nicely together in an Azure Databricks setup.


Request-Response and CQRS in Kafka

Kai Waehner compares two message exchange patterns:

How can I do request-response communication with Apache Kafka? That's one of the most common questions I get. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Read on to learn more.
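
The classic way to layer request-response over a log, for reference, is a correlation ID plus a reply topic. Here is a hedged sketch with the kafka-python client; the topic names, serialization, and broker address are assumptions:

```python
# A sketch of request-response over Kafka: the requester stamps each message
# with a correlation id and a reply topic, then waits for a matching reply.
# Topic names and broker address are illustrative.
import json
import uuid
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

correlation_id = str(uuid.uuid4())
# Send the request with routing metadata in the headers.
producer.send(
    "requests",
    value={"action": "get_quote", "symbol": "MSFT"},
    headers=[
        ("correlation_id", correlation_id.encode("utf-8")),
        ("reply_topic", b"replies"),
    ],
)
producer.flush()

# Block (this is the synchronous variant) until a reply with a matching
# correlation id shows up on the reply topic.
consumer = KafkaConsumer(
    "replies",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for msg in consumer:
    headers = dict(msg.headers or [])
    if headers.get("correlation_id") == correlation_id.encode("utf-8"):
        print("reply:", msg.value)
        break
```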


Ingesting Event Hub Telemetry Data with PySpark Streaming

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry events from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hubs.

Most documented implementations of Azure Databricks ingestion from Azure Event Hubs data are based on Scala.

So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hubs instance configured without Event Capture.

Click through for the process.
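
As a hedged sketch of the shape of that ingestion, assuming the azure-event-hubs-spark connector is attached to the cluster and with placeholder connection details, the read side might look like this:

```python
# A sketch of reading Event Hubs telemetry with Structured Streaming on
# Databricks via the azure-event-hubs-spark connector (must be installed).
# The connection string, consumer group, and paths are placeholders.
from pyspark.sql import functions as F

connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;...;EntityPath=<hub>"

eh_conf = {
    # The connector expects the connection string in encrypted form.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default",
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The event payload arrives as binary; cast it to string and parse as needed.
telemetry = raw.select(
    F.col("body").cast("string").alias("body"),
    "enqueuedTime",
)

query = (
    telemetry.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/telemetry")
    .outputMode("append")
    .start("/mnt/delta/telemetry")
)
```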


Processing Security Logs in Databricks with Delta Live Tables

Silvio Fiorito ingests some data:

Databricks recently introduced Workflows to enable data engineers, data scientists, and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Workflows allows users to build ETL pipelines that are automatically managed, including ingestion and lineage, using Delta Live Tables. The benefits of Workflows and Delta Live Tables easily apply to security data sources, allowing us to scale to any volume or latency required for our operational needs.

In this article we’ll demonstrate some of the key benefits of Delta Live Tables for ingesting and processing security logs, with a few examples of common data sources we’ve seen our customers load into their cyber Lakehouse.

Click through to learn more.
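
For a flavor of that kind of ingestion, here is a hedged Delta Live Tables sketch that uses Auto Loader to pick up raw security logs; the path, format, and field names are assumptions rather than the article's code:

```python
# A sketch of ingesting security logs with Delta Live Tables and Auto Loader
# (cloudFiles). The source path, log format, and field names are illustrative.
import dlt

@dlt.table(comment="Bronze table of raw security log files")
@dlt.expect_or_drop("has_event_time", "eventTime IS NOT NULL")
def security_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/security-logs/")
    )

@dlt.table(comment="Silver table with fields parsed for detection queries")
def security_logs_silver():
    return (
        dlt.read_stream("security_logs_bronze")
        .select("eventTime", "eventName", "sourceIPAddress", "userIdentity")
    )
```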


A Transaction Log in Apache Flink

Roman Khachatryan and Yuan Mei deal with transaction log issues:

State backends don’t start any snapshotting work until the task receives at least one checkpoint barrier, increasing the effective checkpoint duration. This is suboptimal if the upload time is comparable to the checkpoint interval; instead, a snapshot could be uploaded continuously throughout the interval.

This work discusses the mechanism introduced in Flink 1.15 to address the above cases by continuously persisting state changes on non-volatile storage while performing materialization in the background. The basic idea is described in the following section, and then important implementation details are highlighted. Subsequent sections discuss benchmarking results, limitations, and future work.

Read on to see what they did.
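
For reference, enabling the changelog state backend is a configuration change. A hedged PyFlink sketch follows; the key names are as I recall them from the Flink 1.15 docs, so verify against your version:

```python
# A hedged sketch of enabling Flink 1.15's changelog state backend from
# PyFlink. Config keys are assumed per the 1.15 docs; storage path is a
# placeholder.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
conf = t_env.get_config().get_configuration()

# Continuously persist state changes to durable storage instead of waiting
# for checkpoint barriers to kick off snapshot uploads.
conf.set_string("state.backend.changelog.enabled", "true")
conf.set_string("state.backend.changelog.storage", "filesystem")
conf.set_string("dstl.dfs.base-path", "s3://<bucket>/changelog")
# Materialization of the consolidated state runs in the background.
conf.set_string("state.backend.changelog.periodic-materialize.interval", "10min")
```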
