Spark – Page 24 – Curated SQL

Generating Artificial Data with Databricks Generator

Published 2021-08-13 by Kevin Feasel

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

1 Comment

How Spark Determines Task Numbers and Parallelism

Published 2021-08-06 by Kevin Feasel

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks and number of tasks to execute in parallel in a job.
Let’s see how Spark decides on the number of tasks with the below set of instructions.
[… instructions]
Let’s also assume dataset_Y has 10 partitions and dataset_Y has 5 partitions.

Click through for the full explanation.

Comments closed

Contrasting Scala and Python wrt Spark

Published 2021-07-20 by Kevin Feasel

Sanjay Rathore contrasts two of the three key Apache Spark languages:

Imagine the first day of a new Apache Spark project. The project manager looks at the team and says: which one to choose, scala or python. So let’s start with “scala vs python for spark”.
You may wonder if this is a tricky question. What does the enterprise demand say? Is this like asking iOS or Android? Is there a right or wrong answer?
So we are here to inform and provide clarity. Today we’re looking at two popular programming languages, Scala and Python, and comparing them in the context of Apache Spark and Big Data in general.

Read on for the comparison. I’m at a point where I think it’s wise to know both languages and roll with whichever is there. If you’re in a greenfield Spark implementation, pick the one you (or your team) is more comfortable with. If you’re equally comfortable with the two, pick Scala because it’s a functional programming language and those are neat.

Comments closed

Spark SQL and Merge Errors from Multiple Source Rows Matched

Published 2021-06-18 by Kevin Feasel

Manoj Pandey explains an error message in Spark SQL:

UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches. Please refer to https://docs.microsoft.com/azure/databricks/delta/delta-update#upsert-into-a-table-using-merge
The above error says that while doing MERGE operation on the Target table there shouldn’t be any duplicates in the Source table. This check is applied implicitly by the SQL engine to avoid unnecessary updates and avoid inconsistent data.

Read on for a reproduction and what you can do to resolve the issue.

Comments closed

Using Spark in CDP’s Operational Database Experience

Published 2021-06-14 by Kevin Feasel

Gokul Kamaraj, et al, take us through using Apache Spark in Cloudera Data Platform’s Operational Database Experience:

Apache Spark is a very popular analytics engine used for large-scale data processing. It is widely used for many big data applications and use cases. CDP Operational Database Experience Experience (COD) is a CDP Public Cloud service that lets you create and manage operational database instances and it is powered by Apache HBase and Apache Phoenix.
To know more about Apache Spark in CDP and CDP Operational Database Experience, see Apache Spark Overview and CDP Operational Database Experience Overview.
Apache Spark enables you to connect directly to databases that support JDBC. When integrating Apache Spark with Apache Phoenix in COD, you can leverage capabilities provided by Apache Phoenix to save and query data across multiple worker nodes, and use SELECT columns and pushdown predicates for filtering.
In this blog post, let us look at how you can read and write data to COD from Apache Spark. We are going to use an Operational Database COD instance and Apache Spark present in the Cloudera Data Engineering experience.

Read on for the process.

Comments closed

Change Data Capture in Delta Lake

Published 2021-06-14 by Kevin Feasel

Surya Sai Turaga and John O’Dwyer take us through change data capture in Delta Lake:

Change data capture (CDC) is a use case that we see many customers implement in Databricks – you can check out our previous deep dive on the topic here. Typically we see CDC used in an ingestion to analytics architecture called the medallion architecture. The medallion architecture that takes raw data landed from source systems and refines the data through bronze, silver and gold tables. CDC and the medallion architecture provide multiple benefits to users since only changed or added data needs to be processed. In addition, the different tables in the architecture allow different personas, such as Data Scientists and BI Analysts, to use the correct up-to-date data for their needs. We are happy to announce the exciting new Change Data Feed (CDF) feature in Delta Lake that makes this architecture simpler to implement and the MERGE operation and log versioning of Delta Lake possible!

Read on to gain an understanding of how it works.

Comments closed

Announcements from Data+AI Summit

Published 2021-06-08 by Kevin Feasel

Ryan Boyd summarizes Databricks announcements:

The Delta Lake open source project is a key enabler of the lakehouse, as it fixes many of the limitations of data lakes: data quality, performance and governance. The project has come a long way since its initial release, and the Delta Lake 1.0 release was just certified by the community. The release represents a variety of new features, including generated columns and cloud independence with multi-cluster writes and my favorite — Delta Lake standalone, which reads from Delta tables but doesn’t require Apache Spark^TM.
We also announced a bunch of new committers to the Delta Lake project, including QP Hou, R.Tyler Croy, Christian Williams, Mykhailo Osypov and Florian Valeye.
Learn more about Delta Lake 1.0 in the keynotes from co-creator and Distinguished Engineer Michael Armbrust.

Read on for a variety of announcements in this vein.

Comments closed

Saving Spark DataFrames as Hive Tables

Published 2021-06-03 by Kevin Feasel

The Hadoop in Real World team shows how you can save Spark DataFrames as Hive tables:

It is a very common use case to process the data in Spark and save the processed data or Spark dataframe directly into a Hive table.
There are a couple ways to achieve this.

Click through for those techniques.

Comments closed

Securing Databricks on AWS

Published 2021-05-27 by Kevin Feasel

Andrew Weaver, et al, take us through security practices for running Databricks on AWS:

In this article, we will share a list of cloud security features and capabilities that an enterprise data team can use to harden their Databricks environment on AWS as per their risk profile and governance policy. For more information about how Databricks runs on Amazon Web Services (AWS), view the AWS web page and Databricks security on AWS page for more specific details on security and compliance.

Click through for that list.

Comments closed

Azure Synapse Analytics Supports Apache Spark 3.0

Published 2021-05-26 by Kevin Feasel

Euan Garden has some great news for us:

Starting today, the Apache Spark 3.0 runtime is now available in Azure Synapse. This version builds on top of existing open source and Microsoft specific enhancements to include additional unique improvements listed below. The combination of these enhancements results in a significantly faster processing capability than the open-source Spark 3.0.2 and 2.4.
The public preview announced today starts with the foundation based on the open-source Apache Spark 3.0 branch with subsequent updates leading up to a Generally Available version derived from the latest 3.1 branch.

It still won’t be as fast as Databricks, but it should be a good bit faster than the Spark 2 they were running.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark