Category: Spark

Databricks Serverless SQL

Nikhil Jethava and Kevin Clugage announce serverless SQL on Databricks:

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution.

Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries.

Things are getting interesting between Databricks and Azure Synapse Analytics, as both now have serverless SQL and Spark offerings. Synapse Analytics has the better implementation for serverless SQL and Databricks the superior Spark implementation, so it becomes a question of which weakness you take in order to gain the strength.

Databricks Autologging

Corey Zumar and Kasey Uhlenhuth announce a new product:

Machine learning teams require the ability to reproduce and explain their results–whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid prototyping. As a result, data teams encounter three primary challenges when recording this information: (1) standardizing machine learning artifacts tracked across ML teams, (2) ensuring reproducibility and auditability across a diverse set of ML problems and (3) maintaining readable code across many logging calls.

Read on to see how Databricks Autologging can address these challenges.

Generating Artificial Data with Databricks Generator

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

How Spark Determines Task Numbers and Parallelism

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks and number of tasks to execute in parallel in a job.

Let’s see how Spark decides on the number of tasks with the below set of instructions.

[… instructions]

Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.

Click through for the full explanation.
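
As a rough illustration of the arithmetic involved, here is a minimal sketch in plain Python. It assumes the common rule of thumb that a stage gets one task per partition and that the cluster runs at most one task per executor core at a time (that is, spark.task.cpus left at its default of 1). The function name and the executor counts are hypothetical and are not taken from the linked post.

```python
import math
from typing import Tuple


def estimate_waves(num_partitions: int, num_executors: int,
                   cores_per_executor: int) -> Tuple[int, int]:
    """Return (parallel task slots, waves needed) for a single stage.

    Assumption: one task per partition, one task per executor core.
    """
    if min(num_partitions, num_executors, cores_per_executor) < 1:
        raise ValueError("all arguments must be positive integers")
    # Tasks that can run at the same time across the whole cluster.
    slots = num_executors * cores_per_executor
    # Each wave runs up to `slots` tasks; the last wave may be partial.
    waves = math.ceil(num_partitions / slots)
    return slots, waves


# A 10-partition dataset on a hypothetical cluster of 2 executors x 2 cores.
slots, waves = estimate_waves(num_partitions=10, num_executors=2,
                              cores_per_executor=2)
print(f"{slots} tasks in parallel, {waves} waves to finish the stage")
```

Under these assumptions, adding cores shortens the stage only until there is a slot for every partition; beyond that point the extra cores sit idle for that stage.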

Contrasting Scala and Python wrt Spark

Sanjay Rathore contrasts two of the three key Apache Spark languages:

Imagine the first day of a new Apache Spark project. The project manager looks at the team and asks: which one should we choose, Scala or Python? So let’s start with “Scala vs. Python for Spark”.

You may wonder if this is a tricky question. What does the enterprise demand say? Is this like asking iOS or Android? Is there a right or wrong answer?

So we are here to inform and provide clarity. Today we’re looking at two popular programming languages, Scala and Python, and comparing them in the context of Apache Spark and Big Data in general.

Read on for the comparison. I’m at a point where I think it’s wise to know both languages and roll with whichever is there. If you’re in a greenfield Spark implementation, pick the one you (or your team) are more comfortable with. If you’re equally comfortable with the two, pick Scala because it’s a functional programming language and those are neat.

Spark SQL and Merge Errors from Multiple Source Rows Matched

Manoj Pandey explains an error message in Spark SQL:

UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches. Please refer to https://docs.microsoft.com/azure/databricks/delta/delta-update#upsert-into-a-table-using-merge

The above error says that when performing a MERGE operation on the target table, no more than one source row may match any given target row. This check is applied implicitly by the SQL engine to avoid unnecessary updates and inconsistent data.

Read on for a reproduction and what you can do to resolve the issue.
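
The error text itself suggests preprocessing the source so that each target row can match at most one source row. The sketch below shows that idea in plain Python rather than Spark: it keeps only the most recent row per key. The column names (id, value, updated_at) and the helper name are hypothetical, and this is not necessarily the resolution the linked post uses. In Spark SQL the same effect is commonly achieved with a ROW_NUMBER() window or an aggregate applied to the source before the MERGE.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def latest_row_per_key(rows: Iterable[Row], key: str, order_by: str) -> List[Row]:
    """Keep one row per key: the one with the greatest `order_by` value."""
    best: Dict[Any, Row] = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only if this one is strictly newer.
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())


# Hypothetical source rows: id 1 appears twice, which would trip the MERGE check.
source = [
    {"id": 1, "value": "a", "updated_at": 1},
    {"id": 1, "value": "b", "updated_at": 2},
    {"id": 2, "value": "c", "updated_at": 1},
]

deduped = latest_row_per_key(source, key="id", order_by="updated_at")
print(deduped)
```

Which row to keep is a business decision. "Latest wins" is only one option, and ties on the ordering column here resolve to whichever row was seen first.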

Using Spark in CDP’s Operational Database Experience

Gokul Kamaraj, et al., take us through using Apache Spark in Cloudera Data Platform’s Operational Database Experience:

Apache Spark is a very popular analytics engine used for large-scale data processing. It is widely used for many big data applications and use cases. CDP Operational Database Experience (COD) is a CDP Public Cloud service that lets you create and manage operational database instances, and it is powered by Apache HBase and Apache Phoenix.

To know more about Apache Spark in CDP and CDP Operational Database Experience, see Apache Spark Overview and CDP Operational Database Experience Overview.

Apache Spark enables you to connect directly to databases that support JDBC. When integrating Apache Spark with Apache Phoenix in COD, you can leverage capabilities provided by Apache Phoenix to save and query data across multiple worker nodes, and use SELECT columns and pushdown predicates for filtering. 

In this blog post, let us look at how you can read and write data to COD from Apache Spark. We are going to use a COD instance and the Apache Spark present in the Cloudera Data Engineering experience.

Read on for the process.

Change Data Capture in Delta Lake

Surya Sai Turaga and John O’Dwyer take us through change data capture in Delta Lake:

Change data capture (CDC) is a use case that we see many customers implement in Databricks – you can check out our previous deep dive on the topic here. Typically we see CDC used in an ingestion to analytics architecture called the medallion architecture. The medallion architecture takes raw data landed from source systems and refines the data through bronze, silver and gold tables. CDC and the medallion architecture provide multiple benefits to users since only changed or added data needs to be processed. In addition, the different tables in the architecture allow different personas, such as Data Scientists and BI Analysts, to use the correct up-to-date data for their needs. We are happy to announce the exciting new Change Data Feed (CDF) feature in Delta Lake that makes this architecture simpler to implement and the MERGE operation and log versioning of Delta Lake possible!

Read on to gain an understanding of how it works.

Announcements from Data+AI Summit

Ryan Boyd summarizes Databricks announcements:

The Delta Lake open source project is a key enabler of the lakehouse, as it fixes many of the limitations of data lakes: data quality, performance and governance. The project has come a long way since its initial release, and the Delta Lake 1.0 release was just certified by the community. The release represents a variety of new features, including generated columns, cloud independence with multi-cluster writes and, my favorite, Delta Lake standalone, which reads from Delta tables but doesn’t require Apache Spark™.

We also announced a bunch of new committers to the Delta Lake project, including QP Hou, R.Tyler Croy, Christian Williams, Mykhailo Osypov and Florian Valeye.

Learn more about Delta Lake 1.0 in the keynotes from co-creator and Distinguished Engineer Michael Armbrust.

Read on for a variety of announcements in this vein.