Press "Enter" to skip to content

Category: Spark

Comparing Databricks to Synapse Spark Pools

Corrinna Peters makes comparisons:

There are different cases for using both depending on the specific needs and requirements, Synapse and Databricks are similar, but both have their own areas of specialities or rather areas where they are above the other.

Data Lake – they both allow you to query the data from the data lake, Synapse uses either the SQL on demand pool or Spark and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominately a SQL user and prefer the code and the BI developer feel then Synapse would be the correct choice whereas if you are a Data Scientist and prefer to code in Python or R then Databricks would feel more at home.

Read on for a nuanced take. My less nuanced take is, Databricks beats the pants off of Synapse Spark pools in terms of performance. Synapse has a much better overall ecosystem, expanding beyond Spark and into T-SQL (in two flavors) and log/event analytics with KQL. If you’re spending 100% of your time in Spark and don’t care about the rest, use Databricks; if Spark is a relatively small part of your warehousing work, use Synapse.

1 Comment

Azure Databricks Security Considerations

Craig Porteous provides some advice on configuring Azure Databricks:

Azure Databricks is an analytics platform and often serves as the central compute component of a data platform, to process ETL/ELT data pipelines and data science workloads. As Databricks is a third-party platform-as-a-service offering securing it works differently to most other first-party services in Azure; for example, we can’t use private endpoints. (More on these in the Azure Storage post)

The two main approaches to working with Databricks in our secure platform are VNet Peering or VNet Injection

Click through to learn the difference between these two, as well as a few other factors to keep in mind as you’re deploying Databricks.

Comments closed

From Confluent Cloud into Azure Synapse Analytics

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall, is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synched to dedicated SQL pools via the fully managed Synapse sink connector and attached to Synapse Analytics workspace. Once added to the Synapse Analytics workspace, analysts have the ability to perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.

Comments closed

Right to Be Forgotten in Delta Lake

Milos Colic, et al, tackle a tricky problem:

With Delta, we have one more tool at our disposal to address GDPR compliance and, in particular, “the right to be forgotten” – VACUUM. Vacuum operation removes the files that are no longer needed and that are older than a predefined retention period. The default retention period is 30 days to align with GDPR definition of undue delay. Our earlier blog on a similar topic explains in detail how you can find and delete personal information related to a consumer by running two commands:

The part I’m finding tricky here is, how does this handle “time travel” scenarios in which you’re looking at prior iterations of data? I haven’t run through all of the scenarios so this is just speculation, but it seems that even with all of these changes, you’d still have to worry about historical data containing that sensitive information.

Comments closed

Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter the dataset is split into multiple files to reflect the number of RDD partitions in the dataframe when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).

Comments closed

Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.

Comments closed

Building a Recommender in Spark

Avinash Sooriyarachchi makes a recommendation:

There has been an exponential increase in the volume and variety of data at our disposal to build recommenders and notable advances in compute and algorithms to utilize in the process. Particularly, the means to store, process and learn from image data has dramatically increased in the past several years. This allows retailers to go beyond simple collaborative filtering algorithms and utilize more complex methods, such as image classification and deep convolutional neural networks, that can take into account the visual similarity of items as an input for making recommendations. This is especially important given online shopping is a largely visual experience and many consumer goods are judged on aesthetics.

In this article, we’ll change the script and show the end-to-end process for training and deploying an image-based similarity model that can serve as the foundation for a recommender system. Furthermore, we’ll show how the underlying distributed compute available in Databricks can help scale the training process and how foundational components of the Lakehouse, Delta Lake and MLflow, can make this process simple and reproducible.

Click through for the process.

Comments closed