
Day: October 18, 2021

Creating Delta Lake Tables in Azure Databricks

Gauri Mahajan takes us through creating new tables in a Delta Lake using Azure Databricks:

Delta Lake is an open-source data format that provides ACID transactions, data reliability, query performance, data caching and indexing, and many other benefits. Delta Lake can be thought of as an extension of existing data lakes and can be configured per the data requirements. Azure Databricks has Delta Engine as one of the core components that facilitates the Delta Lake format for data engineering and performance. The Delta Lake format is used to create modern data lake or lakehouse architectures. It is also used to build a combined streaming and batch architecture popularly known as lambda architecture.

Click through for the process.
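
As a rough idea of what creating such a table involves, here is a minimal Python sketch that builds a `CREATE TABLE ... USING DELTA` statement. It is not taken from Gauri's article: the helper name, the `sales.orders` table, and its columns are hypothetical, and the sketch assumes you would hand the resulting string to `spark.sql` in a Databricks notebook.

```python
def build_create_delta_table_sql(table_name, columns, location=None):
    """Build a CREATE TABLE ... USING DELTA statement as a string.

    table_name: target table name (hypothetical in this example).
    columns: dict mapping column name to Spark SQL type.
    location: optional storage path for an external table.

    Note: inputs are interpolated as-is, so only pass trusted values.
    """
    column_list = ", ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    statement = (
        f"CREATE TABLE IF NOT EXISTS {table_name} ({column_list}) USING DELTA"
    )
    if location is not None:
        statement += f" LOCATION '{location}'"
    return statement


# Hypothetical table definition, for illustration only.
sql = build_create_delta_table_sql(
    "sales.orders",
    {"order_id": "BIGINT", "order_date": "DATE", "amount": "DECIMAL(10, 2)"},
)
print(sql)

# In a Databricks notebook, where a SparkSession named `spark` already exists,
# the statement could then be run with: spark.sql(sql)
```

Building the statement separately from running it keeps the example runnable without a cluster; the linked article covers the actual Databricks workflow.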

Contrasting Kafka with Azure Service Bus

Ritam Das explains the differences between Apache Kafka and Azure Service Bus:

It is important to note that Azure Service Bus is a traditional message broker and tailored to somewhat different use cases when compared to Kafka. Simply transferring between these two technologies is not an easy feat and would require overhauling your entire application. The comparison stops at both technologies being message brokers as under the hood they are fundamentally different.

At a high level, ASB has high processing overhead per message, stronger guarantees around delivery and processing, and typically a “process once” model. Kafka has low overhead processing per message, fewer guarantees around delivery and processing, and typically a “publish once, process multiple times” model. To provide an explicit comparison, it would be best to understand the intended use case and proceed from there.

Read on to understand the best uses for each technology, as well as sample calls using Python.
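
To make the “process once” versus “publish once, process multiple times” distinction concrete, here is a toy, in-memory Python sketch. It uses neither the Kafka nor the Azure Service Bus client libraries; the `QueueBroker` and `LogBroker` classes and the consumer group names are hypothetical, and the sketch ignores real features such as Service Bus peek-lock settlement, topics and subscriptions, and Kafka partitions.

```python
from collections import deque


class QueueBroker:
    """Toy 'process once' queue: a message is removed when it is received."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Return and remove the next message, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None


class LogBroker:
    """Toy 'publish once, process multiple times' log.

    Messages stay in the log; each consumer group tracks its own offset.
    """

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        self._log.append(message)

    def poll(self, group):
        # Return the next unread message for this group, or None if caught up.
        offset = self._offsets.get(group, 0)
        if offset >= len(self._log):
            return None
        self._offsets[group] = offset + 1
        return self._log[offset]


def drain(read):
    """Call read() until it returns None and collect the results."""
    results = []
    while (item := read()) is not None:
        results.append(item)
    return results


queue = QueueBroker()
log = LogBroker()
for order in ["order-1", "order-2"]:
    queue.send(order)
    log.publish(order)

# Queue model: whoever reads first consumes the messages.
billing_from_queue = drain(queue.receive)
shipping_from_queue = drain(queue.receive)

# Log model: each group reads the full stream independently.
billing_from_log = drain(lambda: log.poll("billing"))
shipping_from_log = drain(lambda: log.poll("shipping"))

print(billing_from_queue, shipping_from_queue)
print(billing_from_log, shipping_from_log)
```

In the queue model the second reader finds nothing left, while in the log model both groups see every message, which is the trade-off the quoted passage describes.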

Architecting a Jenkins Replacement

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps explain some of the “why” behind the solution.

Moving Files Associated with Availability Groups

Eitan Blumin has a doozy of a short script:

Today, I’m sharing with you a cool PowerShell script that basically implements the methodology necessary to move database files to a new location in AlwaysOn Availability Groups, without breaking HADR.

It’s based on a few very useful step-by-step guides on the topic such as this one and this one and this one. But it takes it a step further by being a single cohesive PowerShell script that does everything end-to-end.

Well… Almost everything… The only thing it’s missing is somehow disabling any SQL Agent jobs that may be performing backups. I still haven’t figured out how to possibly automate such a thing, so you’d have to do that manually on your own.

Click through for instructions, notes, and warnings, as well as the script itself.

Things You Might Not Need in SQL Server

Erik Darling has two posts of a similar theme. First up is that you might not need to offload reads:

Duplicating data for reporting, outside of moving it to a data warehouse where there’s some transformations involved, can be an expensive and perilous task.

Your options come down to a native solution like AGs, Replication, or Log Shipping. You can brew something up yourself that relies on native stuff too, like Change Data Capture, Change Tracking, Temporal Tables, or triggers.

Erik’s suggestion here is that appropriate query tuning (and I’ll add proper database design!) does more for you than scaling out.

Then, Erik takes it one step further and recommends against certain features in SQL Server:

Consulting gives you a lot of opportunities to talk to a lot of people and deal with interesting issues.

Recently it occurred to me that a lot of people seem to confer magic button status to a lot of things that always seem to be If-I-Could-Only-Do-This features that would solve all their problems, and similarly a Thing-That-Solved-One-Problem-Once turned into something that got used everywhere.

I do agree with Erik on partitioning (which makes administration easier but usually only helps with read queries on columnstore indexes with huge numbers of rows), In-Memory OLTP (which could have been an incredible feature if it worked as I’d hoped), dirty reads, and over-use of recompilation. For sufficiently busy environments, I disagree with Erik’s take on fill factor, having been convinced by Jeff Moden that there’s a lot of value in well-thought-out fill factor settings, but to make the most of it requires more knowledge of the data than DBAs typically take the time to learn.
