Press "Enter" to skip to content

Month: July 2022

Summarizing Data & AI Summit Announcements

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.

There were some announcements of things to come with Delta too, such as a optimised deletes and updates by removing single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.

Comments closed

Databases, Applications, and Source Control Repos

Eitan Blumin asks and answers a question:

Following the rise in popularity of DevOps for Databases, many interesting questions are being asked on the topic.

One of these questions is: Should your SQL Database project be in the same source control repository and solution as the App code project? Or maybe they should be in the same repository but separate solutions? Or maybe they should be in completely separate repositories?

Pre-registering my answer here: for most organizations, databases should be in a separate repository. The deployment cadence is different, the deployment mechanism is different, and the people working on each likely differ. Read on for Eitan’s thoughts, which get into more of the nuance behind the answer.

Comments closed

Azure Data Explorer Web Updates

Michal Bar has a few updates to the Azure Data Explorer web tool:

We are focused on continuously improving the results exploration experience in ADX web UI, to make it easy and intuitive. Our goal is to provide an easy-to-use UI so that you will not be required to re-write KQL queries in order to perform light-weight data exploration.

Click through to see how you can search and filter within the results pane (something I’d like to see in other Microsoft data platform tools like SSMS), create series panels on charts from KQL, and more.

Comments closed

Filling in GitHub Repo Details

Kevin Chant practices GitHub hygeine:

To clarify, GitHub hygiene is a term that I use to describe the practice of keeping GitHub repositories healthy.

Some of you have probably noticed I have been doing this more recently. With this in mind, I thought I would share what I have been doing in this post for a couple of reasons.

First of all, to help raise awareness about some of the best practices I have been doing.

Secondly, because I am interested to get feedback from other members of the Microsoft Data Platform community about this. For example, do you also follow the same practices?

This is a reminder that there’s a lot you can include in a GitHub repo aside from the code itself.

Comments closed

More Notes on Parameter Sensitive Plan Optimization

Erik Darling continues a series on Parameter Sensitive Plan Optimization. First up is a missed opportunity:

If we re-run the procedure from the section up above to search for VoteTypeId 2 a second time, Memory Grant Feedback will fix the spill at the Hash Join, and bring the total execution time down to about 15 seconds.

That is an improvement, but… Look at the plan here. If VoteTypeId 2 uses a plan more suited to the number of rows it has to process, the overall time is around 4 seconds, with no need for a memory grant correction.

Second, Erik asks the pressing questions:

Under more favorable circumstances, dynamic SQL gets run, executed, and plans cached and reused with the same frequency as stored procedures.

Now, dynamic SQL isn’t exactly the same as stored procedures. There’s a lot you can do with those that just looks a mess in dynamic SQL, especially longer bits of code.

In today’s post, we’re going to look at how the Parameter Sensitive Plan (PSP) optimization works with dynamic SQL.

Read on to see if PSP works with dynamic SQL.

Comments closed

Tools for a Jump Box

Tracy Boggiano looks in the tool bag:

Having recently taken a new job and introducing a number of new tools to my new coworker I thought I’d share how I setup my jump box to and keep it updated so others can benefit, and I can find it later (I did put this in our internal Confluence pages, but I do have a box in Azure for presentations). In alphabetical order because that’s the only way to make sense of things.  I’d be curious if anybody has anything they use that I should add, so please leave comments.

Read on to see the list, including all of the Azure Data Studio and Visual Studio Code extensions Tracy likes to use.

Comments closed

SET, SELECT, and Variable Assignment

Chad Callihan obliquely reminds us to create those unique constraints (by way of unique indexes):

Did you know there is more than one way to set a variable in SQL Server? You can actually set a variable without using “SET” at all. Let’s look at an example that shows how setting a variable with SELECT can cause a headache when dealing with identical values.

Click through to see the problem in action. One way around this if you do know you are dealing with duplicates and need a specific one is to SELECT TOP(1) with an appropriate ORDER BY clause, just as you would if variable assignment weren’t on the table.

Comments closed

Locking Mechanisms in Apache Hive

Shobika Selvaraj documents lock types in Apache Hive and what commands acquire which types:

In the Shared lock there are two types one is Shared_read and Shared_write. Shared_read means anyother shared_read and shared_write query can run at a time. Shared_write lock which means any other shared_read can be performed but no shared_write lock can acquire at that time.

In Exclusive locks no shared_read or shared_write can perform at the same time.

There are three types of lock state:

   (a) Acquired – transaction initiator hold the lock
   (b) Waiting – transaction initiator is waiting for the lock
   (c) Aborted – the lock has timed out but has not yet been cleaned

I was a bit surprised about inserts being shared read but that’s not a typo in the table—Shobika brings receipts.

Comments closed

Stacked Bar Charts

Alex Velez takes us through stacked bar charts:

A few years ago, we posted a question on this blog that is as relevant today as it was years ago: “Is there a good use case for a stacked bar chart?” 

Stacked bars are everywhere; you’ve likely seen them in a recent report, a dashboard, or in the media. Despite their prevalence, they are commonly both misused and misunderstood. In this guide, we’ll aim to rectify these mishaps by sharing examples, clarifying when you should (and shouldn’t) use a stacked bar chart, and discussing best practices for stacking bars. 

Read on for plenty of good advice around when to use stacked (either regular stacked bar charts or 100% stacked), horizontal vs vertical, and how to format them when it does make sense to drop one in.

Comments closed

Date-Time Binning in Cosmos DB

Hasan Savran bins some data:

I wrote about the Date_Bucket() function in SQL Server a couple weeks ago. Azure Cosmos DB team announced the same functionality with a different name DateTimeBin() function. It works exactly the same with the Date_Bucket() function of SQL Server.

     Cosmos DB version of the function has the same number of parameters. The order is different. All the datatime parameters must be in ISO 8601 format (YYYY-MM-DDThh:mm:ss.fffffffZ)

Read on to see how it works.

Comments closed