Press "Enter" to skip to content

Curated SQL Posts

Formatting Binary LSN Values

Michael J. Swart does a bit of shuffling:

Typically as developers, we don’t care about these values. But when we do want to dig into the transaction log, we can do so with sys.fn_dblog which takes two optional parameters. These parameters are LSN values which limit the results of sys.fn_dblog. But the weird thing is that sys.fn_dblog is a function whose LSN parameters are NVARCHAR(25).

The function sys.fn_dblog doesn’t expect binary(10) values for its LSN parameters, it wants the LSN values as a formatted string, something like: 0x00000029:00001a3c:0002.

Never fear, though: Michael’s got us covered. Click through for a conversion function.
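
To make the conversion concrete, here is a minimal T-SQL sketch along the same lines (not Michael’s actual function, and the sample LSN value is made up): slice the binary(10) into its three parts and render each as hex.

DECLARE @lsn binary(10) = 0x0000002900001A3C0002;  -- hypothetical LSN value

SELECT CONCAT(
    '0x',
    CONVERT(varchar(8), SUBSTRING(@lsn, 1, 4), 2), ':',  -- VLF sequence number
    CONVERT(varchar(8), SUBSTRING(@lsn, 5, 4), 2), ':',  -- log block offset
    CONVERT(varchar(4), SUBSTRING(@lsn, 9, 2), 2)        -- slot number
) AS formatted_lsn;  -- returns 0x00000029:00001A3C:0002

Style 2 of CONVERT renders varbinary as hex digits without the 0x prefix, which is why only the leading segment needs one added.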

Creating Identity Columns in Databricks

Franco Patano generates some identity integers:

Identity columns solve the issues mentioned above and provide a simple, performant solution for generating surrogate keys. Delta Lake is the first data lake protocol to enable identity columns for surrogate key generation.

Delta Lake now supports creating IDENTITY columns that can automatically generate unique, auto-incrementing ID numbers when new rows are loaded. While these ID numbers may not be consecutive, Delta makes the best effort to keep the gap as small as possible. You can use this feature to create surrogate keys for your data warehousing workloads easily.

This is a bit light on explanation, unfortunately. With distributed systems, generating identities is historically tricky (especially with several independent nodes generating values), so I’d be curious to see how it works: do they allocate blocks of IDs to worker nodes or do something else? And are the IDs guaranteed to be monotonically increasing? Or is there some other service which “labels” the data upon insert and provides those IDs?
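
For reference, the syntax itself is straightforward. Here is a minimal sketch using the documented GENERATED ... AS IDENTITY clause; the dimension table and its columns are hypothetical:

CREATE TABLE dim_customer (
  customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
  customer_name STRING
) USING DELTA;

-- customer_sk is populated automatically: unique and increasing, but not guaranteed consecutive
INSERT INTO dim_customer (customer_name) VALUES ('Contoso'), ('Fabrikam');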

Understanding Decision Trees

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.

The objective of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from training data.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements which leads to a conclusion. The nice part about decision trees is that once you understand how they work, you’re halfway to understanding gradient boosting (e.g., XGBoost) and random forests.
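
To make that if-else framing concrete, here is a toy sketch of a trained tree expressed as nested CASE logic in SQL (the weather_observations table, columns, and thresholds are all hypothetical):

SELECT CASE
         WHEN outlook = 'Sunny' THEN
           CASE WHEN humidity > 75 THEN 'No' ELSE 'Yes' END  -- internal node splitting on humidity
         WHEN outlook = 'Rain' THEN
           CASE WHEN windy = 1 THEN 'No' ELSE 'Yes' END      -- internal node splitting on wind
         ELSE 'Yes'                                          -- leaf node: overcast days
       END AS predicted_class
FROM weather_observations;

Training a tree is essentially the process of choosing those split columns and thresholds automatically from the data.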

Date Arithmetic in KQL

Robert Cain continues a series on KQL:

Performing DateTime arithmetic in Kusto is very easy. You simply take one DateTime data type object and apply standard math to it, such as addition, subtraction, and more. In this post we’ll see some examples of the most common DateTime arithmetic done when authoring KQL.

Read on for several examples of how it all works.
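
As a taste of what Robert covers, here are a few representative operations in KQL (my sketch, not his exact examples):

// Adding a timespan, subtracting two datetimes, and binning to the hour
print tomorrow   = now() + 1d,                    // add one day
      elapsed    = now() - datetime(2024-01-01),  // subtraction yields a timespan
      hourBucket = bin(now(), 1h)                 // round down to the nearest hour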

Azure Functions and Azure Database Options

Sarah Dutkiewicz continues a series on learning Azure. First up, Azure Functions:

Azure Functions are not something you’ll see rendered on a front-end somewhere. They’re a serverless solution used for doing things in the back-end and the middle tier. 

After that, Sarah touches on database options:

There are many databases on Azure – including relational data in Azure SQL, NoSQL with Azure Cosmos DB, and even some popular databases in the open source realm such as MySQL and PostgreSQL. These are just a few of the data stores available. Check this page of Azure Databases for a matrix of the databases available compared by their features.

Click through for quite a few links and information on when to use what.

Neo4j Imports and Case Sensitivity

Steve Jones is getting me in a ranting mood:

I kept editing the file and trying different things. I compared what I had locally with what was on GitHub. Eventually, I realized this is the issue:

{employeeID:row.EmployeeID}

In the GitHub csv, the first row has headers with EmployeeID. In my local file, the header is “employeeID” (lower case). As soon as I edited this, it worked.

Case sensitivity is a big historical mistake.
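
For context, that property mapping lives inside a LOAD CSV statement along these lines (a hedged reconstruction, not Steve’s exact script). The key point: row.EmployeeID must match the CSV header’s casing exactly, while the node property name on the left is whatever you choose.

// Cypher: load employees from a CSV and map a header column to a node property
LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
CREATE (e:Employee {employeeID: row.EmployeeID});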

Theming and Contrast Adjustments for Diffify

Tim Brock continues a series on theming diffify:

It’s difficult to design a website that is “just right” for everyone. For instance, while reds and greens can be difficult to discern for some dichromats and anomalous trichromats, most trichromats have no such problem (peak daylight sensitivity lies in the yellow part of the spectrum, between red and green). Moreover, these colours also have common cultural semantics (though these do, of course, vary by culture). We also care about aesthetics.

Because of this conflict and more besides, we decided the best approach to making the site more accessible was through “theming”. 

Click through to see what this entails.

Serverless Compute for Databricks SQL

Nikhil Jethava and Shankar Sivadasan make an announcement:

We are excited to announce the preview of Serverless compute for Databricks SQL (DBSQL) on Azure Databricks. DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless compute for DBSQL helps address challenges customers face with cluster startup time, capacity management, and infrastructure costs.

Click through for more details and a short video. Azure Synapse Analytics and Databricks are definitely going head-to-head in the modern data warehousing space and I’m fine with that—hopefully it makes both products better as a result.

Multi-Developer Power BI Development

Reza Rad architects a solution for multiple developers working on a Power BI project:

Before I start explaining the architecture, it is important to understand the challenge and think about how to solve it. The default usage of Power BI involves getting data imported into the Power BI data model and then visualizing it. Although there are other modes and other connection types, importing data is the most popular option. However, there are some challenges with a model and a PBIX file that has everything in one file. Here are some:

– Multiple developers cannot work on one PBIX file at the same time. Multi-Developer issue.

– Integrating the single PBIX file with another application or dataset would be very hard. High Maintenance issue.

– All data transformations happen inside the model, so the refresh time would be slower.

– The only way to expand visualization would be by adding pages to the model, and you will end up with hundreds of pages after some time.

– Every change, even a small change in the visualization, means deploying the entire model.

– Creating a separate Power BI file that references parts of this model would not be possible; as a result, you would need to make a lot of duplicates and face high maintenance issues again.

– If you want to re-use some of the tables and calculations of this file in other files in the future, it won’t be easy to maintain when everything is in one file.

– And many other issues.

After laying out all of the challenges, Reza puts together a plan to resolve them.
