Press "Enter" to skip to content

Month: June 2022

Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colorsdisplays the nicknamecolor and age of the pets. We went from a normalized dataset where common & recurring values weresubstituted for numeric representation s— to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.

Comments closed

Cancelling a Cosmos DB Query

Hasan Savran pushes the big red button:

Sometimes you go to a website and the page does not want to open for whatever reason. You might just click Stop and move on to another page. The stop button does not really stop the request, the server still tries to complete the request and send a response to the client even client does not exist anymore. Rather than clicking on Stop maybe you click Refresh which triggers another request when the first one is not completed yet.

      This scenario applies to all database calls. It might take longer to run a query for a reason and simply you need to wait until you get a response from the database server. If it takes too much time, you might want to cancel the request and try to look at your query to make it faster. You can do that programmatically by using CancellationTokens.

Read on for more information about cancellation tokens and how you can use them.

Comments closed

Inferring Data from Its Absence

John Cook lays out an important insight:

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

Read the whole thing. This also leads to a swath of security attacks based around unions of information in which each query may data only when X number of people are in it (to prevent us narrowing down to one person) but based on some information I know about the person, I can write a combination of queries to elicit more info about that person. As an example, if I know that a person is left-handed (1/9 of the population), has red hair (around 2% of people), etc., I can find ways to combine these traits to make sure no individual query returns fewer than X results but I can have reasonably high confidence that I can get the individual with enough queries.

Comments closed

Expressive T-SQL with Equivalent Performance

Aaron Bertrand shows off one of the reasons I like SQL so much:

As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.

A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.

The fundamental insight here is that SQL is a 4th generation language, otherwise known as a declarative language: we tell the interpreter what we want and let it determine a path to get us there. By contrast, 3rd generation languages like C are imperative languages: it does what we tell it to do, no more, no less (until the compiler gets in there and re-writes our code to make it better…but these are academic ideas languages approach, not hard-and-fast mandates). There are benefits and drawbacks to either language depending on how creative you are and how good the interpreter/compiler is.

Comments closed

Unit Testing ADX Functions

David Giard builds some tests:

Our application contains many functions that return data stored in Azure Data Explorer (ADX). We wrote these functions in Kusto Query Language (KQL) and each function returns a set of data based on the arguments passed. Although developers tested these functions as they wrote them, we needed a way to validate that the functions continued to work as the code and the data changed.

Automated Unit testing is an essential part of any application development life cycle. It validates that code works properly and minimizes the risk that future code changes will break existing functionality.

In this article, I will discuss the approach we took in automating the testing of ADX functions.

Click through to see how to use the assert() function and build some tests.

Comments closed

Language Translation via Power BI Field Parameters

Gerhard Brueckl shows off a great use of Power BI Field Parameters:

The current approaches when it comes to data and value translations are more workarounds than actual solutions. They probably work fine for small data models and very specific use-cases but usually fall short in performance, usability or maintainability when implemented on a larger scale enterprise models.

The recently introduced Field Parameters in Power BI give us a bit more flexibility here and another potential solution to implement data and value translations in Power BI.

Click through for an example which shows data in English, Spanish, and French.

Comments closed

Defining Data Products Down

Paul Andrew plays Papers, Please! with data mesh:

The reasons for calling out what should not be declared as a data product can take many different forms and in some cases even result in anti-patterns that should be addressed as technical debt etc. Governanace and compliance has always been a key part of successful platform delivers, but who/how do we go about policing this both technically and in the context of people/process? Maybe ‘Data Product Police’ could be a thing! 

Read on to see how to spot a data pretender in a field of data products.

Comments closed

Consuming an Azure ML AutoML Model in Excel

Lewis Prince needs to do some heavy lifting in Excel:

It has come back to my turn to write a blog post, and if you remember my previous one concerned why you should use Azure based AutoMl and subsequently how to do so. If you followed that then you will be left with a model of which you’ve scored and know the performance of, but no way of how to then deploy and use your model. I will outline the steps needed to do this (which involves a major shortcut as we are using an AutoMl model), and then show you the required VBA needed to consume this in Microsoft Excel.

Read on to see how you can do this. Back in the really old Azure ML days, you could download an Excel workbook which would have things set up and you could feed in a bunch of input data and get predictions.

Comments closed

Azure Synapse Analytics May 2022 Updates

Ryan Majidimehr lays out some updates for Azure Synapse Analytics:

Serverless SQL pools let you query files in the data lake without knowing the schema upfront. The best practice was to specify the lengths of character columns to get optimal performance. Not anymore!  

Previously, you had to explicitly define the schema to get optimal query performance. In this case, the column countries_and_territories is defined as varchar(50):  

There are some interesting updates in this month’s release, including the public preview of Azure Synapse Link for SQL, which connects to Azure SQL DB and SQL Server 2022.

Comments closed