Press "Enter" to skip to content

June 2, 2022

ML Algorithms a Poor Fit for Predictive Caches

Pete Warden describes an interesting phenomenon:

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

Read on for the explanation. My reading here is that this is a downside to having general-purpose compute: you run the risk of sub-optimal performance in some circumstances, such as training certain types of ML models.
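To make the claim concrete (my own sketch, not from Warden's post): for a dense kernel like a fixed-shape matrix multiply, the sequence of memory locations touched is a pure function of the tensor shapes and loop indices, so it can be enumerated before any data arrives. Data-dependent code, like a hash lookup, cannot make that promise.

    # Sketch: why dense ML kernels have statically predictable memory access.
    # The locations this matmul touches depend only on the shapes (M, K, N),
    # never on the values stored in the matrices, so they are knowable at
    # compilation time.
    def matmul_access_trace(M, K, N):
        trace = []
        for i in range(M):
            for j in range(N):
                for k in range(K):
                    trace.append(("read_a", i, k))
                    trace.append(("read_b", k, j))
                trace.append(("write_c", i, j))
        return trace

    # By contrast, the next location a hash probe touches depends on the data
    # itself, which is what predictive caches are built to guess at run time.
    def lookup(table, key):
        return table[hash(key) % len(table)]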


Data Modeling with Spark – Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colors – displays the nickname, color, and age of the pets. We went from a normalized dataset where common & recurring values were substituted for numeric representations – to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique, but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data, and translating a surrogate key back to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like the owner’s name (which could still be on that table), the owner’s age, the owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.
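For reference, here is a minimal PySpark sketch of the join being discussed; the table and column names are my guesses from the quoted description, not Landon's actual code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pets").getOrCreate()

    # Hypothetical data matching the description: pets carry a numeric
    # color_id, and a small lookup table maps that id back to the color name.
    pets = spark.createDataFrame(
        [("Fido", 1, 4), ("Whiskers", 2, 7)],
        ["nickname", "color_id", "age"],
    )
    colors = spark.createDataFrame(
        [(1, "brown"), (2, "gray")],
        ["color_id", "color"],
    )

    # Translate the surrogate key back to its natural value.
    pets.join(colors, on="color_id", how="inner") \
        .select("nickname", "color", "age") \
        .show()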


Cancelling a Cosmos DB Query

Hasan Savran pushes the big red button:

Sometimes you go to a website and the page does not want to open for whatever reason. You might just click Stop and move on to another page. The Stop button does not really stop the request; the server still tries to complete the request and send a response to the client, even if the client does not exist anymore. Rather than clicking Stop, maybe you click Refresh, which triggers another request while the first one has not completed yet.

This scenario applies to all database calls. A query might take longer to run for a reason, and you simply need to wait until you get a response from the database server. If it takes too much time, you might want to cancel the request and look at your query to make it faster. You can do that programmatically by using CancellationTokens.

Read on for more information about cancellation tokens and how you can use them.
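Hasan's post covers the .NET CancellationToken approach; as a rough analogue in Python (my sketch, not his code), the same give-up-and-cancel pattern looks like this with asyncio, where query_database stands in for a real SDK call:

    import asyncio

    # Stand-in for a slow database call made through an async SDK.
    async def query_database():
        await asyncio.sleep(30)
        return "rows"

    async def main():
        task = asyncio.create_task(query_database())
        try:
            # wait_for cancels the task if it exceeds the timeout, which is
            # the moral equivalent of signaling a CancellationToken.
            return await asyncio.wait_for(task, timeout=2.0)
        except asyncio.TimeoutError:
            print("Query cancelled; time to make that query faster.")

    asyncio.run(main())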


Inferring Data from Its Absence

John Cook lays out an important insight:

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

Read the whole thing. This also leads to a swath of security attacks based around combining information, in which each query may return data only when at least X people are in the result set (to prevent us from narrowing down to one person), but based on some information I know about a person, I can write a combination of queries to elicit more info about that person. As an example, if I know that a person is left-handed (about 1/9 of the population), has red hair (around 2% of people), etc., I can find ways to combine these traits so that no individual query returns fewer than X results, yet with enough queries I can have reasonably high confidence of identifying the individual.
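Here is a toy version of that attack (my own illustration): each single-trait query clears a minimum-result threshold of X = 2, yet intersecting the results client-side isolates one person.

    # Four people; no query returning fewer than 2 rows is allowed.
    people = [
        {"id": 1, "left_handed": True,  "red_hair": True},
        {"id": 2, "left_handed": True,  "red_hair": False},
        {"id": 3, "left_handed": False, "red_hair": True},
        {"id": 4, "left_handed": False, "red_hair": False},
    ]

    def query(pred, threshold=2):
        rows = [p for p in people if pred(p)]
        return rows if len(rows) >= threshold else None  # small results blocked

    lefties = query(lambda p: p["left_handed"])   # 2 rows: allowed
    redheads = query(lambda p: p["red_hair"])     # 2 rows: allowed

    # Neither query alone identifies anyone, but their intersection does.
    print({p["id"] for p in lefties} & {p["id"] for p in redheads})  # {1}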


Expressive T-SQL with Equivalent Performance

Aaron Bertrand shows off one of the reasons I like SQL so much:

As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.

A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.

The fundamental insight here is that SQL is a 4th generation language, otherwise known as a declarative language: we tell the interpreter what we want and let it determine a path to get us there. By contrast, 3rd generation languages like C are imperative languages: the code does what we tell it to do, no more, no less (until the compiler gets in there and rewrites our code to make it better…but these are ideals that languages approach, not hard-and-fast mandates). There are benefits and drawbacks to either approach, depending on how creative you are and how good the interpreter/compiler is.
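As a toy illustration of that split (mine, not Aaron's example, which is in T-SQL): the Stack Overflow question boils down to SQL's HAVING clause, and here is the same keep-groups-where-an-aggregate-condition-holds logic written both imperatively and declaratively in Python:

    orders = [
        ("alice", 40), ("alice", 70),
        ("bob", 20), ("bob", 15),
    ]

    # Imperative (3GL style): spell out every step of the grouping by hand.
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    big_spenders = [c for c, t in totals.items() if t > 100]

    # Declarative-ish, close to GROUP BY customer HAVING SUM(amount) > 100:
    import pandas as pd
    df = pd.DataFrame(orders, columns=["customer", "amount"])
    big_spenders2 = (
        df.groupby("customer")["amount"].sum()
          .loc[lambda s: s > 100]
          .index.tolist()
    )
    print(big_spenders, big_spenders2)  # ['alice'] ['alice']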


Unit Testing ADX Functions

David Giard builds some tests:

Our application contains many functions that return data stored in Azure Data Explorer (ADX). We wrote these functions in Kusto Query Language (KQL) and each function returns a set of data based on the arguments passed. Although developers tested these functions as they wrote them, we needed a way to validate that the functions continued to work as the code and the data changed.

Automated unit testing is an essential part of any application development life cycle. It validates that code works properly and minimizes the risk that future code changes will break existing functionality.

In this article, I will discuss the approach we took in automating the testing of ADX functions.

Click through to see how to use the assert() function and build some tests.
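David's tests live in KQL itself via assert(); as a sketch of the same idea driven from the outside (my own, not his code), a Python test can execute an ADX function and assert on the results. The cluster URL, database, and function name below are placeholders:

    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    CLUSTER = "https://mycluster.region.kusto.windows.net"  # placeholder
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
    client = KustoClient(kcsb)

    def test_get_active_pets_returns_rows():
        # GetActivePets() is a hypothetical stored function in the database.
        response = client.execute("PetsDb", "GetActivePets() | count")
        count = response.primary_results[0].rows[0]["Count"]
        assert count > 0, "expected GetActivePets() to return at least one row"

Run it with pytest and the test fails the moment the function stops returning data as the code or the data changes.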


Language Translation via Power BI Field Parameters

Gerhard Brueckl shows off a great use of Power BI Field Parameters:

The current approaches when it comes to data and value translations are more workarounds than actual solutions. They probably work fine for small data models and very specific use cases but usually fall short in performance, usability, or maintainability when implemented on larger-scale enterprise models.

The recently introduced Field Parameters in Power BI give us a bit more flexibility here and another potential solution to implement data and value translations in Power BI.

Click through for an example which shows data in English, Spanish, and French.
