Press "Enter" to skip to content

Author: Kevin Feasel

Shrinking Convolutional Neural Networks for TinyML

Pete Warden writes up a tip:

A colleague recently asked for more details on an approach I recommended, but which she hadn’t seen any documentation for. I realized that it was something I’d learned from talking to model builders at Google, and I wasn’t sure there was anything written up, so in the spirit of leaving a trail of breadcrumbs for anyone coming after, I thought I should put it into a quick blog post.

The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing them entirely and replacing them with a stride in the convolution instead. This has two main benefits, but to explain it’s easiest to diagram out the network before and after.

Click through for the full explanation.


The Basics of Table Variables

Jared Poche explains some of the value of table variables:

In having a talk reviewed recently, it was suggested I spend more time defining some of the subjects I touched on. It occurred to me that if I should go over (or at least introduce) these ideas during a talk for a SQL Saturday audience, some might find a post on the subject useful. Hence my recent post on key lookups.

Another such topic is table variables. I use table variables frequently at my current job, but they came up very infrequently when I worked at CSS in Microsoft. I remember the conversations about them being very simple at the time, as in, “you should just use temp tables instead.” But there is a lot of utility with table variables, and they could be a useful arrow in your quiver.

Click through for the full story.
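
If the term is new, here is a minimal sketch of the syntax (the table and its contents are invented for illustration, not taken from Jared's post):

-- A table variable: declared in the batch, scoped to that batch or procedure,
-- and cleaned up automatically when the scope ends.
DECLARE @TopCustomers TABLE
(
    CustomerID   int            NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100)  NOT NULL,
    TotalSales   decimal(18, 2) NOT NULL
);

INSERT INTO @TopCustomers (CustomerID, CustomerName, TotalSales)
VALUES (1, N'Contoso', 125000.00),
       (2, N'Fabrikam', 98000.00);

-- Use it like any other table for the remainder of the batch.
SELECT CustomerID, CustomerName, TotalSales
FROM @TopCustomers
ORDER BY TotalSales DESC;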


Diving into Data Mesh

Ust Oldfield has a series on data mesh. First up is a primer on the topic:

What is a Data Mesh? Data Mesh is new and was introduced in a few articles by Zhamak Dehghani, starting in May 2019. The first article, How To Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, and the second article, Data Mesh Principles and Logical Architecture, form the foundational thought pieces on data meshes. Very simply, and crudely, a Data Mesh aims to overcome the deficiencies of previous generations of analytical architectures by decentralising the ownership and production of analytical data to the teams who own the data domain. It is a convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.

Essentially, it applies the learnings from operational systems and domain-driven design, so that ownership of data is domain oriented. For example, a CRM team will own the Customer domain and all the data within it. Instead of data from all domains flowing into a central data repository, the individual domains curate and serve their datasets in an easily consumable way. The result is a much closer alignment between operational and analytical data.

Then Ust has a deep dive:

A foundational principle of data mesh is the decentralisation of ownership of data to those closest to it. Those closest to it are those in the business, often using the operational systems and also using analytical data. Responsibility and ownership of the data is devolved from a central function to business units and domains. Therefore any change to how a business domain organises itself is limited to the domain itself rather than impacting the entire organisation. This is referred to as the bounded domain context.

Teams responsible for the business operation, e.g. CRM team responsible for Customers, are also responsible for the ownership and serving of the analytical data relating to their domain.

I probably need to spend more time thinking about it, but this feels like a rickety ladder of incentive compatibility problems. I agree with Ust’s characterization of warehouses as staid and that relatively little has happened since the Kimball model came to prominence, but this sounds like taking the problems of the Kimball model (slow to change, an iceberg of code & ETL, etc.) and effectively saying “Hey, here are slightly different views of the source systems, have fun” instead. That way, everybody gets to experience the joy of non-conforming dimensions, disagreeing facts, working with data in different grains, and seven terms for the same thing.


Indexes and Sorts

Chad Callihan reminds us that sort order can matter for indexes:

When you’re working on an index for a query ordering by one column in ascending order and another column in descending order, do you have your index created to match? Did you know you can specify ASC or DESC in an index? Let’s go through a scenario where ordering in an index makes a noticeable difference.

This is particularly important for window functions—the optimizer can sometimes be smart enough to recognize that a value is in reverse order and not need to use a sort operator, but as soon as you drop that OVER() clause in, if things aren't in the exact order they need to be, you get a sort operator thrown in for free. Or, well, the "your query is now a little bit slower" version of free.
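
A minimal sketch of both points, with invented table and index names:

-- Hypothetical table; the workload orders by customer ascending, date descending.
CREATE TABLE dbo.Sales
(
    SaleID     int      NOT NULL PRIMARY KEY,
    CustomerID int      NOT NULL,
    SaleDate   datetime NOT NULL
);

-- Key directions match the ORDER BY, so the index can be read in order.
CREATE INDEX IX_Sales_CustomerID_SaleDate
    ON dbo.Sales (CustomerID ASC, SaleDate DESC);

-- Satisfied from the index with no Sort operator.
SELECT SaleID, CustomerID, SaleDate
FROM dbo.Sales
ORDER BY CustomerID ASC, SaleDate DESC;

-- The window function's ORDER BY lines up with the index as well,
-- giving the optimizer a chance to skip the Sort here too.
SELECT SaleID, CustomerID, SaleDate,
       ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY SaleDate DESC) AS RowNum
FROM dbo.Sales;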


Removing Data from Query Store

Erin Stellato shows us how to do some spring cleaning:

Automatic removal of data from Query Store is based on configuration, notably the MAX_STORAGE_SIZE_MB, CLEANUP_POLICY (STALE_QUERY_THRESHOLD_DAYS), and SIZE_BASED_CLEANUP_MODE settings. While these are detailed in another post, it's worth reiterating that ideally, size-based cleanup never kicks in. A database's Query Store should be sized to accommodate N days' worth of data, where N is set by CLEANUP_POLICY (STALE_QUERY_THRESHOLD_DAYS). When the size of the Query Store exceeds 90% of MAX_STORAGE_SIZE_MB, and if SIZE_BASED_CLEANUP_MODE is set to AUTO, then cleanup will kick in. This algorithm is not efficient, and it runs single-threaded. It looks for queries that are infrequently executed/less important and deletes those, one by one, until the size is less than 80% of MAX_STORAGE_SIZE_MB. Avoid this type of cleanup if at all possible.

There is a better way. And an even better way thanks to some of my colleagues.
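
For reference, the settings Erin mentions are all configured per database; a quick sketch of checking and adjusting them (the numbers below are placeholders, not recommendations):

-- See how the configured limits compare to actual Query Store usage.
SELECT actual_state_desc,
       max_storage_size_mb,
       current_storage_size_mb,
       stale_query_threshold_days,
       size_based_cleanup_mode_desc
FROM sys.database_query_store_options;

-- Size Query Store generously enough that size-based cleanup never has to run.
ALTER DATABASE CURRENT
SET QUERY_STORE
(
    MAX_STORAGE_SIZE_MB = 2048,
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30),
    SIZE_BASED_CLEANUP_MODE = AUTO
);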


Deploying Custom Docker Images in Azure ML

Tsuyoshi Matsuzaki shows us how to deploy an Azure ML model via custom Docker image:

In an earlier post, I showed you how to bring your own custom docker image to training with Azure Machine Learning.
Here, I'll show you how to bring a custom docker image to model deployment.

In Azure Machine Learning, the base docker image used in deployment includes the inferencing assets, such as the Flask server. So you should use an AML-compliant image as the base image, even when you use your own custom docker image.
The list of these maintained AML images is available at https://github.com/Azure/AzureML-Containers.

Read on for an example.


Finding and Removing Custom Roles, Schemas, and Users from a Database

Thomas Williams wants to go back to square one:

I’m a fan of the built-in database roles like db_datareader to standardise & simplify permissions (sorry Dr. Greg Low!) and recently I needed to do just that in a database created using SQL Server 2000, and remove old defaults and a lot of custom roles, schemas and users.

I wrote the set of queries below to generate scripts to remove non-built-in roles, schemas and users, when compared to the model database on a new SQL Server 2019 server.

After running the script generated by the queries, I added back users and gave appropriate roles (like db_datareader).

Read on for the script.
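
This is not Thomas's script, but a rough sketch of the compare-against-model idea, limited to roles and assuming both databases sit on the same instance:

-- Generate DROP ROLE statements for user-defined roles that do not exist in model.
-- A role that still has members or owns securables must be emptied out first.
SELECT 'DROP ROLE ' + QUOTENAME(dp.name) + ';' AS DropStatement
FROM sys.database_principals AS dp
WHERE dp.type = 'R'            -- database roles
  AND dp.is_fixed_role = 0     -- keep db_datareader, db_owner, and friends
  AND dp.name <> 'public'
  AND NOT EXISTS
      (
          SELECT 1
          FROM model.sys.database_principals AS m
          WHERE m.name = dp.name
      );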


Representing Dates in Power BI: Date or Integer?

Marco Russo and Alberto Ferrari share their take on a classic debate:

A question that is often asked during the design of a Power BI data model is whether it is better to use an Integer or a Datetime column to link a fact table with the Date dimension. Historically, using Integers has always been a better choice in database design. However, Tabular is an in-memory columnar database, and its architecture is quite different from the relational databases we might be used to working with.

Indeed, in Tabular there are no technical differences between using a Datetime or an Integer to create a relationship. The database size, the query speed, and any other technical detail are absolutely identical. Therefore, the choice is not related to technical aspects, but rather to the convenience of the design. Depending on the specific requirements of your model, you might favor one data type over the other. In the most common scenarios, a Datetime proves to be better because it provides more possibilities to compute values on dates without having to rely on relationships. With that said, if your model uses Integers and you do not need to perform calculations on the dates represented in the table, then you can choose the most convenient data type – that is, the one already used in the original data source.

The remaining part of the article aims to prove the previous sentences, and to provide you with the technical details about how we tested the respective performance of the two options.

Click through for Marco and Alberto's analysis, noting that "date" here does not include time of day, so it would have the same cardinality as the integer date key. This was a bigger deal fifteen years ago, before columnstore technologies (like columnstore indexes and VertiPaq) were readily available, when that 4-byte integer was considerably smaller than an 8-byte DATETIME.
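
A quick T-SQL illustration of that size point, using an arbitrary date:

-- One value per day either way, so the cardinality of the key column is identical;
-- the difference used to be purely about storage size.
DECLARE @AsDate     date     = '2021-06-15';
DECLARE @AsDatetime datetime = '2021-06-15';
DECLARE @AsIntKey   int      = CONVERT(int, CONVERT(char(8), @AsDate, 112));  -- 20210615

SELECT @AsIntKey               AS IntegerDateKey,
       DATALENGTH(@AsIntKey)   AS IntBytes,       -- 4
       DATALENGTH(@AsDatetime) AS DatetimeBytes,  -- 8
       DATALENGTH(@AsDate)     AS DateBytes;      -- 3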


Ways to Avoid the MERGE Operator

Michael J. Swart has important bullet points:

Aaron Bertrand has a post called Use Caution with SQL Server’s MERGE Statement. It’s a pretty thorough compilation of all the problems and defects associated with the MERGE statement that folks have reported in the past. But it’s been a few years since that post and in the spirit of giving Microsoft the benefit of the doubt, I revisited each of the issues Aaron brought up.

Some of the items can be dismissed based on circumstances. I noticed that:

– Some of the issues are fixed in recent versions (2016+).

– Some of the issues that have been marked as won’t fix have been fixed anyway (the repro script associated with the issue no longer fails).

– Some of the items are complaints about confusing documentation.

– Some of the complaints are about issues that are not limited to the MERGE statement (e.g. concurrency and constraint checks).

Spoilers: some + some + some + some is still a lot less than all. Read the whole thing.
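
If you land on "avoid," the usual replacement is a pair of set-based statements rather than a single MERGE; here is a rough sketch with invented table names:

-- Upsert without MERGE: update the rows that already exist, then insert the rest.
BEGIN TRANSACTION;

UPDATE tgt
SET    tgt.Amount = src.Amount
FROM   dbo.TargetTable AS tgt WITH (UPDLOCK, SERIALIZABLE)  -- guards against racing inserts
JOIN   dbo.SourceTable AS src
       ON src.ID = tgt.ID;

INSERT INTO dbo.TargetTable (ID, Amount)
SELECT src.ID, src.Amount
FROM   dbo.SourceTable AS src
WHERE  NOT EXISTS
       (
           SELECT 1
           FROM dbo.TargetTable AS tgt
           WHERE tgt.ID = src.ID
       );

COMMIT TRANSACTION;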


Deploying Datasets in Azure Analysis Services and Power BI PPU

Gilbert Quevauvilliers continues a series on migrating from Azure Analysis Services to Power BI Premium Per User:

Welcome to part 8, where in this blog post, I am going to compare deploying datasets.

For those people who are not exactly sure what deployments are, what this means is when you are using Power BI Desktop and you click on Publish, you are effectively deploying your changes to the Power BI Service (which could also be a server in the cloud).

In this blog post I will show the differences when completing a deployment from AAS and then PPU.

Read on to see several techniques for deploying for each technology.
