2021-07-02 – Curated SQL

Memory-Optimized tempdb Metadata Bug

Published 2021-07-02 by Kevin Feasel

Hey folks! Today I’ll share another bug we found recently when testing some new SQL Server 2019 instances in our development environment. This one is a bit concerning as it could affect a lot of shops and the bug presents itself during a pretty common use case.

The unfortunate thing is that ours is exactly the type of environment that memory-optimized tempdb metadata was made for.

Comments closed

Random Forest Feature Importance

Published 2021-07-02 by Kevin Feasel

Selcuk Disci takes us through an important concept with random forest models:

The random forest algorithms average these results; that is, it reduces the variation by training the different parts of the train set. This increases the performance of the final model, although this situation creates a small increase in bias.
The random forest uses bootstrap aggregating(bagging) algortihms. We would take for training sample, X = x₁, …, x_n and, Y = y₁, …, y_n for the outputs. The bagging process repeated B times with selecting a random sample by changing the training set and, tries to fit the relevant tree algorithms to the samples. This fitting function is denoted f_b in the below formula.

As far as the article goes, inflation is always and everywhere a monetary phenomenon. H/T R-Bloggers.

Comments closed

Parallelizing R Code

Published 2021-07-02 by Kevin Feasel

Mira Celine Klein walks us through some of the basics of parallel code execution in R:

In many cases, your code fulfills multiple independent tasks, for example, if you do a simulation with five different parameter sets. The five processes don’t need to communicate with each other, and they don’t need any result from any other process. They could even be run simultaneously on five different computers… or processor cores. This is called parallelization. Modern desktop computers usually have 16 or more processor cores. To find out how many cores you have on your PC, use the function detectCores(). By default, R uses only one core, but this article tells you how to use multiple cores. If your simulation needs 20 hours to complete with one core, you may get your results within four hours thanks to parallelization!

Read on to see how you can accomplish this, but note that it is operating system-dependent.

Comments closed

Using Filter Based Feature Selection in Text Analytics

Published 2021-07-02 by Kevin Feasel

Dinesh Asanka takes us through a text analytics technique in Azure Machine Learning:

There are two parameters to be defined in the Feature Hashing control. Hashing bitsize will define the maximum number of vectors. 10 hashing bitsize means 1,024 vectors (2^10). 1,024 vectors are more than enough even for the large volume text files. Next, we need to choose N-grams which is 2 as 2 is the optimal number for N-grams for most situations. A detailed description of N-Grams is given in the link given in the reference section.
After the vectors are generated, we do not need other text columns. Apart from the vectors, we need only the dependent attribute or the category column in this example. Therefore, we can remove the unnecessary attributes by Select Columns in dataset control. However, this control will show 1,024 vectors even though it is not available in the previous step, Feature Hashing. Therefore, you need to choose only the available attributes in the Feature Hashing control at the Select Columns in dataset control. In the above example, only 93 vectors were generated.

Click through to learn more.

Comments closed

Centralized and Decentralized Data Architectures

Published 2021-07-02 by Kevin Feasel

James Serra looks at a pattern:

A centralized data architecture means the data from each domain/subject (i.e. payroll, operations, finance) is copied to one location (i.e. a data lake under one storage account), and that the data from the multiple domains/subjects are combined to create centralized data models and unified views. It also means centralized ownership of the data (usually IT). This is the approach used by a Data Fabric.
A decentralized distributed data architecture means the data from each domain is not copied but rather kept within the domain (each domain/subject has its own data lake under one storage account) and each domain has its own data models. It also means distributed ownership of the data, with each domain having its own owner.
So is decentralized better than centralized?

Read on for James’s answer, and allow me to include a Dilbert cartoon so old, the boss didn’t even have pointy hair yet.

How Decentralized Organizations Can be Effective | The Fourth Revolution Blog

Comments closed

Local Variables and Cursors

Published 2021-07-02 by Kevin Feasel

Erik Darling might have found the Bermuda Triangle:

While helping a client out with weird performance issues, we isolated part of the code that was producing a whole bunch of bad plans.
At the intersection of bad ideas, there was a cursor looping over a table gathering some data points with a local variable in the where clause.

Click through to understand why this isn’t a great idea.

Comments closed

A Show about Nothing

Published 2021-07-02 by Kevin Feasel

Joe Celko has a moment of zen:

Human beings are not very good at handling nothing. The printing press didn’t just lead to civilization as we know it, but it also changed our mindset about text. When we wrote text manually on paper, a blank or space was not seen as a character. It was just the background upon which characters were written.
It was centuries before the zero was accepted as a number. After all, it represents the absence of a quantity or magnitude or position; how could it possibly be a number? Before it was accepted as a number, it was considered a symbol or mark in a positional notation to indicate that there was nothing in that position.

It’s an interesting riff, so check it out.

Comments closed

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Day: July 2, 2021

Memory-Optimized tempdb Metadata Bug

Random Forest Feature Importance

Parallelizing R Code

Using Filter Based Feature Selection in Text Analytics

Centralized and Decentralized Data Architectures

Local Variables and Cursors

A Show about Nothing