May 2020 – Page 4 – Curated SQL

Planning a Power BI Enterprise Deployment: In Whitepaper Form

Published 2020-05-22 by Kevin Feasel

Melissa Coates has an updated whitepaper for us:

I’m really excited to announce that a new version of the Microsoft whitepaper “Planning a Power BI Enterprise Deployment” is now available.
This is version 3 of the whitepaper that I co-authored with Chris Webb. The previous version was from July 2018, so this update includes quite a lot of changes throughout.
Huge high-five to Meagan Longoria who was our tech reviewer again. She never fails to make my writing better.

That’s three very sharp people, so you can bet it’s going to be good.

Comments closed

Understanding Heaps in SQL Server

Published 2020-05-22 by Kevin Feasel

Uwe Ricken has a series on the much-maligned heap:

This article is the beginning of a series of articles about Heaps in Microsoft SQL Server. Heaps are rejected by many database developers using Microsoft SQL Server. The concerns about Heaps are even fuelled by Microsoft itself by generally recommending the use of clustered indexes for every table. Globally renowned SQL Server experts also generally advise that tables in Microsoft SQL Server be provided with a clustered index.
Again, and again, I try to convince developers that a heap can even have advantages. I have discussed many pros and cons with these people and would now like to break a “PRO HEAP” lance. This article deals with the basics. Important system objects that play a major role in Heaps are only superficially presented in this article and described in detail in a follow up article.

I’m generally in the anti-heap camp, but I can acknowledge that there are situations in which heaps are better—I save my dogmatism for other things, like hating pie charts and loving representations of things as event streams.

Comments closed

Understanding RID Lookups

Published 2020-05-22 by Kevin Feasel

Hugo Kornelis takes us through an operator I usually don’t want to see:

The RID Lookup operator offers the same logical functionality within the execution plan as the Key Lookup operator. But where Key Lookup is used for tables that have a clustered index, RID Lookup is instead used when a table is “heap” (table without clustered index). It is used when another operator (usually an Index Seek, sometimes an Index Scan, rarely a combination of two or more of these or other operators) is used to find rows that need to be processed, but the index used does not include all columns needed for the query. The RID Lookup operator is then used to fetch the remaining columns from the heap structure where the table data is stored.

Click through for a great deal of information about RID Lookups.

Comments closed

Ten Comments from a DBA

Published 2020-05-22 by Kevin Feasel

Kevin Chant spins the DBA archetype around:

2. So glad we test our backups.
Now this is something that every DBA should do, or at least persuade whoever is responsible for backups to do it.
Otherwise you may find yourself in a situation where a database is corrupt and a restore is not possible. Which means that you have to try and recover the database using other methods like the one here.
I can tell you from experience that this is definitely not the best situation to be in unless you enjoy working for over twenty-four hours straight. So, if your backups are not being tested at the moment then I highly recommend you change that.

And if you are a DBA who can’t say this, Kevin has some advice for each of the ten.

Comments closed

dbatools Commands for Performance Tuning

Published 2020-05-22 by Kevin Feasel

John McCormack takes a look at dbatools with an eye on performance tuning:

DBATools is well known in the SQL Server community for the ease at which it allows you to automate certain tasks but did you know that DBATools can help with performance tuning your SQL Server. As my job recently changed to have more of a performance tilt, I wanted to investigate which commands would be useful in helping me with performance tuning. It turned out there are quite a few.

There are some good commands in here.

Comments closed

Data Platform Announcements from Build

Published 2020-05-22 by Kevin Feasel

James Serra looks at some announcements from the Build conference:

A few data platform announcements yesterday at Microsoft Build that I wanted to blog about.
The biggest one is Azure Synapse Analytics is now available in public preview! You can immediately log into your Azure portal and use it. While in the Azure portal, search for “Synapse” and you will see “Azure Synapse Analytics (workspaces preview)”. Choose that and then click “Create Synapse workspace” (you first may need to register the resource provider “Microsoft.Synapse” in your subscription – see Azure resource providers and types).

James also covers other highlights, including Cosmos DB and Azure SQL Database Edge.

Comments closed

Distributed Model Training with Cloudera ML

Published 2020-05-21 by Kevin Feasel

Zuling Kang and Anand Patil show us how to train models across several nodes using Cloudera Machine Learning:

Deep learning models are generally trained using the stochastic gradient descendent (SGD) algorithm. For each iteration of SGD, we will sample a mini-batch from the training set, feed it into the training model, calculate the gradient of the loss function of the observed values and the real values, and update the model parameters (or weights). As it is well known that the SGD iterations have to be executed sequentially, it is not possible to speed up the training process by parallelizing iterations. However, as processing one single iteration for a number of commonly used models like CIFAR10 or IMAGENET takes a long time, even using the most sophisticated GPU, we can still try to parallelize the feedforward computation as well as the gradient calculation within each iteration to speed up the model training process.
In practice, we will split the mini-batch of the training data into several parts, like 4, 8, 16, etc. (in this article, we will use the term sub-batch to refer to these split parts), and each training worker takes one sub-batch. Then the training workers do feedforward, gradient computation, and model updating using the sub-batches, respectively, just as in the monolithic training mode. After these steps, a process called model average is invoked, averaging the model parameters of all the workers participating in the training, so as to make the model parameters exactly the same when a new training iteration begins. Then the new round of the training iteration starts again from the data sampling and splitting step.

Read on for the high-level explanation, followed by some Python code working in TensorFlow.

Comments closed

Spark Application Execution Modes

Published 2020-05-21 by Kevin Feasel

Kundan Kumarr explains how the two execution modes differ with Apache Spark:

Whenever we submit a Spark application to the cluster, the Driver or the Spark App Master should get started. And the Driver will be starting N number of workers. Spark driver will be managing spark context object to share the data and coordinates with the workers and cluster manager across the cluster. Cluster Manager can be Spark Standalone or Hadoop YARN or Mesos. Workers will be assigned a task and it will consolidate and collect the result back to the driver. A spark application gets executed within the cluster in two different modes – one is cluster mode and the second is client mode.

Click through for a comparison.

Comments closed

Big-O Notation in .NET

Published 2020-05-21 by Kevin Feasel

Camilo Reyes takes us through a useful concept in computer science as applied to .NET Core:

Performance sensitive code is often overlooked in business apps. This is because high-performance code might not affect outcomes. Concerns with execution times are ignorable if the code finishes in a reasonable time. Apps either meet expectations or not, and performance issues can go undetected. Devs, for the most part, care about business outcomes and performance is the outlier. When response times cross an arbitrary line, everything flips to less than desirable or unacceptable.
Luckily, the Big-O notation attempts to approach this problem in a general way. This focuses both on outcomes and the algorithm. Big-O notation attempts to conceptualize algorithm complexity without laborious performance tuning.

This is a rather high-level take on the idea, as it doesn’t cover any of the O(NlogN) or O(logN) algorithms out there. But if you are not familiar with the concept, it is good to know.

Comments closed

Standardized DAX Separators in Power BI Desktop

Published 2020-05-21 by Kevin Feasel

Marco Russo goes over the ramifications of a recent change to Power BI Desktop:

Starting from the May 2020 version of Power BI Desktop, regardless of the Windows locale settings DAX always uses standard separators by default. This change does not affect most Power BI users around the world; but if you use the comma as a decimal separator and 10,000 is just ten and not ten thousand, then you are affected by this change.
First of all, you can restore the previous behavior as I describe in this post, but the recommendation and the default are now to use the standard DAX separators. I want to describe why this (I think good) choice was made and how I contributed to making this happen.

Read the whole thing.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Month: May 2020