
Day: March 4, 2020

Secure Azure Data Source Access from Databricks

Bhavin Kukadia, Abhinav Garg, and Michal Marusan show us the right way to access Azure data sources from Azure Databricks:

Enterprise Security is a core tenet of building software at both Databricks and Microsoft, and thus it's considered a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity refers to ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options which customers could use to establish such a data access mechanism to Azure data services like Azure Blob Storage, Azure Data Lake Store Gen2, Azure Synapse Data Warehouse, Azure Cosmos DB, etc. Please read further for a discussion of Azure Private Link and Service Endpoints.

This is more about network configuration than about things like "store your credentials and other secrets in Azure Key Vault," which is also a good idea.


Building Metadata for an ADF Pipeline

Paul Andrew continues a series on Azure Data Factory and metadata-driven pipelines:

Welcome back, friends, to part 2 of this 4-part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.

Let’s start with a nice new shiny Azure SQL DB database and schema. This can easily be scaled up as our calls from Data Factory increase and, ultimately, as the solution we are using the framework for grows.

Soon we will get to see Azure Data Factory's power in action.
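
For a concrete sense of what "processing framework metadata" might mean here, a minimal version of such a schema could look like the sketch below. The schema, table, and column names are my own illustration, not Paul's actual design:

-- Hypothetical metadata tables for a Data Factory processing framework.
-- All names here are illustrative placeholders.
CREATE SCHEMA metadata;
GO

CREATE TABLE metadata.PipelineStage
(
    StageId   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    StageName NVARCHAR(100) NOT NULL,
    Enabled   BIT NOT NULL DEFAULT (1)
);

CREATE TABLE metadata.PipelineProcess
(
    ProcessId    INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    StageId      INT NOT NULL REFERENCES metadata.PipelineStage (StageId),
    PipelineName NVARCHAR(200) NOT NULL, -- the child ADF pipeline to invoke
    Enabled      BIT NOT NULL DEFAULT (1)
);

The idea is that a parent pipeline in Data Factory queries tables like these (via a Lookup activity, for instance) to decide which child pipelines to execute and in what order.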


Real-Time Replay with WorkloadTools

Gianluca Sartori shows us how to perform a real-time replay with WorkloadTools:

Before we jump to how, I’d better spend some words on why a real-time replay is needed.

The main reason is the complexity involved in capturing and analyzing a workload for extended periods of time. Especially when performing migrations and upgrades, it is crucial to capture the entire business cycle in order to cover all possible queries issued by the applications. All existing benchmarking tools require you to capture the workload to a file before it can be analyzed and/or replayed, but this becomes increasingly complicated as the length of the business cycle grows.

I’m not sure how frequently I’d use real-time replays, but it’s nice to know that it’s pretty easy to pull off with WorkloadTools.
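
WorkloadTools captures its source workload through Extended Events (or SQL Trace against older versions), as I understand it. To make the capture side concrete, a hand-rolled Extended Events session covering the usual events would look something like this; the session name and file path are placeholders:

-- A bare-bones Extended Events capture session, the kind of mechanism
-- workload capture tools build upon. Name and file path are placeholders.
CREATE EVENT SESSION WorkloadCapture ON SERVER
ADD EVENT sqlserver.rpc_completed,
ADD EVENT sqlserver.sql_batch_completed
ADD TARGET package0.event_file (SET filename = N'C:\Temp\WorkloadCapture.xel');
GO

ALTER EVENT SESSION WorkloadCapture ON SERVER STATE = START;

WorkloadTools handles all of this for you, including feeding the captured events to a real-time replay consumer; the session above is only to show what gets collected.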


Setting Up a SQL Server Lab with AutomatedLab

Jess Pomfret looks at a very interesting PowerShell module:

There is a fantastic PowerShell module called AutomatedLab that enables you to easily build out a lab for the specific scenario you need to test. Even better, the module comes with 70 sample scripts that you can start with and adapt to meet your needs.

The module gives you the option to work with Hyper-V or VMware. I will say most of the examples use Hyper-V, and that is what I'll be using as well.

For my lab I want a SQL Server 2019 instance joined to a domain, and a separate client machine that I can manage the SQL Server from. On the client, I need to be able to connect to the internet, as I want to download PowerShell modules from the gallery easily.

It’s about time for me to rebuild my lab, so I’ll need to check that out.


Approximate Distinct Count with DAX

Gilbert Quevauvilliers runs some performance tests against the approximate distinct count formula in DAX:

I am currently running SQL Server Analysis Services (SSAS) 2019 Enterprise Edition. (This can also be applied to Power BI)

My fact table has roughly 950 million rows stored in it.

And, as mentioned previously, it has over 64 million distinct users.

The data is queried from SQL Server into SSAS.

Gilbert first checks how close these are and then how much faster the approximate count is.
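
As a related point of reference, the database engine side of SQL Server 2019 has a T-SQL analogue, APPROX_COUNT_DISTINCT(), which (as I understand it) is also what the DAX function translates to against a SQL Server DirectQuery source. A quick side-by-side accuracy check on your own data could look like this; the table and column names are placeholders:

-- Compare exact and approximate distinct counts (SQL Server 2019 and later).
-- dbo.FactActivity and UserId are placeholder names.
SELECT
    COUNT(DISTINCT UserId)        AS ExactDistinctUsers,
    APPROX_COUNT_DISTINCT(UserId) AS ApproxDistinctUsers
FROM dbo.FactActivity;

The documentation promises an error rate of roughly 2% with 97% probability, in exchange for a much smaller memory footprint on high-cardinality columns.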
