Press "Enter" to skip to content

Author: Kevin Feasel

Secure Azure Data Source Access from Databricks

Bhavin Kukadia, Abhinav Garg, and Michal Marusan show us the right way to access Azure data sources from Azure Databricks:

Enterprise Security is a core tenet of building software at both Databricks and Microsoft, and thus it’s considered as a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity refers to ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options which customers could use to establish such a data access mechanism to Azure Data services like Azure Blob StorageAzure Data Lake Store Gen2Azure Synapse Data WarehouseAzure CosmosDB etc. Please read further for a discussion on Azure Private Link and Service Endpoints.

This is more about network configuration than about things like “store your credentials and other secrets in Azure Key Vault,” which is also a good idea.


Building Metadata for an ADF Pipeline

Paul Andrew continues a series on Azure Data Factory and metadata-driven pipelines:

Welcome back friends to part 2 of this 4 part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.

Let’s start with a nice new shiny Azure SQLDB database and schema. This can easily be scaled up as our calls from Data Factory increase and ultimately the solution we are using the framework for grows.

Soon we will get to see the power of Azure Data Factory in action.
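Paul builds out his own framework schema in the post; purely as a hedged sketch of the general shape (the schema, table, and column names below are hypothetical, not his actual design), a minimal metadata store that Data Factory could query might start like this:

```sql
-- Hypothetical, minimal metadata schema for a processing framework.
-- This is a sketch of the general idea, not Paul Andrew's actual design.
CREATE SCHEMA metadata;
GO

CREATE TABLE metadata.PipelineStage
(
    StageId        INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    StageName      NVARCHAR(100)      NOT NULL,
    ExecutionOrder INT                NOT NULL,
    Enabled        BIT                NOT NULL DEFAULT (1)
);

CREATE TABLE metadata.PipelineProcess
(
    ProcessId    INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    StageId      INT                NOT NULL
        REFERENCES metadata.PipelineStage (StageId),
    PipelineName NVARCHAR(200)      NOT NULL, -- the ADF pipeline to invoke
    Enabled      BIT                NOT NULL DEFAULT (1)
);
GO

-- Data Factory would then ask the framework what to run and in what order:
SELECT p.PipelineName
FROM metadata.PipelineProcess AS p
    INNER JOIN metadata.PipelineStage AS s
        ON s.StageId = p.StageId
WHERE s.Enabled = 1
    AND p.Enabled = 1
ORDER BY s.ExecutionOrder;
```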


Real-Time Replay with WorkloadTools

Gianluca Sartori shows us how to perform a real-time replay with WorkloadTools:

Before we jump to how, I’d better spend some words on why a real-time replay is needed.

The main reason is the complexity involved in capturing and analyzing a workload for extended periods of time. Especially when performing migrations and upgrades, it is crucial to capture the entire business cycle in order to cover all possible queries issued by the applications. All existing benchmarking tools require you to capture the workload to a file before it can be analyzed and/or replayed, but this becomes increasingly complicated as the length of the business cycle grows.

I’m not sure how frequently I’d use real-time replays, but it’s nice to know that it’s pretty easy to pull off with WorkloadTools.


Setting Up a SQL Server Lab with AutomatedLab

Jess Pomfret looks at a very interesting Powershell module:

There is a fantastic PowerShell module called AutomatedLab that enables you to easily build out a lab for the specific scenario you need to test. Even better, the module comes with 70 sample scripts that you can start with and adapt to meet your needs.

The module gives you the option to work with Hyper-V or VMware. I will say most of the examples are using Hyper-V, and that is what I’ll be using also.

For my lab I want a SQL Server 2019 instance joined to a domain, and a separate client machine that I can manage the SQL Server from. On the client I would need to be able to connect to the internet as I want to be able to download PowerShell modules from the gallery easily.

It’s about time for me to rebuild my lab, so I’ll need to check that out.


Approximate Distinct Count with DAX

Gilbert Quevauvilliers runs some performance tests against the approximate distinct count formula in DAX:

I am currently running SQL Server Analysis Services (SSAS) 2019 Enterprise Edition. (This can also be applied to Power BI)

My Fact table has got roughly 950 Million rows stored in it.

And as mentioned previously it has got over 64 Million distinct users.

The data is queried from SQL Server into SSAS.

Gilbert first checks how close these are and then how much faster the approximate count is.
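Gilbert’s tests are on the DAX side; as a point of reference, SQL Server 2019 exposes the same idea in T-SQL as APPROX_COUNT_DISTINCT. Here is a hedged sketch of the exact-versus-approximate comparison, with a hypothetical fact table and column:

```sql
-- Hypothetical fact table and column names. The point is the trade-off:
-- APPROX_COUNT_DISTINCT (SQL Server 2019) accepts a small error bound in
-- exchange for less memory and faster scans over high-cardinality columns.
SELECT
    COUNT(DISTINCT f.UserId)        AS ExactDistinctUsers,
    APPROX_COUNT_DISTINCT(f.UserId) AS ApproximateDistinctUsers
FROM dbo.FactUserActivity AS f;
```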


Pasting an R Plot into Word

Eran Raviv takes us through converting a plot in R to work with Microsoft Word:

In this post you will learn how to properly paste an R plot\chart\image into a Word file. There are a few typical problems that occur when people try to do that. Below you can find a simple, clean, and repeatable solution.
When you google how to paste a plot from R into a Word file, you find that there are some solutions, but they are not satisfactory. For example, the highest-ranking Stack Overflow reply suggests using the RStudio button to export your plot in Enhanced Metafile (EMF) format. A couple of things are wrong with that: the first is that you need to start messing around with the device scaling, because the export remembers the port dimensions. The second is that the Word file is often not the final version. For better readability\representation we often convert the Word file to PDF format before sending\publishing. But then you get something funny which you may have seen before, and which consumed much of some people’s time, myself included:

Also check out the linked blog post for additional insights into why this happened.


Incremental Imports with Sqoop

Jon Morisi continues a series on Sqoop:

In my last two blog posts I walked through how to use Sqoop to perform full imports. Nightly full imports with overwrite have their place for small tables like dimension tables. However, in real-world scenarios you’re also going to want a way to import only the delta values since the last time an import was run. Sqoop offers two ways to perform incremental imports: append and lastmodified.

Both incremental imports can be run manually or created as a job using the “sqoop job” command. When running incremental imports manually from the command line, the “--last-value” arg is used to specify the reference value for the check-column. Alternately, sqoop jobs track the “check-column” in the job, and the value of the check-column is used for subsequent job runs as the where predicate in the SQL statement, i.e. select columns from table where check-column > (last-max-check-column-value).
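To make that predicate concrete, here is roughly the statement an append-mode incremental import ends up issuing against the source; the table, columns, and last value below are placeholders for illustration, not Jon’s example:

```sql
-- Roughly the query generated by an append-mode incremental import:
-- pull only the rows added since the last recorded check-column value.
-- Table, columns, and the literal 1000 are placeholders.
SELECT OrderId,
       CustomerId,
       OrderDate,
       Amount
FROM Orders
WHERE OrderId > 1000; -- check-column > (last-max-check-column-value)
```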

This is where Sqoop starts to break down for me, and Jon lists some of the issues in the post.


LAST_VALUE() and Windows

Jeet Kainth explains the importance of specifying your window when using LAST_VALUE():

To return the actual last value, add the additional clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This ensures that for each row, the LAST_VALUE() function looks at all rows, from the very first row to the very last row.

Click through for the example. Remember that the default frame runs only from the start of the window to the current row, not across the entire data set.
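As a quick, hedged illustration of that difference (the table and values below are made up, not Jeet’s example):

```sql
-- Made-up sample data to show how the frame changes what LAST_VALUE() returns.
CREATE TABLE #Sales (SaleDate DATE, Amount INT);
INSERT INTO #Sales (SaleDate, Amount)
VALUES ('2020-01-01', 10), ('2020-01-02', 20), ('2020-01-03', 30);

SELECT SaleDate,
       Amount,
       -- Default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
       -- the "last" value is just the current row's own Amount.
       LAST_VALUE(Amount) OVER (ORDER BY SaleDate) AS LastValueDefaultFrame,
       -- Explicit frame over the whole window: the true last value (30).
       LAST_VALUE(Amount) OVER (ORDER BY SaleDate
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND UNBOUNDED FOLLOWING) AS LastValueFullFrame
FROM #Sales;

DROP TABLE #Sales;
```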


Building a Cartesian Product with Power Query

Reza Rad shows how to build a Cartesian product using Power Query:

Sometimes, you need to create a multiplication of all sets of all pairs from two different data tables in Power BI. This action is different from Merge (Join) because there are no matching key columns and no way to relate the two tables together. However, you want to create the multiplication as a flattened table. If this process needs to be done in Power Query, then there is a simple trick to do it. In this article, I’ll explain how you can do it.

Read on for that trick.
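Reza’s trick is done entirely in Power Query; for comparison, the same flattened all-pairs result on the database side is just a CROSS JOIN (hypothetical table names):

```sql
-- Every row of Colors paired with every row of Sizes, with no join key:
-- the relational equivalent of the Cartesian product built in Power Query.
SELECT c.ColorName,
       s.SizeName
FROM dbo.Colors AS c
    CROSS JOIN dbo.Sizes AS s;
```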
