Press "Enter" to skip to content

Month: March 2020

Data Exfiltration Protection when Using Azure Databricks

Bhavin Kukadia, et al, explain how to prevent users from taking data from your Databricks cluster without authorization:

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

Click through for the architectural pattern.

Comments closed

Deconstructing Running Totals with M

Cedric Charlier shows how to turn a running total into a periodic series of events with M:

When dealing with time series, you should always check if your time series are event-based or cumulative. For a time series with a step of one day, an event-based time series will contain the count of events for a given day. A cumulative time series will contain the sum of the event of that day and of all the previous days! In this blog post I’ll explain how to transform a cumulative time series into an event-based.

Click through for the code. You can do this in T-SQL as well by subtracting the value from its LAG()-ged value.

Comments closed

A New Powershell Module for SQL Server Security

Stuart Moore introduces dbaSecurityScan:

How easy it to audit them? If someone asks you the DBA exactly who has access to object A, can you tell them? How do people get access to that object, is it via a role, a schema or an explicit permission?

Is that information in an easy to read or manipulate manner?

How do you ensure that permissions persist between upgrades? I’ve certainly seen 3rd party upgrades that have reset database level permissions. Do you have a mechanism to check every permission and put them back as they were?

We’re all doing the devops these days. Our database schema is source controlled, and we’re deploying it incrementally in pipelines and testing it. But are we doing that with our database security?

So in the classic open source way, I decided to scratch my own itch by writing something. That something is dbaSecurityScan, a PowerShell module that aims to offer a solution for all of the above.

Click through to see what dbaSecurityScan covers today, how to call it, and what you can do to get more info.

Comments closed

Using Jupyter Notebooks in SQL Agent Jobs

Rob Sewell shows us how to run an Azure Data Studio notebook as a SQL Agent job:

Azure Data Studio is a great tool for connecting with your data platform whether it is in Azure or on your hardware. Jupyter Notebooks are fantastic, you can have words, pictures, code and code results all saved in one document.

I have created a repository in my Github https://beard.media/Notebooks where I have stored a number of Jupyter notebooks both for Azure Data Studio and the new .NET interactive notebooks.

Another thing that you can do with notebooks is run them as Agent Jobs and save the results of the run.

Read on to learn how.

Comments closed

Per-Query Wait Stats with Extended Events

Grant Fritchey shows us how to see the waits associated with a specific query:

And that my friends is only the waits associated with the one query. TA-DA indeed!

Now, we could get into filtering this stuff too. Toss the ones that have little to no duration, ensure that I only capture for a specific query or procedure, all would be helpful. However, this is how you can easily identify just the waits associated with a single query, and only that query.

Click through to see how. The one thing I’d caution here is that the query which received waits isn’t necessarily the query in the wrong—it might be the fourth or fifth session in a blocking chain. But this is a great technique for getting additional per-query info when you can control the experiment.

Comments closed

Replaying Workloads to a Different Database with WorkloadTools

Gianluca Sartori takes us through workload replay ability in WorkloadTools:

One of the features I was asked to implement for WorkloadTools is the ability to replay commands to a database name different from the one recorded in the source workload.

This is something that I had been planning to implement for a while and it totally makes sense. Usually, you have two identical environments for the workload capture and replay, both with the same databases. Sometimes it makes sense to have two different databases as the source and target for the workload, for some particular reasons: resources constraints, ease of testing and so on.

WorkloadTools now supports replaying commands to a different database, using the DatabaseMap property of the ReplayConsumer.

Setting this up is pretty simple, though Gianluca does lay out a caveat.

Comments closed

Parameterization and Enumerable.Contains() with EF Core 3

Erik Ejlskov Jensen explains how we can prevent Entity Framework Core 3 from polluting the plan cache if we use Enumerable.Contains():

One of the many advantages of using a tool like Entity Framework Core is, that you are sure that the framework will generate properly parameterized SQL for you. This helps avoid SQL injection issues and avoids plan cache pollution. Unfortunately, EF Core currently falls short on that promise, when translating queries, where you supply a list of values to be matched against a column – Enumerable.Contains method – this is translated to a SQL Server IN operator

Click through for a setup of the problem as well as the solution.

Comments closed

Installing Apache Airflow

Achilleus walks us through a process to install Apache Airflow on a machine:

Airflow is an amazing tool by Airbnb and is a kinda defacto standard of ETL deployments in the Data Engineering domain nowadays. But at the same time, you can also use Airflow to schedule to ML pipeline and automate the whole ML pipeline(almost).

This is my attempt to install and set up a fairly robust Apache Airflow deployment for my needs. I am pretty sure there might be some better ways of doing it or add any enhancements to it. Any comments or suggestions are highly appreciated!

This is an easy-to-follow set of steps, so check it out.

Comments closed

Tuning Random Forest HyperParameters with R

Julia Silge gives us an idea of how to tune random forest hyperparameters in R:

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳

Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.

Click through to some data exploration, the initial model, and a process for using Grid Search with the caret package.

Comments closed