Press "Enter" to skip to content

Curated SQL Posts

Power Query Container Size and Performance

Chris Web looks into what changing the Power BI Dataflow container size does for us:

Currently there is no way to change this 256MB in Power BI Desktop or Excel although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use.

Read on to understand the differences here between running on Power BI Desktop and running in the Power BI service, as well as a bit of testing on Chris’s part.

Comments closed

Time Series Analysis with Forensic Accounting

I have another part up in my series on forensic accounting techniques:

This post will be a pretty short one. In my talk, I don’t have any demos, mostly because much of cohort analysis has secretly been time series analysis at the same time. Instead, I’ll lob out a few points and call it a day.

Time series analysis, at its core, is all about how your data changes over time. The grain for time series analysis is important: as we saw in the last post, we were able to get an excellent result at the yearly level when regressing number of active buses versus number of line items.

Spoilers: it’s not as short as I thought it would be.

Comments closed

Building an AKS Cluster with Azure DevOps

Rob Sewell shows how to use Azure DevOps to build an AKS cluster with Terraform:

In the last few posts I have moved from building an Azure SQL DB with Terraform using VS Code to automating the build process for the Azure SQL DB using Azure DevOps Build Pipelines to using Task Groups in Azure DevOps to reuse the same Build Process and build an Azure Linux SQL VM and Network Security Group. This evolution is fantastic but Task Groups can only be used in the same Azure DevOps repository. It would be brilliant if I could use Configuration as Code for the Azure Build Pipeline and store that in a separate source control repository which can be used from any Azure DevOps Project.

Luckily, you can

And Rob shows us how it’s done.

Comments closed

Aggregating Only Visible Values in DAX

Alberto Ferrari shows how you can calculate a sum based only on the visible values and ignoring the missing cases:

These measures produce the correct figures month by month. However, at the year level the number of working days might be too big whenever there are incomplete months – this always happens before the end of the year, or when the first month with sales is not January as in the following example. As you can see in the following figure, in this case the yearly value of SalesPerWorkingDays is lower than the value of any month, which does not make much sense.

Click through for the solution.

Comments closed

The Importance of Aliasing in Subqueries

Gail Shaw explains an unexpected result when writing a statement with a subquery:

The column name in the temp table is missing an I, probably just a typo, but it has some rather pronounced effects.

The obvious next question is why the select with the subquery in it didn’t fail, after all, the query asks for ClientID from #TempClients, and there’s no such column. However there is a ClientID column available in that query, and it’s in the Orders table. And that’s a valid column for the subquery, because column binding order, when we have subqueries, is first to tables within the subquery, and then, if no match is found, to tables in the outer query.

I know that the first time I experienced this, I thought it was a bug as well. But no, it’s by design and Gail explains why.

Comments closed

Permissions Requirements for ML Services

Niels Berglund looks at the permissions required to create external libraries with SQL Server Machine Learning Services:

This post is the fourth in a series about installing R packages in SQL Server Machine Learning Services (SQL Server ML Services). To see all posts in the series go to Install R Packages in SQL Server ML Services Series.

Why this series came about is a colleague of mine Dane pinged me and asked if I had any advice as he had issues installing an R package into one of their SQL Server instances. I tried to help him and then thought it would make a good topic for a blog post. Of course, at that time I didn’t think it would be more posts than one, but here we are.

These permissions are a bit more complicated than they might first appear to be.

Comments closed

Naive Bays in R

Zulaikha Lateef takes us through the Naive Bayes algorithm and implementations in R:

Naive Bayes is a Supervised Machine Learning algorithm based on the Bayes Theorem that is used to solve classification problems by following a probabilistic approach. It is based on the idea that the predictor variables in a Machine Learning model are independent of each other. Meaning that the outcome of a model depends on a set of independent variables that have nothing to do with each other. 

Naive Bayes is one of the simplest algorithms available and yet it works pretty well most of the time. It’s almost never the best solution but it’s typically good enough to give you an idea of whether you can get a job done.

Comments closed

Flink: Batch as a Special Case of Streaming

Fabian Hueske and Aljoscha Krettek describe streaming versus batch processing in Apache Flink:

The Apache Flink project has followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams” for a long time. If you think about it, carrying out offline processing of bounded data sets naturally fits the paradigm: these are just streams of recorded data that happen to end at some point in time.

Flink is not alone in this: there are other projects in the open source community that embrace “streaming first, with batch as a special case of streaming,” such as Apache Beam; and this philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.

Check it out. At the end, the authors also describe Blink, a fork of Flink being (slowly) merged back in and which supports this paradigm.

Comments closed

Multi-Tenant Security in Kudu + Impala

Grant Henke shows how you can combine Apache Impala’s fine-grained authorization with Apache Kudu’s coarse-grained authentication for multi-tenant scenarios:

Kudu supports coarse-grained authorization of client requests based on the authenticated client Kerberos principal. The two levels of access which can be configured are:
1. Superuser – principals authorized as a superuser are able to perform certain administrative functionality such as using the kudu command line tool to diagnose or repair cluster issues.
2.User – principals authorized as a user are able to access and modify all data in the Kudu cluster. This includes the ability to create, drop, and alter tables as well as read, insert, update, and delete data.

Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the two levels. 

Read on to see how to tie it all together.

Comments closed

Load Testing Tools For SQL Server

Brent Ozar shares a list of load testing tools for SQL Server:

One thing I need you to understand first: you have to provide the database and the queries. Almost all of the tools in this post, except the last one, are designed to help you run queries, but they don’t include the queries. The whole idea with load testing is that you’re trying to mimic your own workloads. If you’re just trying to test a server with generic workloads, start with my post, “How to Check Performance on a New SQL Server.”

Click through for a list of tools. I’d also throw in Pigdog from Mark Wilkinson (one of my co-workers). This helped replicate a few issues in SQL Server 2017 around tempdb performance.

Comments closed