
Month: October 2019

Data Retrieval Bug Fixed for Columnstore Indexes

Dmitri Korotkevitch takes us through an important bugfix in SQL Server:

The typical columnstore table is usually large and contains hundreds of millions or even billions of rows. Think about large fact tables in data warehouses or huge transactional tables in OLTP systems. Those tables are usually partitioned. Besides the usual reasons (availability, maintainability, etc.), partitioning helps with the data load – it is easier to perform ETL in a staging table and import data through a partition switch.

And here comes the problem. If you run an OLTP query against a partitioned clustered columnstore table and end up with an execution plan that uses index intersection of nonclustered B-Tree indexes, you may get incorrect results.

Getting the correct results in a query is pretty important (he says, with understatement), so this is an important bugfix; keep those SQL Server instances patched accordingly.


Column Masking with SSAS and Power BI

Dustin Ryan digs deep into the bag of tricks:

Last week I was asked to tackle a requirement by a customer adopting Analysis Services to enable data exploration and ad hoc analysis by their users. One of their requirements was to secure columns based on a grant related to a cost center. For example, a grant has several attributes, with some attributes being considered “sensitive” and other attributes considered “non-sensitive”. Non-sensitive grant attributes would be accessible to all users, while the subset of attributes in the grant table considered “sensitive” would be accessible only to users related to the corresponding cost center. The challenge here is that while Analysis Services supports column level security, dynamic column level security is not supported. So my colleague and friend, the great Steve Pontello, and I put our heads together to address the requirement.

Read on to see how they did it and the resulting solution’s limitations.


Automatic Power BI Page Rotation

Kasper de Jonge shows how you can make Power BI report pages automatically rotate:

This question comes up pretty regularly: I have a big screen in my hallway and I want to show some Power BI reports that rotate. There is some built-in functionality for this in the Windows 10 app for Power BI. But there have been some cases where this doesn’t work, for example if you want your report to run outside of your domain using B2B, or when using Power BI Embedded, neither of which can load the report in the Windows 10 app.

For those cases, Kasper has a solution involving an external Chrome add-on.


Overlooked T-SQL Functions

Itzik Ben-Gan covers some underutilized functions and function overloads in T-SQL:

TRIM is more than LTRIM(RTRIM())
SQL Server 2017 introduced support for the function TRIM. Many people, myself included, initially just assume that it’s no more than a simple shortcut to LTRIM(RTRIM(input)). However, if you check the documentation, you realize that it’s actually more powerful than that.

This article is an excellent argument in favor of reading the documentation, as all of it is in there but it’s easy to miss.
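As a rough illustration of why a trim with a character list is more than a whitespace shortcut, here is a minimal sketch in Python rather than T-SQL. Python's `str.strip(chars)` is used as an analogy for the documented `TRIM([characters FROM] string)` form; the sample string is made up for illustration and does not come from Itzik's article.

```python
# Hypothetical sample value with punctuation and spaces on both ends.
raw = "..,  report.csv ,,."

# A plain whitespace trim changes nothing: the string starts and ends with dots.
whitespace_only = raw.strip()

# Trimming a set of characters removes any of them from both ends,
# but leaves the dot in the middle of the value alone.
custom = raw.strip("., ")

print(repr(whitespace_only))
print(repr(custom))
```

In T-SQL the same idea would be written along the lines of `TRIM('., ' FROM @raw)`, where `@raw` is a hypothetical variable.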


Date and Time Aggregation in T-SQL

Daniel Hutmacher shows how you can aggregate date and time data types in SQL Server:

Because the “Duration” column is a “time(3)” datatype, it doesn’t really aggregate into an average that easily.

Msg 8117, Level 16, State 1, Line 20
Operand data type time is invalid for avg operator.

This may seem a little odd, as time is really a continuum just like any integer or floating-point value, right?

Read on to see how to do this and a warning from Daniel about overflowing. This sort of aggregation is a lot easier to do in R, but you can still do it in T-SQL.
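One general way around this kind of limitation is to convert each time value to a number of milliseconds since midnight, average the integers, and convert the result back. The sketch below shows that idea in Python rather than T-SQL. It is not Daniel's code, and the helper names are hypothetical.

```python
from datetime import time


def time_to_ms(t: time) -> int:
    """Milliseconds since midnight, truncating anything finer than 1 ms."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000


def ms_to_time(ms: int) -> time:
    """Inverse of time_to_ms for values below 24 hours."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return time(hours, minutes, seconds, millis * 1000)


def average_time(values: list[time]) -> time:
    """Average a list of time values, rounding down to a whole millisecond."""
    if not values:
        raise ValueError("cannot average an empty list")
    total_ms = sum(time_to_ms(v) for v in values)
    return ms_to_time(total_ms // len(values))


durations = [time(0, 1, 0), time(0, 3, 0), time(0, 2, 0, 500000)]
print(average_time(durations))
```

Python integers are arbitrary precision, so the sum here cannot overflow; with a fixed-width integer type the running total would need more care.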


Differences in Spark RDDs and DataSets

Brad Llewellyn looks at some of the differences between RDDs and DataSets in Spark:

We see that there are some differences between filtering RDDs, Data Frames and Datasets.  The first major difference is the same one we keep seeing, RDDs reference by indices instead of column names.  There’s also an interesting difference of using 2 =’s vs 3 =’s for equality operators. Simply put, “==” tries to directly equate two objects, whereas “===” tries to dynamically define what “equality” means.  In the case of filter(), it’s typically used to determine whether the value in one column (income, in our case) is equal to the value of another column (string literal “<=50K”, in our case).  In other words, if you want to compare values in one column to values in another column, “===” is the way to go.

Interestingly, there was another difference caused by the way we imported our data.  Since we custom-built our RDD parsing algorithm to use <COMMA><SPACE> as the delimiter, we don’t need to trim our RDD values.  However, we used the built-in sqlContext.read.csv() function for the Data Frame and Dataset, which doesn’t trim by default.  So, we used the ltrim() function to remove the leading whitespace.  This function can be imported from the org.apache.spark.sql.functions library.

Read on for more, including quite a few code samples.
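The delimiter point in the quoted passage is easy to reproduce outside Spark. The following is a minimal sketch using Python's standard `csv` module as a stand-in for a generic CSV reader; it is not Brad's Scala code, and the sample line is a made-up record for illustration.

```python
import csv
import io

# Hypothetical record using comma-plus-space between fields.
line = "39, State-gov, <=50K"

# Custom parsing: splitting on ", " leaves no leading whitespace in the values.
custom = line.split(", ")

# Generic CSV parsing: the delimiter is just ",", so the space stays in each value.
generic = next(csv.reader(io.StringIO(line)))

# Stripping leading whitespace afterwards plays the role ltrim() has in the quote.
trimmed = [value.lstrip() for value in generic]

print(custom)
print(generic)
print(trimmed)
```

Python's `csv.reader` also accepts `skipinitialspace=True`, which drops the whitespace immediately after each delimiter at read time instead of in a later step.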


The Flexible Data Lake

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.

Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.

Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.

Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining and operating the environment which can grow quite complicated as more requirements come into play.

The bottom line here is that there’s no magic in Hadoop. Like any other technology, you can typically achieve one or at best two of these facets, but in the absence of an unlimited budget, you typically need to sacrifice in some way.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.


Debugging with PerfView

Erik Darling takes us through PerfView:

If you seriously want to debug, you’ve gotta use something like WinDbg. It’s tough, but powerful.

It’s also… Where do you even start? The most common thing to do is set a Break Point.

Figuring out where you wanna set it can be challenging.
– Sometimes you can capture a thread and step through the code.
– Sometimes you can find the call stack you want by searching through symbols.

The problem with these is that stepping through code that runs in milliseconds has taken me over an hour at times, and call stacks often have cryptic names.

This is where tools like PerfView can help.

Erik does an excellent job going through PerfView and introducing people to debugging SQL Server.


Using Azure Kubernetes Service for Big Data Clusters

Mohammad Darab explains why it’s a good idea to use Azure Kubernetes Service when building out a Big Data Cluster:

According to the Microsoft documentation, there are three ways to deploy a Big Data Cluster:

1. Minikube
2. Kubeadm
3. AKS

I’ll go into each and list the pros and cons.

Of course, if you have a great Kubernetes admin, on-prem is certainly a viable option, but AKS is definitely easier to get started with.
