Press "Enter" to skip to content

Month: March 2019

Date Dimensions and Large Power BI Files

Reza Rad explains why your Power BI data size might be abnormally large:

If you have a large *.pbix file, you can investigate what are the columns and tables that causing the highest storage consumption, using Power BI Helper. You can download Power BI Helper for free from here. I opened the file above in Power BI Helper, and in the Modeling Advise tab, this is what I see:

As you can see in the above output, the Date field in the Date table is the biggest column in this dataset. taking 150MB runtime memory! This is considering that we have only three distinct values in the column! Seems a bit strange, isn’t? let’s dig into the reason more in deep.

Read on for Reza’s explanation and what you can do to fix it.

Comments closed

Pacemaker Changes Affecting SQL on Linux

Randolph West has an important message if you’re running SQL Server on Linux:

Heads up for SQL Server on Linux folks using availability groups and Pacemaker. Pacemaker 1.1.18 has been out for a while now, but it’s worth mentioning that there was a behaviour change in how it fails-over a cluster. While the new behaviour is considered “correct”, it may affect you if you’ve configured availability groups on a previous version (specifically 1.1.16).

Click through for more details and what you can do about this.

Comments closed

Persisting Databases with Docker Compose

Rob Sewell tries out docker-compose to persist databases using named volumes:

With all things containers I refer to my good friend Andrew Pruski. Known as dbafromthecold on twitter he blogs at https://dbafromthecold.com

I was reading his latest blog post Using docker named volumes to persist databases in SQL Server and decided to give it a try.

His instructions worked perfectly and I thought I would try them using a docker-compose file as I like the ease of spinning up containers with them.

Read on for Rob’s travails, followed by great success. And never go into something named “the spooky basement;” that’s just good life advice.

Comments closed

Apache Druid Concepts

Jatin Demla takes us through some of the key concepts behind Apache Druid:

Apache Druid is a distributed, high-performance columnar store for real-time analytics on a large dataset. Druid core design combines the OLAP analytics, time series database and search system to create a single operational analysis. Druid is most suitable for data with high cardinality column or queries having higher aggregation or group by.

Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.

Comments closed

Residual Analysis with R

Abhijit Telang shares a few techniques for doing post-regression residual analysis using R:

Naturally, I would expect my model to be unbiased, at least in intention, and hence any leftovers on either side of the regression line that did not make it on the line are expected to be random, i.e. without any particular pattern.

That is, I expect my residual error distributions to follow a bland, normal distribution.

In R, you can do this elegantly with just two lines of code. 
1. Plot a histogram of residuals 
2. Add a Quantile-Quantile plot with a line that passes through, namely, the first and third quantiles.

There are several more techniques in here to analyze residuals, so check it out.

Comments closed

Generating TPC-DS Data Sets with HDInsight

Chris Koester shows how you can generate artificial data sets in the TCP-DS format using HDInsight:

This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is for benchmarking purposes, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.

The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to setup, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.

Most of the instructions should follow through to work with on-prem or non-HDInsight Hadoop clusters, though there will be some changes to accommodate differences in HDInsight.

Comments closed

Row Versioning and 14 Bytes

Kendra Little explains why enabling row versioning adds 14 bytes per row:

I love it when someone sends me a repro script, but in this case I didn’t need to run it. The first thing I did was to look at the two numbers given for row size, and to subtract the smaller one from the larger one: 724 – 710 = 14 bytes of difference.

That bit of information alone gave me an immediate guess of what was going on.

Click through for the solution as well as a more detailed explanation of one of the trickier scenarios.

Comments closed

Auto-Escaping XML Characters

Emanuele Meazzo shows how you can auto-escape XML characters using T-SQL:

Recently I had to look up the definition for a bunch of SQL objects and didn’t want to manually retrieve them manually in SSMS (with Create Scripts) or Visual Studio (by searching the object name in my TFS repository).

Since lazyness and automation are the basis of a well done engineering work, I wanted to create a list, where I could basically click on the object that I needed and see the definition right away, without any tool or having to code something externally, of course.

Click through for the solution, which is short and sweet.

Comments closed

Cancelling Resumable Index Creation

Brent Ozar takes us through a couple considerations when using online, resumable index creation:

In SSMS, you’re used to being able to click the “Cancel” button on your query, and having your work rolled back.

You’re also used to being able to kill a query, and have it automatically roll back.

Neither of those are true with resumable index creations. In both cases, whether you kill the index creation statement or just hit the Cancel button in SSMS to abort your request, your index creation statement is simply paused until you’re ready to come back to it. (Or, it’s ready to come back to haunt you, as we saw above.)

There are some good things to keep in mind here.

Comments closed

Finding Missing Values with Tally Tables

David Fowler shows one way to find missing values using a tally table:

This is going to be a bit of a brain storming post that comes from an interesting question that I was asked today…

“I’ve got a table with a ID code field, now some of the rows have a value in that field and some are NULL. How can I go about filling in those NULL values with a valid code but at the same time avoid introducing duplicates?”

Click through for David’s solution.

Comments closed