Press "Enter" to skip to content

Month: August 2020

Fixing Bad Data in Power BI

Matt Allington does the thing you shouldn’t (often) do:

Let me make this statement upfront and be clear. The best way to solve problems with source data is to go back to the source and correct the problems there. This is my recommendation on how you should solve such issues. However, sometimes that is not possible for whatever reason. This article will explain how you can use Power Query to override incorrect data during load when you can’t change it at the source.

This is the classic BI tool quandary: the best solution is, as Matt mentions, to fix the source system. But when that’s not on the table—such as when you’re getting data from a third party—Matt has methods to work through data issues.
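
To make the idea concrete, here is a minimal Power Query sketch of overriding a bad value during load. This is not Matt’s code; the workbook, sheet, column, and values are all hypothetical.

    let
        Source = Excel.Workbook(File.Contents("C:\Data\Sales.xlsx"), null, true),
        SalesSheet = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
        Promoted = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true]),
        // Override a known-bad value during load rather than at the source
        Fixed = Table.ReplaceValue(Promoted, "Pensylvania", "Pennsylvania", Replacer.ReplaceText, {"State"})
    in
        Fixed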

Comments closed

Release Rollback with Helm

Andrew Pruski shows the secret of how Helm lets you roll back releases even when deployments are deleted:

If we rollback with kubectl rollout undo the pods in the newest replicaset are deleted, and pods in an older replicaset are spun back up, rolling back the upgrade.

But there’s a potential problem here. What happens if that old replicaset is deleted?

If that happens, we wouldn’t be able to roll back the upgrade. Well, we wouldn’t be able to roll it back with kubectl rollout undo, but what happens if we’re using Helm?

Read on to learn how the whole thing works.
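
As a rough sketch of the two approaches (the deployment and release names here are hypothetical):

    # Works only if the previous ReplicaSet still exists
    kubectl rollout undo deployment/my-app

    # Helm keeps its own release history, so it can restore an earlier revision
    helm history my-release
    helm rollback my-release 1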

Comments closed

What’s New in Kafka 2.6?

Randall Hauch takes us through the changes in Apache Kafka 2.6:

We’ve made quite a few significant performance improvements in this release, particularly when the broker has larger partition counts. Broker shutdown performance is significantly improved, and performance is dramatically improved when producers use compression. Various aspects of ACL usage are faster and require less memory. And we’ve reduced memory allocations in several other places within the broker.

This release also adds support for Java 14. And over the past few releases, the community has switched to using Scala 2.13 by default and now recommends using Scala 2.13 for production.

Finally, these accomplishments are only one part of a larger active roadmap in the run up to Apache Kafka 3.0, which may be one of the most significant releases in the project’s history. 

Read on to learn more.

Comments closed

EXTPTR_PTR Error with Rcpp

Rick Pack walks us through an error in R:

I experienced a need to update Rcpp when I attempted to install the readxlsb R package, which promised to enable me to read .xlsb files in R.

What happened next has been forgotten: Did the attempted update of Rcpp appear to succeed or fail? I did record that my attempted installation of readxlsb still failed and I now experienced an unfamiliar error when I opened and closed R Studio:

“The procedure entry point EXTPTR_PTR could not be located in the dynamic link library”

Read on to see how Rick solved this problem.
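
If you hit the same error, the usual first move (a guess at the general shape of the fix, not necessarily Rick’s exact steps) is to reinstall Rcpp after restarting R, then retry the package that depends on it:

    # Reinstall Rcpp, then retry the dependent package
    install.packages("Rcpp")
    install.packages("readxlsb")
    library(readxlsb)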

Comments closed

Creating Nonsense Documents with Powershell

Jeffrey Hicks has a nonsense generator:

Today I thought I’d share my PowerShell solution to a recent Iron Scripter challenge. The challenge was to create PowerShell code that would create nonsense documents, with a goal of creating 10 sample files filled with gibberish. Yes, other than maybe wanting some test files to work with, on its face the challenge appears pointless.  However, as with all of these challenges, or even the ones in The PowerShell Practice Primer, the journey is the reward. The true value is learning how to use PowerShell, and maybe discovering a new technique or command. The hope is that during the course of working on the challenge, you’ll improve your PowerShell scripting skills. And who knows, maybe even have a little fun along the way.

It’s not quite up to the level of quality that you find in post-modern academic papers, but it’s getting there.
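
The general shape of such a generator (a quick sketch, not Jeff’s solution) might look like this:

    # Write ten files, each filled with random-letter "words"
    $letters = [char[]](97..122)
    1..10 | ForEach-Object {
        $words = 1..200 | ForEach-Object {
            -join (Get-Random -InputObject $letters -Count (Get-Random -Minimum 3 -Maximum 10))
        }
        Set-Content -Path ".\nonsense$_.txt" -Value ($words -join ' ')
    }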

Comments closed

The Dunder Mifflin Data Set

Tim Mitchell has a new data set for us:

I’ve been a fan of Dunder Mifflin ever since I first learned about this small midwestern paper company. Over the years I’ve gotten to know their people and processes, following from a distance their successes, failures, and various adventures. Who would have known the paper business would be so interesting?

Based on what I learned about this company, I built this Dunder Mifflin data set based on the old Northwind structure, adapting it to meet the needs of this small paper company. It includes most of the employees, regional locations (both current and now-closed), and has a modestly-sized set of sales data for demos and testing.

Check out Tim’s GitHub repo and give it a try.

Comments closed

Power BI Desktop: Analyze in Excel

Marco Russo has gone and done it now:

Less than a month ago, Microsoft introduced the External Tools feature in the Power BI Desktop July 2020 release. By using DAX Studio, you were already able to create a PivotTable in Excel connected to the model hosted by Power BI Desktop. However, this would require three clicks (DAX Studio / Advanced / Excel). This is why I thought the External Tools feature was something many users would like to use without having to open – or even install – a larger tool like DAX Studio.

It’s interesting to see what the community has made so far from the External Tools feature.

Comments closed

The Performance Hit of Disabling the Identity Cache

Tibor Karaszi explains why you probably want to keep identity caching on:

Should you care about the gap? In most cases: no. The identity value should be meaningless. In many cases I think that it is just an aesthetic issue to not have these gaps. (I’ve seen cases where you do run into problems because of the gap, I should add – but not frequently.)

For the SEQUENCE object, we have the CACHE option to specify how many values to cache. I.e., max values we can jump if we have a hard shutdown.

For identity, we have the IDENTITY CACHE database scoped configuration, introduced in SQL Server 2017. Caching on or off. On is default. We also have trace flag 272, at the instance level.

However, disabling the caching isn’t free. 

In an ideal world, there are zero cases where you care about the gap. Identity integers and sequences are surrogate keys, and “surrogate” here means that it has no inherent business value—otherwise it’d be a natural key. Subsequently ascribing value to it is folly, and if you are in a scenario in which you need guaranteed sequences which always increase by exactly 1 and never have gaps (think something like check numbers or invoice numbers, things which accountants really want to see in a fixed order), identity integers and sequences aren’t the right tools for you.

But read on to see how much faster caching of identity values can make insert performance.
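
For reference, the two knobs Tibor mentions look like this (hypothetical object names):

    -- SEQUENCE: pre-allocate 50 values at a time; a hard shutdown can lose up to 50
    CREATE SEQUENCE dbo.InvoiceSeq AS int START WITH 1 INCREMENT BY 1 CACHE 50;

    -- Identity caching, database-scoped (SQL Server 2017+); ON is the default
    ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;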

Comments closed

Modifying Graph Edges with T-SQL

Louis Davidson shows how you can update edges in SQL Server’s graph functionality:

As I have been writing a section on SQL Server graph tables in my Database Design book, (and prepping for a hopeful book on the subject next year), I find that there are a few really annoying things about dealing with graph tables. This blog serves to clear up the first, most annoying of them. Inserting, updating, and deleting edges.

Because the key values in the graph database structures are hidden, you can’t just insert a new edge without translating your table’s key values to the graph database internal values. Edges aren’t even available for an update of the from or to references. As I wrote stored procedures to do this, I realized “why not use a view and trigger to make this happen”. So I did. The result is that I can insert, delete, and even update graph tables using normal SQL syntax. What makes this better than the stored procedure is that I can insert multiple rows simultaneously.

Read on to see what this entails.
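
To see why this is a hassle, here is roughly what a plain edge insert looks like (hypothetical node and edge tables, not Louis’s code); every endpoint has to be translated into its hidden $node_id value:

    -- Insert an edge by looking up the internal node IDs on both sides
    INSERT INTO dbo.ReportsTo ($from_id, $to_id)
    SELECT e.$node_id, m.$node_id
    FROM dbo.Employee AS e
        CROSS JOIN dbo.Employee AS m
    WHERE e.EmployeeName = N'Jim'
      AND m.EmployeeName = N'Michael';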

Comments closed

The Downside of EAV-Style Measures in Power BI

Chris Webb explains why you should try to stick to the fact-dimensional model in Power BI:

In this fact table the dimension keys remain the same, but the Value column stores all the data from the Sales, Tax and Volume Sold measures in the original table and the Measure Name column tells you what type of measure value is stored on any given row. Let’s call this approach the Measures Dimension approach.

There are some advantages to building fact tables using the Measures Dimension approach, for example:

– You can now use a slicer in a report to select the measures that appear in a visual
– You can now easily add new measures without having to add new columns in your fact table
– You can use row-level security to control which measures a user has access to

Generally speaking, though, any time you deviate from a conventional dimensional model you risk running into problems later on and this is no exception. Let’s go through the disadvantages of modelling data using a Measures Dimension.

Read on for several good reasons (and yes, “things are formatted wrong” is a good reason!).
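
For context, a fact table built with the Measures Dimension approach holds rows shaped something like this (made-up values):

    Date        Product    Measure Name    Value
    2020-08-01  Paper      Sales           100
    2020-08-01  Paper      Tax             10
    2020-08-01  Paper      Volume Sold     25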

Comments closed