Hierarchical Data Cleansing With Power BI

Cedric Charlier has started a series on dealing with hierarchical data in a not-so-hierarchical format:

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” that won’t be enabled to load in report.

My next task will be to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by categories and sub-categories.

Let’s create a new table named “Question” by referencing the “Source” table. Then remove the other columns than A and B.

The curse of Excel is that it’s so easy to build a data set in strange ways that make it hard to integrate later.

Diagnosing Execution Plan Oddities

Kendra Little digs into an oddly complex execution plan:

Aha! This is a definite clue. Some sort of security wizardry has been applied to this table, so that when I query it, a bunch of junk gets tacked onto my query.

I have no shame in admitting that I couldn’t remember at all what feature this was and how it works. A lot of security features were added in SQL Server 2016, and the whole point of a sample database like this to kick the tires of the features.

Kendra’s post frames it as an impostor syndrome check, whereas I read it as a murder mystery.

SSRS Log File Location Change

Wolfgang Strasser points out that SSRS log files are in a new directory structure for vNext:

The log files can be found in the Logfiles directory (it was the same directory also for the older versions). In SSRS vNext there more different log files..

The logging information seems to be splitted into multiple log files – if you want for example dig into the Power BI on-premises logging I propose to have a look at the RSPower*.log files.

This happens every once in a while, so it’s good to know when the log files move somewhere else.

DILM In Practice

Andy Leonard continues his series on data integration lifecycle management with a discussion of package workflow:

The “Stage EDW Data” Framework Application is identified by ApplicationID 2. If you recall, ApplicationID 2 is mapped to PackageIDs 4, 5, and 6 (LoadWidgetsFlatFile.dtsx, LoadSalesFlatFile.dtsx, and ArchiveFile.dtsx) in the ApplicationPackages table shown above.

The cardinality between Framework Applications and Framework Packages is many-to-many. We see an Application can contain many Packages. The less-obvious part of the relationship is represented in this example: a single Package can participate in multiple Applications or even in the same Application more than once. Hence the need for a table that resolves this many-to-many relationship. I hope this helps explain why Application Package is the core object of our SSIS Frameworks.

Read on for the rest of the story.

Checking Database Availability

Dimitri Furman explains that database availability is a trickier problem than it may first appear:

To check the availability of the database, the application executes the spCheckDbAvailability stored procedure. This starts a transaction, inserts a row into the AvailabilityCheck table, flushes the data to the transaction log to ensure that the write is persisted to disk even if delayed durability is enabled, explicitly reads the inserted row, and then rolls back the transaction, to avoid accumulating unnecessary synthetic transaction data in the database. The database is available if the stored procedure completes successfully, and returns a single row with the value 1 in the single column.

Note that an execution of sp_flush_log procedure is scoped to the entire database. Executing this stored procedure will flush log buffers for all sessions that are currently writing to the database and have uncommitted transactions, or are running with delayed durability enabled and have committed transactions not yet flushed to storage. The assumption here is that the availability check is executed relatively infrequently, e.g. every 30-60 seconds, therefore the potential performance impact from an occasional extra log flush is minimal.

This ends up being much more useful than a simple “is the service on?” heartbeat, as it shows that the database is available, not just that the engine is running.

Closing SSMS Tabs

Andy Mallon shows how to close a tab in SSMS and change the shortcut value:

CTRL + F4

Huzzah! This will close your query tab! Easy as pie. Also, note that CTRL+F4 is a pretty universal shortcut that works to close tabs/windows in other applications, too–including your favorite web browser. If you can switch your muscle memory away from CTRL+W to CTRL+F4, you can use that shortcut pretty much everywhere.

Click through for more information on changing shortcuts.

Combining Files In C#

Chris Koester shows how to combine a set of CSVs without duplicating their header rows:

The timings in this post came from combining 8 csv files with 13 columns and a combined total of 9.2 million rows.

I first tried combining the files with the PowerShell technique described here. It was painfully slow and took an hour and a half! This is likely because it is deserializing and then serializing every bit of data in the files, which adds a lot of unnecessary overhead.

Next I tried the C# script below using LINQPad. When reading from and writing to a network share, it took 3 minutes and 56 seconds. Much better! Next I tried it on a local SSD drive and it took just 44 seconds.

Read on for the script itself.  The ReadAllLines method works fine as long as the file isn’t larger than your working memory.

Categories

January 2017
MTWTFSS
« Dec Feb »
 1
2345678
9101112131415
16171819202122
23242526272829
3031