Press "Enter" to skip to content

Month: January 2017

DILM In Practice

Andy Leonard continues his series on data integration lifecycle management with a discussion of package workflow:

The “Stage EDW Data” Framework Application is identified by ApplicationID 2. If you recall, ApplicationID 2 is mapped to PackageIDs 4, 5, and 6 (LoadWidgetsFlatFile.dtsx, LoadSalesFlatFile.dtsx, and ArchiveFile.dtsx) in the ApplicationPackages table shown above.

The cardinality between Framework Applications and Framework Packages is many-to-many. We see an Application can contain many Packages. The less-obvious part of the relationship is represented in this example: a single Package can participate in multiple Applications or even in the same Application more than once. Hence the need for a table that resolves this many-to-many relationship. I hope this helps explain why Application Package is the core object of our SSIS Frameworks.

Read on for the rest of the story.

Comments closed

Checking Database Availability

Dimitri Furman explains that database availability is a trickier problem than it may first appear:

To check the availability of the database, the application executes the spCheckDbAvailability stored procedure. This starts a transaction, inserts a row into the AvailabilityCheck table, flushes the data to the transaction log to ensure that the write is persisted to disk even if delayed durability is enabled, explicitly reads the inserted row, and then rolls back the transaction, to avoid accumulating unnecessary synthetic transaction data in the database. The database is available if the stored procedure completes successfully, and returns a single row with the value 1 in the single column.

Note that an execution of sp_flush_log procedure is scoped to the entire database. Executing this stored procedure will flush log buffers for all sessions that are currently writing to the database and have uncommitted transactions, or are running with delayed durability enabled and have committed transactions not yet flushed to storage. The assumption here is that the availability check is executed relatively infrequently, e.g. every 30-60 seconds, therefore the potential performance impact from an occasional extra log flush is minimal.

This ends up being much more useful than a simple “is the service on?” heartbeat, as it shows that the database is available, not just that the engine is running.

Comments closed

Closing SSMS Tabs

Andy Mallon shows how to close a tab in SSMS and change the shortcut value:

CTRL + F4

Huzzah! This will close your query tab! Easy as pie. Also, note that CTRL+F4 is a pretty universal shortcut that works to close tabs/windows in other applications, too–including your favorite web browser. If you can switch your muscle memory away from CTRL+W to CTRL+F4, you can use that shortcut pretty much everywhere.

Click through for more information on changing shortcuts.

Comments closed

Combining Files In C#

Chris Koester shows how to combine a set of CSVs without duplicating their header rows:

The timings in this post came from combining 8 csv files with 13 columns and a combined total of 9.2 million rows.

I first tried combining the files with the PowerShell technique described here. It was painfully slow and took an hour and a half! This is likely because it is deserializing and then serializing every bit of data in the files, which adds a lot of unnecessary overhead.

Next I tried the C# script below using LINQPad. When reading from and writing to a network share, it took 3 minutes and 56 seconds. Much better! Next I tried it on a local SSD drive and it took just 44 seconds.

Read on for the script itself.  The ReadAllLines method works fine as long as the file isn’t larger than your working memory.

Comments closed

Securing MapR

Mitesh Shah provides some high-level information on how to secure a MapR cluster:

  • Security Best Practice #2:  Require Authentication for All Services.  While it’s important for ports to be accessible exclusively from the network segment(s) that require access, you need to go a step further to ensure that only specific users are authorized to access the services running on these ports.  All MapR services — regardless of their accessibility — should require authentication.  A good way to enforce this for MapR platform components is by turning on security.  Note that MapR is the only big data platform that allows for username/password-based authentication with the user registry of your choice, obviating the need for Kerberos and all the complexities that Kerberos brings (e.g., setting up and managing a KDC). MapR supports Kerberos, too, so environments that already have it running can use it with MapR if preferred.

There’s nothing here which is absolutely groundbreaking, but they are good practices.

Comments closed

Powershell On Windows 10 Bash

Max Trinidad goes all the way with installing an Ubuntu environment on Windows:

It’s a known fact, if you install PowerShell Open Source in Windows 10 Bash subsystem, that it won’t work correctly. As soon as start typing $PSVersionTable and press enter, the cursor goes to the top of the screen. And, you keep typing and it gets very ugly.

Now, what if I tell you, I found the way to run PowerShell Open Source without any of these issues. Just like running it like it was installed in a Linux environment. No issues with the cursor going crazy and able to page up and down.

There are quite a few steps here, but Max lays them out clearly.

Comments closed

Sampling Data Lake Data

Alex Whittles shows how to use U-SQL to sample data to read in Power BI:

The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

Read on for a couple of sampling methods.

Comments closed

TempDB And TDE

Bob Ward troubleshoots an oddity around sys.databases marking tempdb as encrypted even when no user databases are encrypted:

In my test I did not hit the breakpoint. And furthermore, you will notice that when you query sys.dm_database_encryption_keys, there is no row for tempdb at all.  So our debugger breakpoint proves that tempdb is not permanently encrypted. Instead, if ALL user databases have TDE disabled and you restart SQL Server, tempdb is no longer encrypted. So instead of using sys.databases, use sys.dm_database_encryption_keys to tell which databases are truly enabled for encryption. I then verified my findings in the source code. Basically, we only enable encryption for tempdb if 1) ALTER DATABASE enables any user db for TDE 2) When we startup a user database and have encryption enabled. I also verified the behavior with my colleagues in the Tiger Team (thank you Ravinder Vuppula). We will look at fixing the issue with sys.databases in the future (ironically as I said earlier it was never enabled for tempdb before SQL Server 2016).

Read on for Bob Ward’s patented Debugger Fun.  My takeaway from this is that sys.dm_database_encryption_keys is valid, whereas sys.databases’s is_encrypted column might not be.

Comments closed

Comparing Column Names

Jen McCown has a script to compare column names between tables to find case inconsistencies:

I’m reviewing the code for the upcoming Minion CheckDB, and one of the things we’re checking for is case consistency in column names. For example, if Table1 has a column named Col1, and Table2 has COL1, well that’s no good.

But, how do we easily find those mismatches on a system that’s not case sensitive? Easy: collations.

Click through for the script.

Comments closed