Author: Kevin Feasel

Using Date Types In Warehouses

Published 2018-02-14 by Kevin Feasel

Koen Verbeeck argues that date keys in warehouses should be actual date types:

The worst are by far the string representation, as there is no actual check on the contents. It can literally contain everything. And is ’01/02/2018′ the first of February 2018 (like any sane person would read, because days come before months), or the 2nd of January? So if you have to store dates in your data warehouse, avoid strings at all costs. No excuses.

The integer representation – e.g. 20171208 – is really popular. If I recall Kimball correctly, he said it’s the one exception where you can use smart keys, aka surrogate keys that have a meaning embedded into them. I used them for quite some time, but I believe I have found a better alternative: using the actual date data type.

I bounce back and forth, but I’m sympathetic to Koen’s argument, which you can read by clicking through.

Comments closed

An Introduction To Splunk

Published 2018-02-14 by Kevin Feasel

Victoria Holt has some basics on Splunk:

Splunk, a software platform, has the capability to leverage machine data for data management and analytics. It can be used for

Data driven decision making

Alerts for network security threats

Report on system failures

Analyse and improve functionality

It enables performance analysis, dashboard creation, monitoring, troubleshooting and investigation of the real-time data collected. A Edureka learning video showed the Splunk components.

Advanced Splunk queries are still a bit like magic to me, but this is a very powerful service once you get a handle on how it works.

Comments closed

Appropriate Data Types And Unicode

Published 2018-02-14 by Kevin Feasel

Raul Gonzalez on (in)appropriate use of National character strings:

Yes, you have read it… I see dates stored as NVARCHAR(10) and NCHAR(10) on daily basis, please don’t ask me why.

This case is even worse, because DATE takes 3 bytes where NCHAR(10) takes 20 bytes, yes Ladies and Gentlemen more than 6 times more space to store the same data.

But wait! how can you be certain that those ten characters are actually a valid date? You can’t, unless you reinvent the wheel and validate that those dates are obviously valid dates and pay the performance penalty of doing it.

You’d think that picking the right data type for something would be fairly easy and then you find a table with a few dozen NVARCHAR(MAX) columns.

Comments closed

Database Mirroring And SQL Server Editions

Published 2018-02-14 by Kevin Feasel

Bob Pusateri points out something important about database mirroring:

A few weeks ago I had a reminder of one of the finer points of the requirements for mirroring: mirrored servers need to be running not only the same version of SQL Server, but the same edition as well.

I hadn’t thought of this in a while, but it makes sense. Asynchronous Database Mirroring (also known as “High-Performance Mode”) is only available in Enterprise Edition, while Standard Edition only supports “High Safety Mode”, which is synchronous. If the primary server in a mirroring topology was Enterprise Edition, but the mirror server was Standard Edition, how would that work?

Also check Glenn Berry’s comment for a trick which stretches mirroring’s capabilities for upgrades.

Comments closed

Investigating The gcForest Algorithm

Published 2018-02-13 by Kevin Feasel

William Vorhies describes a new algorithm with strong potential:

gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests. In this case, typically two of each for a total of four in each cascade layer.

Image and text problems are categorized as ‘feature learning’ or ‘representation learning’ problems where features are neither predefined nor engineered as in traditional ML problems. And the basic rule in these feature discovery problems is to go deep, using multiple layers each of which learns relevant features of the data in order to classify them. Hence the multi-layer structure so familiar with DNNs is retained.

By using both random forests and completely-random tree forests the authors gain the advantage of diversity. Each forest contains 500 completely random trees allowed to split until each leaf node contains only the same class of instances making the growth self-limiting and adaptive, unlike the fixed and large depth required by DNNs.

The estimated class distribution forms a class vector which is then concatenated with the original feature vector to be the input of the next level cascade. Not dissimilar from CNNs.

The final model is a cascade of cascade forests. The final prediction is obtained by aggregating the class vectors and selecting the class with the highest maximum score.

Click through for how it fares on the normal sample data sets like MNIST.

Comments closed

Simple Data Transformation Tricks In R

Published 2018-02-13 by Kevin Feasel

Abdul Majed Raja has a few tidyverse-friendly data transformation tips:

Splitting a column to many columns is a cliched Data Transformation case that’s hardly unseen while performing Data Transformation. While it’s straightforward to do this in Microsoft Excel, it’s slightly tricky using Data analytics languages. That is true until this function separate() from tidyr came.

These are small but helpful tips.

Comments closed

SQLSharp 4.1 Released

Published 2018-02-13 by Kevin Feasel

Solomon Rutzky announces a new release of SQL#:

GENERAL
- Greatly reduced size (by approx. 310 kb) of main SQL# Assembly by moving LookUp category into its own Assembly: SQL#.LookUps. This will improve initial load times and won’t waste much memory when not using the LookUp functions.
Installation Script
- Account for security changes related to SQL Server 2017 (i.e. “CLR strict security“) using a Certificate (flexible, clean) instead of the new “Trusted Assemblies” (inflexible, messy).
Networking
- Added explicit support for TLS 1.1 and TLS 1.2 protocols
- Increased default “Connection Limit” for URIs to 20 from the .NET default of 2. This will reduce performance bottlenecks from concurrent access to the same URI.

SQL# is also the first place where I found a decent median function for SQL Server.

Comments closed

Tracking DBCC MEMORYSTATUS Over Time

Published 2018-02-13 by Kevin Feasel

Andy Galbraith has a script to track the outputs of DBCC MEMORYSTATUS over a time period:

Running this statement interactively doesn’t return any data – it just loads the data into DBADatabase.dbo.DBCCMemoryStatus. Running the commented-out SELECT at the bottom of the script as written will query that table for all rows of counter VM Reserved (virtual memory reserved) but there is much more data than that available if you modify the SELECT.

This query can be dropped into a SQL Agent job step as is and it will run – just like the interactive run it will create the database and permanent table if they don’t exist and then store those nuggets of data into the permanent table for later use – you never know when you may need them!

Click through for the script.

Comments closed

Understanding Row Goals

Published 2018-02-13 by Kevin Feasel

Paul White has an excellent article on row goals:

In my experience, it is very common for people to miss the impact of row goals in execution plans.

No wonder, since before SQL Server 2017 CU3, there was no documented way to see information about row goals in execution plans! That update includes:

Optimizer row goal information in query execution plan added in SQL Server 2017 CU3

This enhancement adds the EstimateRowsWithoutRowGoal attribute to each plan operator affected by a row goal.

The new attribute is visible in all the usual places (e.g. DMVs, plan cache) but not yet in SQL Server Management Studio graphical plans (up to and including SSMS version 17.4). The SQL Server does send the new attribute, but SSMS graphical plans strip bits out that do not match its local version of the xml plan schema.

Version 17.5 of SSMS is expected to ship with an updated xml schema and UI elements, making the new attribute visible. SentryOne Plan Explorer shows the new attribute in the raw xml view already, since it doesn’t have any reason to strip the information out, but an update will be required to incorporate the new row goal information into other places (such as the Plan Diagram). I would expect it to make its way onto tooltips first.

Definitely worth a careful read.

Comments closed

Getting SQL Server CPU Usage By Session

Published 2018-02-13 by Kevin Feasel

Manu Punna has a script to get CPU utilization by session over a relatively short timeframe:

Troubleshooting high CPU usage on a SQL Server Database is an art, but there is a defined methodology to follow to find the root cause of high CPU. This can involve breaking down the overall server CPU usage to a more granular level, first discovering that it’s SQL Server that’s the problem (because way too often it’s something else!), down to exploring specific plan operators in a particular problematic query. Finding that problematic query, identifying the high CPU consumer, means identifying the CPU usage by session.

sys.dm_exec_requests shows the CPU time, but it’s cumulative – it doesn’t give the CPU consumption by each session at the current time. You can see how much CPU usage a session has had since it started, but it doesn’t show you what’s going on right now. To explore that, we need to query sys.dm_exec_requests repeatedly, and look for the differences. We need to collect the CPU usage for a time interval to identify the high CPU consumers.

Click through for the script.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31