Press "Enter" to skip to content

Month: May 2017

Log Info DMF

Andrew Pruski looks at the new sys.dm_db_log_info dynamic management function in SQL Server 2017:

This new DMF is intended to replace the (not so) undocumented DBCC LOGINFO command. I say undocumented as I’ve seen tonnes of blog posts about it but never an official Microsoft page.

This is great imho, we should all be analyzing our database’s transaction log and this will help us do just that. Now there are other undocumented functions that allow us to review the log (fn_dblog and fn_dump_dblog, be careful with the last one).

I’m glad to see this make the cut, as replacing informative / idempotent DBCC commands with DMVs makes it easier to query this data, particularly in conjunction with other DMVs or system tables.
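
As a rough illustration, here’s the sort of query this makes easy: a sketch against SQL Server 2017 that joins the DMF to sys.databases for a per-database VLF summary (column names such as vlf_size_mb and vlf_active are taken from the documentation).

-- Summarize virtual log files (VLFs) per database using sys.dm_db_log_info (SQL Server 2017+).
SELECT
    DB_NAME(li.database_id) AS database_name,
    COUNT(*) AS vlf_count,
    SUM(li.vlf_size_mb) AS total_log_size_mb,
    SUM(CASE WHEN li.vlf_active = 1 THEN 1 ELSE 0 END) AS active_vlf_count
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
GROUP BY li.database_id
ORDER BY vlf_count DESC;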


Stop Profiling

Wayne Sheffield concludes his two-part series on converting a trace to use Extended Events:

This query extracts the data from the XE file target. It also calculates the end date and displays both the internal and user-friendly names for the resource_type and mode columns – the internal values are what the trace was returning.

For a quick recap: in Part 1, you learned how to convert an existing trace into an XE session, identifying the columns and stepping through the GUI for creating the XE session. In Part 2, you learned how to query both the trace and XE file target outputs, and then compared the two outputs and learned that all of the data in the trace output is available in the XE output.
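
For reference, reading an XE file target generally takes the shape below. This is a bare-bones sketch rather than Wayne’s actual script, and the session file name is made up:

-- Read events from an Extended Events file target and shred the XML payload.
-- The .xel file pattern below is hypothetical; point it at your session's output files.
SELECT
    x.event_xml.value('(event/@name)[1]', 'sysname') AS event_name,
    x.event_xml.value('(event/@timestamp)[1]', 'datetime2') AS event_time_utc,
    x.event_xml.value('(event/data[@name="duration"]/value)[1]', 'bigint') AS duration
FROM sys.fn_xe_file_target_read_file(N'ConvertedTrace*.xel', NULL, NULL, NULL) AS f
CROSS APPLY (SELECT CAST(f.event_data AS xml) AS event_xml) AS x;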

Read the whole thing.


Why Restores Can Be Slow

Paul Randal explains why a database restoration tends to be slower than backing that database up:

Here’s a list of things you can do to make restoring a full backup go faster:

  • Ensure that instant file initialization is enabled on the SQL Server instance performing the restore operation, to avoid spending time zero-initializing any data files that must be created. This can save hours of downtime for very large data files.

  • If possible, restore over the existing database – don’t delete the existing files. This avoids having to create and potentially zero initialize the files completely, especially the log file. Be very careful when considering this step, as the existing database will be irretrievably destroyed once the restore starts to overwrite it.

  • Consider backup compression, which can speed up both backup and restore operations, and save disk space and storage costs.

It’s a straightforward explanation, and Paul provides a few more tips for speeding up restorations.
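
To make the first two bullets concrete, here’s a hedged sketch (the database name and backup path are hypothetical): check whether instant file initialization is enabled via sys.dm_server_services, which in newer builds exposes an instant_file_initialization_enabled column, and then restore over the existing database with progress reporting.

-- Check instant file initialization (column available in newer SQL Server builds).
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;

-- Restore over the existing database. WITH REPLACE destroys the current files,
-- so be certain this is what you want before running it.
RESTORE DATABASE SalesDB
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH REPLACE,
     STATS = 10;  -- progress message every 10 percent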


Getting A Date Is Hard

Nate Johnson has a fun rant about datetime ranges in SQL Server and date pickers:

I mean, I’m not that old, but spinning thru a few decades is still slower than just typing 4 digits on my keyboard — especially if your input-box is smart enough to flip my keyboard into “numeric only” mode.

Another seemingly popular date-picker UX is the “calendar control”.  Oh gawd.  It’s horrible!  Clicking thru pages and pages of months to find and click (tap?) on an itty bitty day box, only to realize “Oh crap, that was the wrong year… ok let me go back.. click, click, tap..” ad-nauseum.

Food for thought.


Multiple R Studio Users On HDInsight

Xiaoyong Zhu shows how to set up additional R Studio users in an HDInsight cluster:

Basically speaking, the “http user” will be used to authenticate through the HDInsight gateway, which is used to protect the HDInsight clusters you created. This user is used to access the Ambari UI, YARN UI, as well as many other UI components.

The “ssh user” will be used to access the cluster through secure shell. This user is actually a user in the Linux system in all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access the remote clusters.

For a Microsoft R Server on HDInsight cluster, it’s a bit more complex, because we put RStudio Server Community version in HDInsight, which only accepts a Linux user name and password as the login mechanism (it does not support passing tokens). So if you have created a new cluster and want to use RStudio, you need to first log in through the HDInsight gateway using the http user’s credentials, and then use the ssh user’s credentials to log in to RStudio.

It’s a good read and also includes a sample Spark-R job.


SQL Server Query Options

John Morehouse shows how to set your query options to be the same as that problematic application which is generating nasty queries:

Have you ever executed a query in SQL Server Management Studio, looked at the execution plan, and noticed that it was a different plan than what was generated on the server?

A potential reason for this could be different option settings.  The options represent the SET values of the current session.  SET options can affect how the query is executed, thus producing a different execution plan.  You can find these options in two places within SSMS under Tools -> Options -> Query Execution -> SQL Server -> Advanced.

Click through for lots of information including a script John provided to see which options are currently on.
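
For a quick look at your own session without leaving the query window, you can also decode the @@OPTIONS bitmask. Below is a small sketch using the documented bit positions for a few of the settings that most often change plan shape, plus the equivalent columns in sys.dm_exec_sessions for inspecting other sessions:

-- Decode a few session SET options from the @@OPTIONS bitmask.
SELECT
    @@OPTIONS AS options_bitmask,
    CASE WHEN @@OPTIONS & 2   = 2   THEN 'ON' ELSE 'OFF' END AS implicit_transactions,
    CASE WHEN @@OPTIONS & 8   = 8   THEN 'ON' ELSE 'OFF' END AS ansi_warnings,
    CASE WHEN @@OPTIONS & 32  = 32  THEN 'ON' ELSE 'OFF' END AS ansi_nulls,
    CASE WHEN @@OPTIONS & 64  = 64  THEN 'ON' ELSE 'OFF' END AS arithabort,
    CASE WHEN @@OPTIONS & 256 = 256 THEN 'ON' ELSE 'OFF' END AS quoted_identifier;

-- sys.dm_exec_sessions exposes the same settings for other sessions,
-- handy for comparing against the problem application.
SELECT session_id, ansi_nulls, ansi_warnings, arithabort, quoted_identifier
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;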


R Is Bad For You?

Bill Vorhies lays out a controversial argument:

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous but when I want to build a model I reach for my SAS Enterprise Miner (could just as easily be SPSS, Rapid Miner or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.

I have almost exactly the opposite thought on the matter:  that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code.  Nevertheless, I think Bill’s post is well worth reading.


Pretty R Plots

Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier.  First, histograms:

Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.

We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (x) but cut into multiple categories:

Then he follows up with scatter plots:

Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

There are some nice tricks here around transparency, color scheme, and gradients, making it a great series.  As a quick note, the color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness.  Using a URL-based color-blindness filter like Toptal’s is quite helpful in discovering these sorts of issues.


Designing A Data Warehouse Test Plan

Koos van Strien walks through some of the high-level concepts when automating data warehouse tests:

In my current project, I’ve got a database containing everything to perform these tests:

  • Tables with identical structure to the ones in the staging area (plus two columns “TestSuiteName” and “TestName”)
  • A table containing the mapping from test-input table to target database, schema and table
  • A stored procedure to purge the DWH (all layers) in the test environment
  • A stored procedure to insert the data for a specific testsuite / name

When preparing a specific test case (the “insert rows for test case” step from the diagram above), the rows needed for that case are copied into the DWH:

Testing warehouses is certainly not a trivial exercise, but given how complex warehouse ETL tends to be, having good tests reduces the number of 3 AM pages.
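
As a purely hypothetical sketch of Koos’s “insert rows for test case” step (table and column names are invented for illustration), the copy might look something like this:

-- Copy the rows for one test case from a test-input table
-- (which carries TestSuiteName and TestName columns) into staging.
INSERT INTO stg.Customer (CustomerID, CustomerName, ValidFrom)
SELECT t.CustomerID, t.CustomerName, t.ValidFrom
FROM test.Customer AS t
WHERE t.TestSuiteName = N'CustomerLoad'
  AND t.TestName = N'LateArrivingRow';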


Seeing Statistics In Execution Plans

Pedro Lopes announces that the statistics used to compile a plan are now available as part of the execution plan details:

OptimizerStatsUsage is available in cached plans, so getting the “estimated execution plan” and the “actual execution plan” will have this information.

In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself); after closer observation, I found the statistic had been updated with NORECOMPUTE.

And looking at the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: ModificationCount is back to zero and estimations are now correct.

This will be a good addition to SQL Server 2017.
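
If you would rather hunt for these across the plan cache than open plans one at a time, something along these lines works: a sketch that shreds the OptimizerStatsUsage element out of cached plans, with attribute names following the showplan schema.

-- Pull statistics usage details out of cached plans (SQL Server 2017+ showplan XML).
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    s.stat.value('@Statistics', 'nvarchar(256)') AS statistics_name,
    s.stat.value('@Table', 'nvarchar(256)') AS table_name,
    s.stat.value('@ModificationCount', 'bigint') AS modification_count,
    s.stat.value('@SamplingPercent', 'float') AS sampling_percent,
    s.stat.value('@LastUpdate', 'datetime2') AS last_update
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('//OptimizerStatsUsage/StatisticsInfo') AS s(stat);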
