Author: Kevin Feasel

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data. But the truth is that, in a real-world production scenario, we will never have truly clean data. Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results. Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data: sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons. Each of these can negatively impact your data, leaving you with invalid entries.

Read on for a few more adages which shape the way we work on projects, followed by an overview of the Microsoft Team Data Science Process.

Comments closed

Enabling Optimizer Fixes In SQL Server

Published 2018-02-23 by Kevin Feasel

Monica Rathbun explains that just upgrading a SQL Server database doesn’t enable optimizer changes:

When applying a new SQL Server cumulative update, hot fix, or upgrade SQL Server doesn’t always apply all the fixes in the patch. When you upgrade the database engine in-place, databases you had already stay at their pre-upgrade compatibility level, which means they run under the older set of optimizer rules. Additionally, many optimizer fixes are not turned on. The reason for this is that while they may improve overall query performance, they may have negative impact to some queries. Microsoft actively avoids making breaking changes to its software.

To avoid any negative performance impacts, Microsoft has hidden optimizer fixes behind a trace flag, giving admins the option to enable or disable the updated fixes. To take advantage of optimizer fixes or improvements you would have enable trace flag 4199 after applying each hot fix or update or set it up as a startup parameter. Did you know this? This was something I learned while working with an existing system, years into my career. I honestly assumed it would just apply any applicable changes that were in the patch to my system. Trace flag 4199 was introduced in the SQL Server 2005-era. In SQL Server 2014, when Microsoft made changes to the cardinality estimator they protected the changes with trace flags as well, giving you the option to run under compatibility level 120 and not have the cardinality estimator changes in effect.

Things changed starting with SQL Server 2016.

Click through to see how SQL Server 2016 made it a bit easier.

Comments closed

Log Shipping Tests With dbachecks

Published 2018-02-23 by Kevin Feasel

Sander Stad has a bonus post in his log shipping series:

We want everyone to know about this module. Chrissy LeMaire reached out to me and asked if I could write some tests for the log shipping part and I did.

Because I wrote the log shipping commands for dbatools I was excited about creating a test that could be implemented into this module for everyone to use.

That test is also quite easy to use, as Sander demonstrates.

Comments closed

Row Goals And Semi Joins

Published 2018-02-23 by Kevin Feasel

Paul White continues his row goals series:

The remaining physical join type is nested loops, which comes in two flavours: regular (uncorrelated) nested loops and apply nested loops (sometimes also referred to as a correlated or lateral join).

Regular nested loops join is similar to hash and merge join in that the join predicate is evaluated at the join. As before, this means there is no value in setting a row goal on either input. The left (upper) input will always be fully consumed eventually, and the inner input has no way to determine which row(s) should be prioritized, since we cannot know if a row will join or not until the predicate is tested at the join.

By contrast, an apply nested loops join has one or more outer references (correlated parameters) at the join, with the join predicate pushed down the inner (lower) side of the join. This creates an opportunity for the useful application of a row goal. Recall that a semi join only requires us to check for the existence of a row on join input B that matches the current row on join input A (thinking just about nested loops join strategies now).

In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).

This has the depth and quality that you naturally expect from Paul, making it an immediate read.

Comments closed

Digging Into dbachecks

Published 2018-02-23 by Kevin Feasel

Rob Sewell walks us through dbachecks configuration:

So I can import this configuration and run my checks with it any time I like. This means that I can create many different test configurations for my many different environment or estate configurations.

Yes, I know “good/best practice” says we should use the same configuration for all of our instances but we know that isn’t true. We have instances that were set up 15 years ago that are still in production. We have instances from the companies our organisation has bought over the years that were set up by system administrators. We have instances that were set up by shadow IT and now we have to support but cant change.

As well as those though, we also have different environments. Our development or test environment will have different requirements to our production environments.

This is a good, code-filled demonstration with a few videos as well. If you want to get started with dbachecks, this is a good place to begin.

Comments closed

Loading Data From Excel With Biml

Published 2018-02-23 by Kevin Feasel

Ben Weissman loads an Excel file with Biml:

Did you know, that you could call GetDatabaseSchema on Excel files? You can!

Just define an ExcelConnection first:

01_Environment.biml

1

2

3

4

5

6

7

8

9

10

11

12

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

    <Connections>

        <ExcelConnection Name="MyExcel" ConnectionString="Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Flatfiles\XLS\MyExcel.xlsx;Extended Properties="Excel 12.0 XML;HDR=YES";" />

        <OleDbConnection Name="Target" ConnectionString="Data Source=localhost;initial catalog=MySimpleBiml_Destination;provider=SQLNCLI11;integrated security=SSPI"></OleDbConnection>

    </Connections>

    <Databases>

        <Database Name="MySimpleBiml_Destination" ConnectionName="Target"></Database>

    </Databases>

    <Schemas>

        <Schema Name="dbo" DatabaseName="MySimpleBiml_Destination"></Schema>

    </Schemas>

</Biml>

You can call GetDatabaseSchema on that connection and loop through the tables just like any regular database.

01_Environment.biml
1 2 3 4 5 6 7 8 9 10 11 12	`<Biml` `xmlns="http://schemas.varigence.com/biml.xsd">` `<Connections>` `<ExcelConnection` `Name="MyExcel"` `ConnectionString="Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Flatfiles\XLS\MyExcel.xlsx;Extended Properties="Excel 12.0 XML;HDR=YES";"` `/>` `<OleDbConnection` `Name="Target"` `ConnectionString="Data Source=localhost;initial catalog=MySimpleBiml_Destination;provider=SQLNCLI11;integrated security=SSPI"></OleDbConnection>` `</Connections>` `<Databases>` `<Database` `Name="MySimpleBiml_Destination"` `ConnectionName="Target"></Database>` `</Databases>` `<Schemas>` `<Schema` `Name="dbo"` `DatabaseName="MySimpleBiml_Destination"></Schema>` `</Schemas>` `</Biml>`

Click through to see what to do with this connection.

Comments closed

Gotchas When Indexing Partitioned Tables

Published 2018-02-23 by Kevin Feasel

Andrew Pruski gives us a couple of considerations when creating indexes on partitioned tables in SQL Server:

Looking at that data page, we can see that SQL has added a UNIQUIFIER column. Now this is standard SQL behaviour, SQL does this to all non-unique clustered indexes whether they are on a partitioned table or not.

But also look at the CreatedDate column. It’s after the ID column on the page. If this was a non-partitioned table, we would see that after ColA & ColB (the order the columns are on the table). This has happened because SQL has implicitly added the partitioning key into the index definition, which has changed the physical order of the data on the page.

Read the whole thing.

Comments closed

Web Analytics With R

Published 2018-02-22 by Kevin Feasel

Maelle Salmon performs some analysis on the Locke Data blog:

Often, the URL of a blog post can be guessed based on its title, e.g. this one can be read here. But even if the transition from the Markdown file information to an URL is logical, it was best to get URLs from the in situ blog posts, and then join them to the blog post information collected previously, since some special characters got special treatment that I could not fully understand by looking at blogdown source code.

I first extracted all posts URLs from the website map.

Check it out.

Comments closed

Using stringr To Remove HTML

Published 2018-02-22 by Kevin Feasel

I have a quick post on removing HTML markup with stringr:

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer. I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.

Click through for the script. If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.

Comments closed

The DATETIME Type In SQL Server

Published 2018-02-22 by Kevin Feasel

Randolph West gets into the DATETIME data type:

DATETIME is an eight-byte datatype which stores both a date and time in one column, with an accuracy of three milliseconds. As we’ll see though, the distribution of this granularity may not be exactly what we’d expect.

Valid DATETIME values are January 1, 1753 00:00:00.000, through December 31, 9999 23:59:59.997. On older databases designed prior to SQL Server 2008, because there was no explicit support for date values, it was sometimes customary to leave off the time portion of a DATETIME value, and have it default to midnight on that morning. So for example today would be stored as February 21, 2018 00:00:00.000.

If you’re not particularly familiar with SQL Server data types, this is detailed enough information to get you going and to explain exactly why you shouldn’t use DATETIME anymore…

Comments closed