
Author: Kevin Feasel

Switching In Identity Columns

Kenneth Fisher shows a way of working around the difficulty of adding an identity column to an existing table:

A friend had an interesting problem today. A really big table (multiple millions of rows) and no primary key. He then ran into an issue where he had to have one. The easiest thing is to create a new int column as an identity column, right? Unfortunately, in this case, because of the size of the table, the log growth on adding an identity column was too much. So what to do?

Well, it would be nice if we could add an int column, populate it in chunks, then make it an identity column. Unfortunately, you can’t add identity to an existing column.

Read on for the answer.
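The post title hints at the trick: ALTER TABLE … SWITCH does not require the identity property to match between source and target. As a rough sketch of how that approach can look (hypothetical object names; Kenneth’s post has the actual details and caveats):

-- 1. Add a plain, nullable int column to the existing table.
ALTER TABLE dbo.BigTable ADD Id INT NULL;

-- 2. Backfill it in batches so no single transaction bloats the log.
DECLARE @MaxId INT = 0;
DECLARE @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    ;WITH NextBatch AS
    (
        SELECT TOP (50000)
               Id,
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
        FROM dbo.BigTable
        WHERE Id IS NULL
    )
    UPDATE NextBatch
    SET Id = @MaxId + rn;

    SET @Rows = @@ROWCOUNT;
    SET @MaxId += @Rows;
END;

-- Once every row is populated, make the column NOT NULL so it matches the target.
ALTER TABLE dbo.BigTable ALTER COLUMN Id INT NOT NULL;

-- 3. Build an empty table with the same structure (same columns, order, types,
--    nullability, and filegroup), except Id is declared as an identity column.
CREATE TABLE dbo.BigTable_Staging
(
    SomeColumn VARCHAR(100) NULL,    -- the original columns, in the same order
    Id INT IDENTITY(1, 1) NOT NULL   -- only the IDENTITY property differs
);

-- 4. SWITCH is a metadata-only operation, so it is effectively instant and barely logged.
ALTER TABLE dbo.BigTable SWITCH TO dbo.BigTable_Staging;

-- 5. Fix the seed so new inserts continue past the current maximum, then swap names.
DBCC CHECKIDENT ('dbo.BigTable_Staging');
EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';
EXEC sp_rename 'dbo.BigTable_Staging', 'BigTable';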


Static Site Generation With Hugo

Steph Locke explains how to build a simple site using Hugo:

This site uses Hugo. Hugo is a “static site generator,” which means you write a bunch of markdown and it generates HTML. This is great for building simple sites like company leafletware or blogs.

You can get Hugo across platforms, and on Windows it’s just an executable you can put in your Program Files. You can then work with it like git at the command line.

Read on for a step-by-step process to get started.  Steph also links to blogdown, which is an interesting R-friendly extension.


Using Buffer Pool Extension

William Wolf talks about Buffer Pool Extension:

With this feature, SQL Server will extend the Buffer Pool Cache to non-volatile (SSD) storage. This will alleviate the I/O contention of mechanical disks by augmenting memory. The BPE uses the SSD as a memory extension rather than as disk. This feature can be used with Standard and Enterprise Editions, but would provide noticeable benefits for Standard Edition. According to Books Online, the BPE size can be up to 32 times (Enterprise) or 4 times (Standard Edition) the value of max_server_memory, but the recommended ratio is 1:16 or less.

By utilizing this option, we can alleviate some memory pressure. Demonstrating this was a little difficult for me at first. My laptop, like most newer laptops, has an SSD, so I plugged in a SATA hard drive externally and moved my database there for testing. If the database files are already on SSD, adding BPE may not give much benefit, as the memory from BPE would write to SSD as well.

Buffer Pool Extension did end up in the Hall of Shame, but scenarios like the one Wolf describes do exist, and in those scenarios, BPE could be a viable third-best option.
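For reference, the feature is a single server-level setting; the file path and size here are hypothetical, and the size cap is relative to max server memory as noted above:

-- Enable BPE, pointing it at an SSD-backed path.
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'S:\BPE\BufferPoolExtension.bpe', SIZE = 32 GB);

-- Check its current state and size.
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;

-- And to turn it back off:
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION OFF;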


A Filesystem For DMVs

Anthony Nocentino shows how Microsoft is embracing the Linux style by creating a view of DMVs as a filesystem:

Something isn’t right…as DBAs we think of things in rows and columns. So we’re going to count across the top and think the 7th column is going to yield the 7th column and its data for each row, right? Well, it will, but data processed by awk is whitespace-delimited by default and is processed row by row. So the 7th column in the second line isn’t the same as the output in the first line. This can be really frustrating if your row data has spaces in it…like, you know…dates.
So let’s fix that…the output from the DMVs via dbfs is tab-delimited. We can define our delimiter for awk with -F, which allows for whitespace in our data by breaking it only on the tabs. Let’s hope there aren’t any tabs in our data!

I’m a little surprised that these metrics don’t end up in /proc, but I imagine there’s a reason for that.


TempDB Encryption With TDE

Arun Sirpal points out an oddity in sys.databases:

If you query sys.databases, such as:

SELECT is_encrypted, name, user_access_desc FROM sys.databases WHERE database_id = 2 OR database_id = 7;

It “might” throw you off. Would you not expect to see is_encrypted set to 1 for TempDB?

I thought I remembered earlier versions of SQL Server showing is_encrypted = 1 for tempdb, and I definitely remember 2016 showing 0 even when the database is encrypted.
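One way to see what is really going on is to check sys.dm_database_encryption_keys, which reports tempdb’s actual encryption state even when sys.databases shows is_encrypted = 0:

SELECT DB_NAME(database_id) AS database_name,
       encryption_state,    -- 3 = encrypted
       key_algorithm,
       key_length,
       percent_complete
FROM sys.dm_database_encryption_keys;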


Trivial Plans And Columnstore Indexes

Erik Darling warns us that trivial plans against clustered columnstore indexes can lead to row mode execution rather than batch mode execution:

Let’s look at one query with a few variations.

SELECT COUNT(*) AS [Records], SUM(CONVERT(BIGINT, t.Amount)) AS [Total]
FROM   dbo.t1 AS t
WHERE  t.Id > 0
       AND t.Id < 3;

The plan for it is alright. It’s fairly straightforward and the query finishes in about 170ms.

We can see from the graphical execution plan that it’s been Simple Parameterized. SQL Server does this to make plan caching more efficient.
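If you want to play along at home, a minimal repro setup might look something like this (Erik’s actual table definition isn’t shown in the excerpt, so the schema and row count are guesses):

-- Hypothetical test table with a clustered columnstore index.
CREATE TABLE dbo.t1
(
    Id INT NOT NULL,
    Amount INT NOT NULL
);

INSERT dbo.t1 (Id, Amount)
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       ABS(CHECKSUM(NEWID())) % 1000
FROM sys.all_columns AS ac1
CROSS JOIN sys.all_columns AS ac2;

CREATE CLUSTERED COLUMNSTORE INDEX cci_t1 ON dbo.t1;

In the actual execution plan, the operator properties show whether each operator ran in row mode or batch mode, and the root node’s Optimization Level property shows TRIVIAL versus FULL.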

Check out the entire post.


Improving Solr Performance

Michael Sun has some tips to improve performance of Solr operations, focusing on memory tuning but including a few other tips as well:

For time series applications, it’s very common to have queries in the following pattern:

q=*:*&fq=[NOW-3DAYS TO NOW]

However, this is not a good practice from a memory perspective. Under the hood, Solr converts ‘NOW’ to a specific timestamp, which is the time when the query hits Solr. Therefore, two consecutive queries with the same filter query fq=[NOW-3DAYS TO NOW] are considered different queries once ‘NOW’ is replaced by two different timestamps. As a result, both of these queries would hit disk and can’t take advantage of caches.

In most use cases, missing the last minute of data is acceptable. Therefore, try to query in the following way if your business logic allows:

q=*:*&fq=[NOW/MIN-3DAYS TO NOW/MIN]

If you’re using Solr for full text search, this is rather useful information.


JSON In Powershell

Adam Bertram shows an easy way of dealing with JSON data inside Powershell:

As the world continually becomes “eaten by software,” more and more services are being replaced by software. IT pros have most likely seen this in the form of software-defined everything. One of the premier components of this focus on software, along with the continuing adoption of DevOps, is the application programming interface (API). All of these services need to talk to one another and must provide a way for programs and users to interact with them. This is where APIs come in handy. But what does this have to do with PowerShell and JSON, you ask?

APIs, more specifically REST APIs, return data when queried. This data is typically in JSON format. JSON is a way of structuring data that makes it easy for software to consume. For working with JSON in PowerShell, Microsoft has provided some helpful tools: the ConvertTo-Json and ConvertFrom-Json commands. These commands allow you to quickly work with REST APIs or any other service that returns or accepts JSON as input.

Read on for more details on how to use these commands.


Fraud Detection With Python

Kevin Jacobs has a walkthrough of how to use Pandas and scikit-learn to perform fraud detection against a sample set of credit card transactions:

Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field, a “Class” field, and a “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes the value 0 (when the transaction is not fraudulent) or 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is far greater than the number of fraudulent transactions (where Class equals 1). Further inspection of the Time field shows that its values are integers, starting from 0.

There is a small trick for getting more information than just the raw records. We can use the following code:

print(df.describe())

This code will give a statistical summary of all the columns. It shows, for example, that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.

The Kaggle competition data set is available, so you can follow along.


Forcing 0 Intercept Inflates R-squared In R

John Mount has an informative post on how you can trick yourself when running linear regression models in R and forcing the y intercept to be 0:

So far so good. Let’s now remove the “intercept term” by adding the “0+” to the fitting command.

m2 <- lm(y~0+x, data=d)
t(broom::glance(m2))
##                        [,1]
## r.squared      7.524811e-01
## adj.r.squared  7.474297e-01
## sigma          3.028515e-01
## statistic      1.489647e+02
## p.value        1.935559e-30
## df             2.000000e+00
## logLik        -2.143244e+01
## AIC            4.886488e+01
## BIC            5.668039e+01
## deviance       8.988464e+00
## df.residual    9.800000e+01
d$pred2 <- predict(m2, newdata = d)

Uh oh. That appeared to vastly improve the reported R-squared and the significance (“p.value”)!

Read on to learn why this happens and how you can prevent this from tricking you in the future.
