Press "Enter" to skip to content

Curated SQL Posts

NOLOCK Doesn’t Mean No Locks

Bert Wagner points out that SELECT queries with NOLOCK can still cause blocking to occur:

This is where my understanding of NOLOCK was wrong: while NOLOCK won’t lock row-level data, it will take out a schema stability lock.

A schema stability (Sch-S) lock prevents the structure of a table from changing while the query is executing. All SELECT statements, including those in the read uncommitted/NOLOCK isolation level, take out a Sch-S lock. This makes sense because we wouldn’t want to start reading data from a table and then have the column structure change halfway through the data retrieval.

However, this also means there might be some operations that get blocked by a Sch-S lock. For example, any command requesting a schema modification (Sch-M) lock gets blocked in this scenario.

Read on to see which types of commands take schema modification locks, and ways to minimize the pain.
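
If you want to see the blocking for yourself, here’s a rough sketch (the table and column names are made up): run a slow NOLOCK read in one session, attempt a schema change in a second, and the ALTER TABLE will wait on its Sch-M lock until the read finishes.

-- Session 1: a long-running NOLOCK read still holds a Sch-S lock
SELECT COUNT(*)
FROM dbo.BigTable AS b1 WITH (NOLOCK)
CROSS JOIN dbo.BigTable AS b2 WITH (NOLOCK);  -- deliberately slow

-- Session 2: the schema change needs a Sch-M lock and gets blocked
ALTER TABLE dbo.BigTable ADD NewColumn INT NULL;

-- Session 3: confirm the Sch-M request is waiting behind the Sch-S lock
SELECT request_session_id, request_mode, request_status
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT'
  AND resource_associated_entity_id = OBJECT_ID(N'dbo.BigTable');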

Comments closed

tibbletime: Time-Aware Data Sets In R

At Business Science, they’ve announced a new R package:

We are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new functionality have been added, including:

  • Generic period support: Perform time-based calculations by a number of supported periods using a new period formula.
  • Creating series: Use create_series() to quickly create a tbl_time object initialized with a regular time series.
  • Rolling calculations: Turn any function into a rolling version of itself with rollify().
  • A number of smaller tweaks and helper functions to make life easier.

As we further develop tibbletime, it is becoming clearer that the package is a tool that should be used in addition to the rest of the tidyverse. The combination of the two makes time series analysis in the tidyverse much easier to do!

Check out their demos comparing New York and San Francisco weather.  It looks like it’ll be a useful package.  H/T R-bloggers
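
As a quick taste of those functions, something like the following should work (the exact arguments may differ a bit between tibbletime versions, and the data here is simulated):

# create_series(): a tbl_time object with a regular daily index for 2017
library(tibbletime)
library(dplyr)

series <- create_series(~ '2017', 'daily')

# rollify(): turn mean() into a rolling 7-day average
rolling_mean <- rollify(mean, window = 7)

series %>%
  mutate(value = rnorm(n()),
         value_7day = rolling_mean(value))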

Comments closed

ANOVA

Mala Mahadevan explains what ANOVA is and why it’s interesting:

ANOVA, or analysis of variance, is a term given to a set of statistical models used to analyze differences among groups and determine whether those differences are statistically significant enough to draw any conclusions. The models were developed by the statistician and evolutionary biologist Ronald Fisher. To give a very simplistic definition, ANOVA is an extension of the two-sample t-test to multiple groups.

ANOVA is an older test and a fairly simple process, but is quite useful to understand.
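
If you want to try it yourself, a one-way ANOVA in R takes only a couple of lines; here’s a minimal sketch using a built-in dataset rather than anything from Mala’s post:

# PlantGrowth: plant weights under a control and two treatment conditions
data(PlantGrowth)

fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)  # the F statistic and p-value indicate whether mean weight
              # differs significantly across the three groups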

Comments closed

Updating A Collection Of Variables In Powershell

Klaas Vandenberghe shows how to update multiple variables in the same way using Powershell:

We turn to two cmdlets: Get-Variable and Set-Variable. They may seem redundant, as we get and set variables all the time without them; maybe you have never even heard of these two cmdlets. Well, here they prove their usefulness: we will set the variables using their names, and use the current value as a starting point. An important rule to remember is this:

$ is a token to indicate we’re dealing with a variable, but it is not part of the variable name!

It’s worth reading the whole thing.
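
The core idea looks something like this (the variable names are made up for illustration):

# Bump a whole set of counters by name, starting from their current values
$ServerCount   = 3
$DatabaseCount = 12
$JobCount      = 7

foreach ($name in 'ServerCount', 'DatabaseCount', 'JobCount') {
    # note: the name is passed without the $ token
    $current = (Get-Variable -Name $name).Value
    Set-Variable -Name $name -Value ($current + 1)
}

Get-Variable -Name ServerCount, DatabaseCount, JobCount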

Comments closed

Neural Nets Optimizing For Imperfect

John Cook describes a paradox with neural nets:

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

Conceptually, this feels a little weird but isn’t really much of a problem, as we have other analogues:  rational ignorance in economics (where we knowingly choose not to know something because the benefit is not worth the opportunity cost of learning), OPTIMIZE FOR UNKNOWN with SQL Server (where we knowingly do not use the passed-in parameter because we might get stuck in a lesser path), etc.  But the specific process here is interesting.
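
For anyone unfamiliar with the OPTIMIZE FOR UNKNOWN analogy, it looks roughly like this in T-SQL (the procedure and table are made up): the optimizer builds the plan from average density statistics rather than the sniffed parameter value, deliberately ignoring information it has.

CREATE OR ALTER PROCEDURE dbo.GetOrdersByCustomer
    @CustomerID INT
AS
BEGIN
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
    OPTION (OPTIMIZE FOR (@CustomerID UNKNOWN));
END;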

Comments closed

COALESCE, ISNULL, And Data Types

Shane O’Neill shows a subtle difference between ISNULL and COALESCE:

You may be asking yourself “What the…?!”.

I asked myself the same question but then I thought of a better idea and asked my Senior the question instead (he usually has better answers than I do), who proceeded to tell me that the final test would only work exactly at midnight!

….so I repeated my question to him.

It’s an interesting read, and not something you’d commonly think about.
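
Without giving away Shane’s example, a generic way to see the data type difference is below: ISNULL takes its return type from the first argument, so the replacement value gets truncated, while COALESCE follows data type precedence and keeps the full string.

DECLARE @ShortString VARCHAR(5) = NULL;

SELECT
    ISNULL(@ShortString, 'A longer replacement')   AS IsNullResult,    -- 'A lon'
    COALESCE(@ShortString, 'A longer replacement') AS CoalesceResult;  -- 'A longer replacement'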

Comments closed

Backup Compression And Encryption

Arun Sirpal shows the combined effects of backup encryption and backup compression in SQL Server 2017:

Do not forget about the certificate! Warning: The certificate used for encrypting the database encryption key has not been backed up. Imagine if you need to recover the backup and you can’t?  You will get the dreaded thumbprint error.

Msg 33111, Level 16, State 3, Line 25
Cannot find server certificate with thumbprint ‘0x78FAB5A2A5D593FD3C4E163C90B745F70AB51233’.
Msg 3013, Level 16, State 1, Line 25
RESTORE DATABASE is terminating abnormally.

So make sure you respect this certificate (and the key) and back it up and re-create them on the target server for a successful restore.

In SQL Server 2016 and 2017, there’s no reason not to encrypt backups; the marginal cost is practically nil even if you’re low enough on disk space that you need to do backup compression.
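
The moving parts look roughly like this (the database name, file paths, and password are all placeholders):

-- Back up the certificate and its private key before you ever need them
USE master;
BACKUP CERTIFICATE BackupCert
    TO FILE = 'D:\Keys\BackupCert.cer'
    WITH PRIVATE KEY (
        FILE = 'D:\Keys\BackupCert.pvk',
        ENCRYPTION BY PASSWORD = '<StrongPasswordHere>');

-- Compression and encryption combine in a single backup statement
BACKUP DATABASE SomeDatabase
    TO DISK = 'D:\Backups\SomeDatabase.bak'
    WITH COMPRESSION,
         ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);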

Comments closed

Use Fixed Filegrowth Settings

Andy Galbraith notes that you should use fixed-increment filegrowth settings for log and data files:

As you probably already know, the key flaw to percentage-based FILEGROWTH is that over time the increment grows larger and larger, causing the actual growth itself to take longer and longer.  This is especially an issue with LOG files because they have to be zero-initialized before they can be used, causing excessive I/O and file contention while the growth is in progress.  Paul Randal (blog/@PaulRandal) describes why this is the case in this blog post.  (If you ever get a chance to see it, Paul also does a fun demo in some of his classes and talks on why zero initialization is important, using a hex editor to read the underlying contents of disk even after the old data is “deleted”)

Andy also has a script to change filegrowth to fixed-increment growth depending upon the size of the file, so check that out.
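
As a starting point, finding percentage-growth files and switching one to a fixed increment looks like this (the database and logical file names are placeholders, and the right increment depends on your environment):

-- Find files still set to percentage-based growth
SELECT DB_NAME(database_id) AS DatabaseName,
       name AS LogicalFileName,
       growth,              -- a percentage when is_percent_growth = 1
       is_percent_growth
FROM sys.master_files
WHERE is_percent_growth = 1;

-- Switch a log file to a fixed 512 MB increment
ALTER DATABASE SomeDatabase
    MODIFY FILE (NAME = SomeDatabase_log, FILEGROWTH = 512MB);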

Comments closed

Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalanced data can be seen in banking or financial data, where the majority of credit card uses are acceptable and very few credit card uses are fraudulent.

With an imbalanced dataset, the information required to make an accurate prediction about the minority class cannot be obtained using an algorithm. So, it is recommended to use a balanced classification dataset.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnosis is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.
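
One of the simpler remedies is resampling; here’s a minimal sketch in base R that down-samples the majority class, using simulated data rather than the fraud dataset from the post:

set.seed(42)
n <- 10000
txns <- data.frame(
  amount   = rnorm(n, mean = 50, sd = 20),
  is_fraud = factor(c(rep('no', 9900), rep('yes', 100)))
)
table(txns$is_fraud)   # 9900 'no' vs. 100 'yes'

# Down-sample the majority class to match the minority class
minority <- txns[txns$is_fraud == 'yes', ]
majority <- txns[txns$is_fraud == 'no', ]
balanced <- rbind(minority,
                  majority[sample(nrow(majority), nrow(minority)), ])
table(balanced$is_fraud)  # 100 of each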

Comments closed