Curated SQL Posts

Database Migrations With DBATools

Jason Squires has a series on database migrations using dbatools.  The first part covers capacity planning:

Great results! I now know that I cannot move the database from one server to another on any of the drives shown without expanding the drive volume.  Even though this was an awesome output, I wanted more. I knew which data and log drives I wanted to focus on, so I decided to dump everything into an array.

The second part is the actual migration:

In this post, I am going to show you the steps that I decided to take to do this particular migration.  This migration was going to be a bit different. I am merging two systems together, so I had to really think the process through and ensure I got everything I needed. I also did not want to roll back due to the systems being legacy and unsupported. Some of the hosts were going to be pushing storage limits, so I had to ensure everything went flawlessly so I could decommission the legacy systems quickly.

Both of these have scripts and explanations attached, making it easy to follow along.
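
The heavy lifting in the series happens in dbatools and PowerShell, but the capacity question itself (how much space does each database take on a volume, and how much free space is left there?) can also be sketched in plain T-SQL. This is not Jason's script, just a rough equivalent built on sys.master_files and sys.dm_os_volume_stats:

    SELECT vs.volume_mount_point,
           DB_NAME(mf.database_id)           AS database_name,
           SUM(mf.size) * 8 / 1024           AS db_size_mb,     -- size is in 8 KB pages
           MIN(vs.available_bytes) / 1048576 AS volume_free_mb
    FROM sys.master_files AS mf
    CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
    GROUP BY vs.volume_mount_point, DB_NAME(mf.database_id)
    ORDER BY vs.volume_mount_point, db_size_mb DESC;

Jason's array-based approach in the post gives you much more flexibility; this only answers the basic will-it-fit question.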

A Columnstore Trick With No Practical Value

Joe Obbish explains a quirk of columnstore index compression:

The insert query now takes 3594 ms of CPU time and 2112 ms of elapsed time on my machine. The size of each rowgroup did not change. It’s still 2098320 bytes. Even though this is a parallel query, there’s no element of randomness in this case. In the query plan we can see that the source table was scanned in a serial zone and round robin distribution is used to distribute exactly half of the rows to each parallel thread.

This seems like a reasonable plan given that TOP forces a serial zone and we need to preserve order. It’s possible to rewrite the query to encourage a parallel scan of the source table, but that would introduce an order-preserving gather streams operator.

Click through for the full story.
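
If you want to recreate the general shape of what Joe is describing, a minimal sketch of an ordered TOP insert into a clustered columnstore table looks something like this. The table names, column names, and row count are made up; Joe's post has the actual repro.

    -- Target table with a clustered columnstore index.
    CREATE TABLE dbo.CCI_Target
    (
        ID      bigint       NOT NULL,
        Payload varchar(100) NULL,
        INDEX CCI_Target_CCI CLUSTERED COLUMNSTORE
    );

    -- TOP plus ORDER BY forces a serial zone before the parallel insert,
    -- so the ordered rows are handed out to the insert threads afterward.
    INSERT INTO dbo.CCI_Target WITH (TABLOCK) (ID, Payload)
    SELECT TOP (2097152) s.ID, s.Payload
    FROM dbo.SourceTable AS s
    ORDER BY s.ID;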

Running DBCC CHECKTABLE

Arun Sirpal covers DBCC CHECKTABLE:

If you know about DBCC CHECKDB then most likely you will know about DBCC CHECKTABLE. Quite simply, this command performs primitive system-catalog consistency checks, per-table consistency checks on the single table specified, and cross-table consistency checks on indexed views that reference the specified table. (Microsoft SQL Server 2012 Internals, Chapter 14, Page 899, Paul Randal)

There is one concept about this command that I did not know about until recently. Intrigued? Read on.

I was going to say “read on” but Arun already used that line.
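
For reference, the command itself is simple to run; the table name below is just an example.

    -- Check a single table plus the indexed views that reference it.
    DBCC CHECKTABLE (N'dbo.Orders');

    -- Common variations: quieter output, skip nonclustered indexes,
    -- or restrict the check to physical structures only.
    DBCC CHECKTABLE (N'dbo.Orders') WITH NO_INFOMSGS, ALL_ERRORMSGS;
    DBCC CHECKTABLE (N'dbo.Orders', NOINDEX);
    DBCC CHECKTABLE (N'dbo.Orders') WITH PHYSICAL_ONLY;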

An Example Of Error Message Ambiguity

Jason Brimhall has an interesting story around a common error message:

Now is the time to start pulling out your hair. There is no syntax error in the query. Go ahead and look it over 10-15 times. I know I did.

Ok, if there is no syntax error, then what could possibly be the real problem? Is the database corrupt? Maybe a system table is corrupt? Grasping at straws here, but could it possibly even be some sort of royally screwed-up permissions?

Everything seems to be checking out properly. There is no corruption whatsoever. Laptop is soon to be launched at this point right? Ok, maybe not launched because this is a simple query. But, had this been a production related query that was rather intense and complicated, there really may be something getting launched as the frustration mounts.

Click through for the answer.  Sometimes the error message is technically correct but utterly confounding.

Housing Prices In Ames, Iowa: A Kaggle Competition

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences.  First, Kathryn et al.:

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R, but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing than anything else.

Then, Aaron et al. share their process and findings:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

This is the major upside to Kaggle:  it gives you the ability to work in a controlled environment with real data sets, which include real data problems.  Yeah, the data’s much cleaner than you’d experience in production pretty much anywhere, but that lets you practice technique with a relatively low barrier to entry.  H/T R-Bloggers (Kathryn | Aaron)
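
Their work was in R and Python, but the neighborhood-median imputation Aaron's team describes maps neatly onto a window function. Here is a hedged T-SQL sketch of the same idea, with entirely made-up table and column names:

    -- Fill missing Lot Frontage values with the median for the house's neighborhood.
    WITH medians AS
    (
        SELECT DISTINCT
               Neighborhood,
               PERCENTILE_CONT(0.5)
                   WITHIN GROUP (ORDER BY LotFrontage)
                   OVER (PARTITION BY Neighborhood) AS MedianLotFrontage
        FROM dbo.AmesHousing
        WHERE LotFrontage IS NOT NULL
    )
    UPDATE h
    SET h.LotFrontage = m.MedianLotFrontage
    FROM dbo.AmesHousing AS h
        INNER JOIN medians AS m
            ON m.Neighborhood = h.Neighborhood
    WHERE h.LotFrontage IS NULL;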

Docker And R

Mara Averick has some resources to help you get started with running R in a Docker container:

liftr 📦 by Nan Xiao

liftr aims to solve the problem of persistent reproducible reporting. To achieve this goal, it extends the R Markdown metadata format, and uses Docker to containerize and render R Markdown documents.

Click through for those resources as well as an addictive 8-bit animated GIF.

Binder: Hosting Jupyter Notebooks

Julia Evans points out a really interesting service:

Binder lets you easily host interactive Jupyter notebooks and let anyone on the internet use them interactively immediately! It uses JupyterHub under the hood.

If you want to try it out, you can do that right now:

  1. Go to https://mybinder.org/v2/gh/jvns/pandas-cookbook/master (which will launch the github.com/jvns/pandas-cookbook repository)
  2. Wait for it to build and click ‘launch’
  3. Click ‘cookbook’, click a notebook, and play around! There’s an “A quick tour of the IPython Notebook” notebook which shows off some of the basic features.

It apparently uses Kubernetes + Docker under the hood which is interesting! It must be much much more expensive to run than the read-only services, but it’s such a useful and cool thing! I hope it continues to exist.

Definitely worth checking out.  I’m going to have to see the steps for getting an R runtime so I can post some of my own notebook repos.

Three Sessions And A Funeral

Solomon Rutzky explains what happens to sessions after they see the light at the end of the tunnel:

Sessions, in SQL Server, are born when a Connection is made from a client library to SQL Server. Temporary objects – Tables and/or Stored Procedures (yes, these are a thing) – may be created during a Session’s lifetime. The question is: for those temporary objects that are not explicitly dropped, what exactly happens to them? It is commonly known that they magically (ok fine, “automagically” — ok, ok, FINE, “automatically”) get dropped. But when do they get dropped? When the Session ends, right? And the Session ends when the Connection is closed, right? Well, that is certainly the common / conventional wisdom, at least. But is that understanding of the nature of Sessions and temporary objects correct?

It’s a more complicated topic than you might expect from first appearances.
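
If you want to poke at this yourself before reading Solomon's answer, one simple experiment is to create a temporary table in one session and watch for it from another. The object names here are just illustrative:

    -- Session 1: create a temp table and leave the connection open.
    CREATE TABLE #Orders (OrderID int NOT NULL);

    -- Session 2: temp table names get a uniquifying suffix in tempdb,
    -- so search with LIKE rather than an exact match.
    SELECT name, create_date
    FROM tempdb.sys.objects
    WHERE name LIKE N'#Orders%';

    -- Now close the first connection and re-run the query in session 2.
    -- Whether (and when) the row disappears is exactly the sort of
    -- behavior Solomon digs into.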

When Join Order Matters

Bert Wagner takes a look at one of the lesser appreciated tricks in performance tuning:

I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

…I’ve been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes. And no.

One of my favorite query tuning books is SQL Tuning by Dan Tow.  Parts of it are rather dated at this point—like pretty much anything involving a rule-based optimizer—but the gist still works well.  What it comes down to is finding the best single table from which to drive your query (based on table size, filters, etc.) and selecting the appropriate join order afterward.  It’s a fairly time-consuming effort, but for the 0.5-1% of queries which really need it, it can be the difference between an awful plan and a good one.
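
One cheap way to experiment with join order is to write the joins in the order you want to test and force the optimizer to honor it, then compare that plan against the unhinted one. A hedged sketch, with made-up table and column names:

    SELECT o.OrderID, c.CustomerName, d.ProductID
    FROM dbo.Customers AS c          -- candidate driving table: it has the selective filter
        INNER JOIN dbo.Orders AS o
            ON o.CustomerID = c.CustomerID
        INNER JOIN dbo.OrderDetails AS d
            ON d.OrderID = o.OrderID
    WHERE c.Region = N'West'
    OPTION (FORCE ORDER);            -- honor the written join order for comparison

FORCE ORDER is a blunt instrument and rarely belongs in production code, but it is handy for testing whether the optimizer's chosen join order is actually the problem.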
