Press "Enter" to skip to content

Curated SQL Posts

Using VALUES for Multi-Record Operations

Daniel Hutmacher explains some of what you can do with the VALUES clause:

Note the commas at the end of each line, denoting that a new row begins here. Because this runs as a single statement, the INSERT runs as an atomic operation, meaning that all rows are inserted, or none at all (like if there’s a syntax issue or a constraint violation).

I use this construct all the time to generate scripts to import data from various external sources, like Excel, or even a result set in Management Studio or Azure Data Studio.

Daniel also has a new app for us to try out.

Comments closed

Using Docker Volumes to Hold SQL Server Databases

John Morehouse shows how to use volumes to expose data—such as SQL Server data and log files—to a Docker container:

Over the past couple of blog posts, I have been talking about the versatility of deploying SQL Server with Docker.  This combination is a great way to quickly and easily spin up local SQL Server instances.  In the most recent post, I talked about a method to copy and restore a sample database into a Docker container that is running SQL Server.   In this post, I am going to talk about an easier way to accomplish this by attaching a persistent volume to the container.   With this method you don’t have to copy any files into the container, and it makes the overall process easier and repeatable.

First, before we get into the code, let’s talk about what a volume is.  Essentially, a volume is a location on the host machine that can be referenced by the container.  I think of this as a shared folder that the container can see.  Once attached to the container, it can then read or write to the volume.   You can easily declare the volume when you create the container with a simple switch in the command.

John’s examples are on a Mac, but the concepts are essentially the same for Windows or Linux.

Comments closed

Kubernetes on Virtualized Hardware

Chris Adkin gives us the pros and cons of running Kubernetes on virtual hardware:

A full discussion on Kubernetes security is beyond the scope of this blog post. However, the Mitre Att&ck Framework provides a comprehensive matrix of security attack patterns. Microsoft have produced a similar style of matrix to cover Kubernetes in this blog. As per the blog, resource hijacking and lateral movement have ramifications for multi-tenant platforms and Kubernetes application delivery techniques via things such as GitOps – where you may have one Kubernetes cluster per code branch. Putting nodes in their own virtual machines, provides an extra layer of defense that can reduce the impact of pods that might become malicious as the result of an attack. VMware vSphere 7.0 (more on this later) takes this concept further by running each pod in its own light weight virtual machine.

Click through for a breakdown of each side’s arguments.

Comments closed

Key Lookups and Self-Joins

Erik Darling has an interesting method for eliminating key lookups:

This post isn’t going to go terribly deep into anything, but I do want to make a few things about them more clear, because I don’t usually see them mentioned anywhere.

1. Lookups are joins between two indexes on the same table
2. Lookups can only be done via nested loops joins
3. Lookups can’t be moved around in the execution plan

I don’t want you to think that every lookup is bad and needs to be fixed, but I do want you to understand some of the limitations around optimizing them.

Definitely worth the read.

Comments closed

Good Practices when Combining Spark with Cassandra

Valerie Parham-Thompson shares some insights for working with Spark and Cassandra together:

Although we are focusing on Cassandra as the data storage in this presentation, other storage sources and destinations are possible. Another frequently used data storage option is Hadoop HDFS. The previously mentioned spark-cassandra-connector has capabilities to write results to Cassandra, and in the case of batch loading, to read data directly from Cassandra.

Native data output formats available include both JSON and Parquet. The Parquet format in particular is useful for writing to AWS S3. See https://aws.amazon.com/about-aws/whats-new/2018/09/amazon-s3-announces-new-features-for-s3-select/ for more information on querying S3 files stored in Parquet format. A good use case for this is archiving data from Cassandra.

Read on for more advice.

Comments closed

The Either Monad in Scala

Jyoti Sachdeva explains some of the power of Either:

We use Options in scala but why do we want to go for Either?

Either is a better approach in the respect that if something fails we can track down the reason, which in Option None case is not possible.
We simply pass None but what is the reason we got None instead of Some. We will see how to tackle this scenario using Either.

This is a classic functional programming pattern and one of the easier monads to understand.

Comments closed

ML Services and Resource Governor

I have a post on two gotchas you might run into around Resource Governor throttling SQL Server Machine Learning Services:

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.

Click through for the other issue.

Comments closed

ADF.Procfwk Version 1.8

Paul Andrew has been busy:

Following more great feedback from the Data Platform community the primary goal of this release was to further improve the resilience of the framework processing. These improvements included its restart clean up capabilities and introducing better dependency chain handling between Worker pipelines when failures occur. The latter builds on the existing restart functionality first introduced in release v1.2 and supplements the logic using a new set of pipeline dependency metadata. I’ve created the below visual to conceptually show the new dependency chain behaviour, should you wish to populate and make use of the new metadata handling.

Read on for the full changelog.

Comments closed

A Note on Distributed Network Names

Allan Hirt provides an explanation around Distributed Network Names when building Windows Server Failover Clusters on Windows Server 2019:

The new Windows Server 2019 DNN functionality does have a side effect that does affect Azure-based configurations. When creating a WSFC, Windows Server 2019 detects that the VM is running in Azure and will use a DNN for the WSFC name. This is the default behavior.

I clipped this paragraph specifically because Allan uses both “affect” and “effect” correctly, and I wanted to call that out. Do read the rest of it as well.

Comments closed

Understanding the Identity Cache

Jignesh Raiyani explains the purpose of the identity cache and shows how you might disable it if you want to:

In the release of SQL Server 2012, Microsoft has introduced a feature of Identity Cache. The identity jump doesn’t cause any issue neither with the database nor the running tasks; however, this identity value gap is not acceptable in some of the business-oriented scenarios. This Identity counter values of the columns are stored in the system table separately, which is known as sys.identity_columns DMV. This reference object sys.identity_columns does not rely on the transaction status, and it doesn’t matter that the transaction on the user table is committed or rolled back. Now, in case this user transaction is being rolled back, we will find that the identity value has been skipped in the sys.identity_columns table. The reason being is that the value of the Identity column value does not get rolled back. Apart from the above scenario, there could be various reasons or situations where these issues of the skip of the Identity value will be seen, which will be small in number and random. However, when there is an issue of Identity value jump, the amount of the skip value will be prominent and precise in number, depending upon the column data type, which is easy to identify.

My advice is, don’t disable this unless you have a really good reason to. “I don’t want to see numbers skipped” is a bad reason because identity columns never guarantee sequential insertion. You can still have skips due to transaction rollbacks or other errors on insertion, and the identity cache helps with performance problems in cases with lots of writers inserting rows into the table.

Comments closed