Press "Enter" to skip to content

Curated SQL Posts

Benefits Of Explicit Transactions

Kendra Little talks about explicit transactions and when they’re useful for single-statement operations:

If you do not enable implicit transactions, and you don’t start an explicit transaction, you are in the default “autocommit” mode.

This mode means that individual statements are automatically committed or rolled back as whole units. You can’t end up in a place where only half your statement is committed.

Our question is really about whether there are unseen problems with this default mode of autocommit for single-statement units of work.

By force of habit, I wrap data modification operations in explicit transactions. They let me test my changes before committing, and the time you’re most likely to spot an error seems to be right after hitting F5.
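
As a minimal sketch (dbo.Orders and the predicate are made up for illustration), the habit looks like this:

BEGIN TRANSACTION;

UPDATE dbo.Orders
SET Status = 'Cancelled'
WHERE OrderID = 12345;

-- Check @@ROWCOUNT and/or SELECT the affected rows here.
-- If the change looks right:
COMMIT TRANSACTION;
-- If it does not:
-- ROLLBACK TRANSACTION;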


Read-Only Databases And Single-User Mode

David Fowler notes an old bug in SQL Server 2012 and 2014 which bit him recently:

Here’s a strange one that I’ve recently come across. I had a customer report that their log shipping restore jobs were chock-a-block with errors. Now, the logs seem to have been restoring just fine, but before every restore attempt the job is reporting the error,

Error: Failed to update database “DATABASE NAME” because the database is read-only.

Unfortunately I haven’t got any direct access to the server, but their log shipping is set up to disconnect users before each restore and leave the database in standby afterwards. After a bit of to-ing and fro-ing, I asked the customer to send me a trace file covering the period that the restore job ran.

Read on for the details and keep those servers patched.
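
For context, the restore step on the secondary in this kind of setup looks roughly like the following (the database name and file paths are placeholders):

-- Restore the next log backup and leave the database readable in standby mode
RESTORE LOG [DATABASE_NAME]
FROM DISK = N'\\BackupShare\DATABASE_NAME\DATABASE_NAME_log.trn'
WITH STANDBY = N'D:\MSSQL\Data\DATABASE_NAME_standby.tuf';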


Classifying Data In SSMS

Steve Jones gives SQL Server Management Studio 17.5 a spin and tries to classify some data:

There’s a getting started link, which takes me to the SQL Server Security Blog. I suspect that’s an incorrect link. I think it should go here: SQL Data Discovery and Classification.

Below this, I see a list of the recommendations. This has grabbed tables that appear to contain data that might be sensitive and require classification. One of the tenets of the GDPR is that you know your data. You aren’t allowed to figure this out later, but rather you must proactively know what data you are collecting and processing.

It’s a good overview of the feature. As Steve mentions, I appreciate that this data is stored as extended properties: that way, third-party and custom-built tools can make use of it. You can also script them out for migration.
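
If you want to query those properties directly, something like this should work; note that the property names (sys_information_type_name and sys_sensitivity_label_name) are what I believe SSMS writes, so verify them on your own instance:

SELECT SCHEMA_NAME(t.schema_id) AS schema_name,
       t.name AS table_name,
       c.name AS column_name,
       ep.name AS classification_property,
       CAST(ep.value AS nvarchar(256)) AS classification_value
FROM sys.extended_properties AS ep
JOIN sys.tables AS t
  ON ep.major_id = t.object_id
JOIN sys.columns AS c
  ON ep.major_id = c.object_id
 AND ep.minor_id = c.column_id
WHERE ep.name IN (N'sys_information_type_name', N'sys_sensitivity_label_name');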


Discovering Composite Keys

John Morehouse shares some good information on composite keys, including a few scripts:

As I started to work on this, my first thought was that it would be helpful to know how many tables had a composite primary key.  This would give me an idea of how many tables I was dealing with.  Thankfully, SQL Server exposes this information through system DMVs (dynamic management views) along with the COL_NAME function.

Note: the COL_NAME function will only work with SQL Server 2008 and newer.  

All of this time, I’d never known about COL_NAME.
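
Not necessarily John’s exact script, but here is one quick way to count composite primary keys using the catalog views, with COL_NAME available to resolve the individual column names:

SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
       OBJECT_NAME(i.object_id) AS table_name,
       i.name AS pk_name,
       COUNT(*) AS key_column_count   -- COL_NAME(ic.object_id, ic.column_id) returns each key column's name
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON i.object_id = ic.object_id
 AND i.index_id = ic.index_id
WHERE i.is_primary_key = 1
GROUP BY i.object_id, i.name
HAVING COUNT(*) > 1
ORDER BY key_column_count DESC;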


Securing KSQL

Yeva Byzek shows the methods available to secure a Kafka Streams application:

To connect to a secured Kafka cluster, Kafka client applications need to provide their security credentials. In the same way, we configure KSQL such that the KSQL servers are authenticated and authorized, and data communication is encrypted when communicating with the Kafka cluster. We can configure KSQL for:

Read the whole thing if you’re thinking about using Kafka Streams.
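
The full list of options is in the post itself; as a rough sketch (assuming the standard Kafka client security properties are what get passed through in ksql-server.properties, which may differ in your version), the relevant settings look something like:

# Authenticate and encrypt the KSQL servers' connections to the Kafka cluster
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="ksql-user" password="ksql-user-secret";
ssl.truststore.location=/etc/kafka/secrets/kafka.client.truststore.jks
ssl.truststore.password=truststore-secret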


Deploying Jupyter Notebooks

Teja Srivastasa has an example of deploying a Jupyter notebook for production use on AWS:

No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.

But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.

Another option might be to use JupyterHub.


Reviewing The Team Data Science Process

I am starting a new series on launching a data science project, and my presentation quickly veers into a pessimistic place:

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data.  But the truth is that, in a real-world production scenario, we will never have truly clean data.  Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results.  Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data:  sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons.  Each of these can negatively impact your data, leaving you with invalid entries.

Read on for a few more adages which shape the way we work on projects, followed by an overview of the Microsoft Team Data Science Process.


Enabling Optimizer Fixes In SQL Server

Monica Rathbun explains that just upgrading a SQL Server database doesn’t enable optimizer changes:

When applying a new SQL Server cumulative update, hotfix, or upgrade, SQL Server doesn’t always apply all of the fixes in the patch. When you upgrade the database engine in-place, the databases you already had stay at their pre-upgrade compatibility level, which means they run under the older set of optimizer rules. Additionally, many optimizer fixes are not turned on. The reason for this is that, while they may improve overall query performance, they may have a negative impact on some queries. Microsoft actively avoids making breaking changes to its software.

To avoid any negative performance impacts, Microsoft has hidden optimizer fixes behind a trace flag, giving admins the option to enable or disable the updated fixes. To take advantage of optimizer fixes or improvements, you have to enable trace flag 4199 after applying each hotfix or update, or set it up as a startup parameter. Did you know this? This was something I learned while working with an existing system, years into my career. I honestly assumed it would just apply any applicable changes that were in the patch to my system. Trace flag 4199 was introduced in the SQL Server 2005 era. In SQL Server 2014, when Microsoft made changes to the cardinality estimator, it protected those changes with trace flags as well, giving you the option to run under compatibility level 120 and not have the cardinality estimator changes in effect.

Things changed starting with SQL Server 2016.

Click through to see how SQL Server 2016 made it a bit easier.
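
For reference, the pre-2016 mechanism is the trace flag, while SQL Server 2016 and later also expose it as a database-scoped configuration:

-- SQL Server 2005 through 2014: enable optimizer hotfixes instance-wide
DBCC TRACEON (4199, -1);

-- SQL Server 2016 and later: enable optimizer hotfixes for the current database
ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = ON;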


Log Shipping Tests With dbachecks

Sander Stad has a bonus post in his log shipping series:

We want everyone to know about this module. Chrissy LeMaire reached out to me and asked if I could write some tests for the log shipping part and I did.

Because I wrote the log shipping commands for dbatools, I was excited about creating a test that could be incorporated into this module for everyone to use.

That test is also quite easy to use, as Sander demonstrates.


Row Goals And Semi Joins

Paul White continues his row goals series:

The remaining physical join type is nested loops, which comes in two flavours: regular (uncorrelated) nested loops and apply nested loops (sometimes also referred to as a correlated or lateral join).

Regular nested loops join is similar to hash and merge join in that the join predicate is evaluated at the join. As before, this means there is no value in setting a row goal on either input. The left (upper) input will always be fully consumed eventually, and the inner input has no way to determine which row(s) should be prioritized, since we cannot know if a row will join or not until the predicate is tested at the join.

By contrast, an apply nested loops join has one or more outer references (correlated parameters) at the join, with the join predicate pushed down the inner (lower) side of the join. This creates an opportunity for the useful application of a row goal. Recall that a semi join only requires us to check for the existence of a row on join input B that matches the current row on join input A (thinking just about nested loops join strategies now).

In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).

This has the depth and quality that you naturally expect from Paul, making it an immediate read.
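
As a simple illustration of the shape Paul describes (the tables here are hypothetical), an EXISTS semi join lets the inner side of an apply stop at the first match for each outer row:

SELECT o.OrderID
FROM dbo.Orders AS o
WHERE EXISTS
(
    SELECT 1
    FROM dbo.OrderLines AS ol
    WHERE ol.OrderID = o.OrderID  -- pushed down as an outer reference in an apply plan
);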
