Sometimes we get lucky comparing a literal value to a column of a different type.
But this is very complicated, and joining on two columns of different types in the same family without explicitly converting the type of one of the columns resulted in worse performance in Paul White’s tests, when the columns allowed NULLs! (Note: I haven’t rerun those tests on 2016, but I think the general advice below still applies.)
General advice: don’t rely on being lucky. Pay attention to your data types, and compare values of the same data type wherever possible.
That’s great advice.
I half-stumbled on the weirdness around SQL_VARIANT a while back while writing another post about implicit conversion. What I didn’t get into at the time is that it can give you incorrect results.
When I see people using SQL_VARIANT, it’s often in dynamic SQL, when they don’t know what someone will pass in, or what they’ll compare it to. One issue with that is you’ll have to enclose most things in single quotes, in case a string or date is passed in. Ever try to hand SQL those without quotes? Bad news bears. You don’t get very far.
Read on for the demo. I have never used SQL_VARIANT in any project. I’ve done a lot of crazy things with SQL Server (some of them intentionally) but never this.
Notice that OrderYear displays decimal points. I switched the dataset in the Series field name property, and found that neither of the columns in the dataset can be used.
Numeric columns cannot be set as a Series name field. To work around this, I modified the dataset, casting OrderYear as a CHAR(4).
That’s not a great situation, but at least there’s a workaround.
Yeah, there you go, all these _WA_Sys_ stats tell me they have been automatically created (there is a flag in sys.stats if you don’t believe me) but I can see there are only 31, where I created 34 columns.
That’s funny, let’s see which data types did get statistics.
The results are pretty interesting.
Let’s take a look at what is being asked here. Using the 32-bit integer as an example, we currently have a data type that can accept a range between negative two billion and two billion. But if negative numbers aren’t required, we can use those same 32 bits to store numbers between zero and four billion. Why, goes the question, throw away that perfectly useful upper range by always reserving space for a negative range we may not need?
I appreciate Ewald’s thoughtfulness here in working out the value of the request as well as some of the difficulties in building something which fulfills his desire. Great read.
Because binary values are essentially strings, they easily convert to and from character strings, using CAST or CONVERT. To convert the binary value of 0x41 to a plain-text character value, trySELECT CAST(0x41 AS char(1)); --- 'A'
The binary value 0x41 is equivalent to decimal 65, and CHAR(65) is the letter “A”. Note that I haven’t placed any quotes around 0x41 – that’s because it’s a numeric value (albeit in hex notation) and not a string.
A couple use cases I’ve seen are creating hashes (SHA1 or MD5) for change detection, storing password hashes, and encrypted columns—Always Encrypted uses varbinary data types to store encrypted information, for example.
So at work, I’d say space matters, and in order to optimize our storage requirements it’s very important to know about SQL Server internals, specially the Storage Engine, which happens to be one of my favorite topics of study.
In my quest to release some space I got to this database, just one table which is 165M of XML documents stored as NVARCHAR(MAX).
It was interesting walking through the process. Some part of me wonders if it’s a bit complex for the next maintainer to handle, but saving a terabyte of disk space is worth a few extra pages of documentation…
SQL Server makes many good and successful attempts at something called predicate pushdown, or predicate pushing. This is where certain filter conditions are applied directly to the data access operation. It can sometimes prevent reading all the rows in a table, depending on index structure and if you’re searching on an equality vs. a range, or something else.
What it’s really good for is limiting data movement. When rows are filtered at access time, you avoid needing to pass them all to a separate operator in order to reduce them to the rows you’re actually interested in. Fun for you! Be extra cautious of filter operators happening really late in your execution plans.
Click through for Erik’s demo.
There is also considerable overhead in terms of memory usage. Every non-empty off-row value adds 50+ bytes of the overhead regardless of its size. Those 50+ bytes consist of three artificial ID values (in-row, off-row in data row and leaf-level of the range index) and off-row data row structure. It is even larger in case of LOB columns where data is stored in LOB Page Allocator.
One of the key points to remember that decision which columns go off-row is made based on the table schema. This is very different from on-disk tables, where such decision is made on per-row basis and depends on the data row size. With on-disk tables, data is stored in row when it fits on the data page.
In-Memory OLTP works in the different way. (Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This is a great article getting into the internals of how memory-optimized tables work in SQL Server 2016, as well as a solid reason to avoid LOB types and and very large VARCHAR values on memory-optimized tables if you can. Absolutely worth a read.
Sometimes you just can’t take the outage. In that case, you’ve got to proceed with your own wits, and your own code. This is tricky because changes are occurring to the table.
The solution typically looks like this:
Set up a way to track changes to the table – either triggers that duplicate off modifications or Change Data Capture (Enterprise Edition)
Create the new table with the new data type, set identity_insert on if needed
Insert data into the new table. This is typically done in small batches, so that you don’t overwhelm the log or impact performance too much. You may use a snapshot from the point at which you started tracking changes.
Start applying changed data to the new table
Make sure you’re cleaning up from the changed data you’re catching and not running out of space
Write scripts to compare data between the old and new tables to make sure you’re really in sync (possibly use a snapshot or a restored backup to compare a still point in time)
Cut over in a quick downtime at some point using renames, schema transfer, etc. If it’s an identity column, don’t forget to fix that up properly.
This method matches what I’ve done in zero downtime situations.
In part 3 of this series, I showed two workarounds to avoid widening an
IDENTITYcolumn – one that simply buys you time, and another that abandons
IDENTITYaltogether. The former prevents you from having to deal with external dependencies such as foreign keys, but the latter still doesn’t address that issue. In this post, I wanted to detail the approach I would take if I absolutely needed to move to
bigint, needed to minimize downtime, and had plenty of time for planning.
Because of all of the potential blockers and the need for minimal disruption, the approach might be seen as a little complex, and it only becomes more so if additional exotic features are being used (say, partitioning, In-Memory OLTP, or replication).
At a very high level, the approach is to create a set of shadow tables, where all the inserts are directed to a new copy of the table (with the larger data type), and the existence of the two sets of tables is as transparent as possible to the application and its users.
Those are two good posts on this topic.