Press "Enter" to skip to content

Category: Data Types

Sparse Columns in SQL Server

Chad Callihan occasionally inserts something:

Have you ever maxed out the SQL Server table column limit yet still needed more columns? Hopefully not, considering SQL Server has a max limit of 1024 columns per table. But as I found out, it’s possible for someone to reach out and ask for even more. Sparse columns are an option to consider when you can’t get enough. Let’s take a look at what sparse columns are and how they can be used.

Sparse columns have very little utility, except in the most “I don’t think you’re doing it right” scenarios. Still, if you happen to end up in that scenario, there is a way out, though I’d really want to understand the nature of the data in that problem and, knowing just the amount of detail in the scenario that I do, would lean toward storing the data either in an unpivoted fashion (one row per entity-attribute pair in an EAV-style “additional attributes” table) or as a JSON string and letting the client sort it out.
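
For anyone who hasn’t seen the feature, here is a minimal sketch of the syntax (my example, not Chad’s); the table and column names are invented:

    CREATE TABLE dbo.WideEntity
    (
        EntityId      INT IDENTITY(1, 1) PRIMARY KEY,
        -- SPARSE columns store NULLs at no cost in exchange for extra overhead on
        -- non-NULL values, and they raise the per-table limit from 1,024 columns to 30,000.
        Attribute0001 INT SPARSE NULL,
        Attribute0002 NVARCHAR(100) SPARSE NULL,
        -- Optional: a column set exposes every non-NULL sparse column as a single XML value.
        AllAttributes XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
    );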


Adding Microseconds to ADF Timestamps

Rayis Imayev can’t wait for the next second:

The current addToTime function (https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions#addToTime) in Azure Data Factory (ADF) only supports a specific set of time units ranging from Year to Seconds. Since I needed to increment a timestamp by microseconds, I had to find an alternative solution in ADF. Here are my findings on how to do this using an alternate approach.

Click through for Rayis’s solution to the problem.


Analyzing SQL Server Table Metadata

Barney Lawrence looks at details about a table:

For a while I’ve been building up a script that uses SQL Server’s INFORMATION_SCHEMA.COLUMNS view to create a query that applies aggregates to each column in a table and then formats the results to allow an at-a-glance summary of the shape of the data inside it.

I showed some of the techniques used to build this in a session regarding metadata and data warehousing which I co-presented with Emma Dolling and Ruth Pearson at SQLBits. The one comment that I heard the most after this session was “I want that script”. I did promise to make it available more than a few times, and so I’ve put together a GitHub repository to hold it and other resources we built for the session.

Barney anticipated the most likely change I’d make while reviewing the script: APPROX_COUNT_DISTINCT(), if you’re on SQL Server 2019 or later and dealing with a large table.
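
As a rough sketch of that swap (the table and column names are made up), the approximate version trades a small, bounded error for a much lighter memory footprint on large tables:

    -- Exact distinct count: accurate, but memory-hungry on very large tables.
    SELECT COUNT(DISTINCT CustomerId) AS distinct_customers
    FROM dbo.FactSales;

    -- Approximate distinct count (SQL Server 2019 and later): documented to stay
    -- within a 2% error rate with 97% probability, using far less memory.
    SELECT APPROX_COUNT_DISTINCT(CustomerId) AS approx_distinct_customers
    FROM dbo.FactSales;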


A Critique of XML

Andy Leonard isn’t XML’s biggest fan:

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

  1. XML is most useful with an XSD (XML Schema Definition), a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
  2. Speaking of which, CSVs run a high risk of the separator character showing up naturally in the data. The problem is that there is no single right way to indicate that “that comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully recognizes that you have a quoted delimiter and then helpfully removes the quotes before barfing on the commas inside them. There is actually an ANSI standard separator character which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process. (There’s a quick illustration of the problem after this list.)
  3. That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values works well.
  4. And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
  5. Basically, this is why we can’t have nice things.
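
To make point 2 concrete, here is a small, contrived T-SQL illustration (mine, not Andy’s) of why naive splitting falls apart on quoted separators:

    -- The second field legitimately contains a comma inside quotes.
    DECLARE @line NVARCHAR(200) = N'42,"Smith, John",2023-01-15';

    -- STRING_SPLIT has no notion of quoting, so it treats the embedded comma as a
    -- separator and returns four values instead of three.
    SELECT value
    FROM STRING_SPLIT(@line, N',');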

Data Types and CONCAT_WS

Koen Verbeeck hits a concatenation issue:

I was writing some dynamic SQL that generates some SQL statements to load my facts and dimensions into a data warehouse. Some of those SQL statements can become very long, for example if a dimension has a lot of columns. When debugging, I noticed a couple of statements failing with various errors. Turns out, they were truncated after 4000 characters. What was going on?

Read on to see what happened.
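
Koen’s post has the actual diagnosis, but the usual culprit with this symptom is that CONCAT_WS (like CONCAT) returns NVARCHAR(4000) unless at least one input is a MAX type, so long results get silently cut off. A minimal repro of that behavior:

    DECLARE @part1 NVARCHAR(4000) = REPLICATE(N'a', 3000);
    DECLARE @part2 NVARCHAR(4000) = REPLICATE(N'b', 3000);

    -- No input is NVARCHAR(MAX), so the result type caps out at NVARCHAR(4000).
    SELECT LEN(CONCAT_WS(N' ', @part1, @part2)) AS truncated_length;    -- 4000

    -- Casting any one input to NVARCHAR(MAX) promotes the whole result.
    SELECT LEN(CONCAT_WS(N' ', CAST(@part1 AS NVARCHAR(MAX)), @part2)) AS full_length;    -- 6001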


Comparative Data Type Efficiency in Power BI

Soheil Bakhshi doesn’t want everything in Power BI to be decimals and strings:

In my consulting experience working with customers using Power BI, many challenges that Power BI developers face are due to neglect of data types. Here are some common challenges that are the direct or indirect results of inappropriate data types and data type conversion:

  • Getting incorrect results while all calculations in your data model are correct.
  • Poorly performing data model.
  • Bloated model size.
  • Difficulties in configuring user-defined aggregations (agg awareness).
  • Difficulties in setting up incremental data refresh.
  • Getting blank visuals after the first data refresh in Power BI service.

Read on for a dive into these problems.


Another Problem with Nullable Columns

Erik Darling tells the unpleasant truth about NULL:

Table definitions have a similar effect on developers. In today’s post, I’m going to use temp tables as an example, but the same thing can happen with regular tables, too.

The issue isn’t with NULL values themselves, of course. The table definition we’re going to use will allow NULLs, but no NULLs will be present in the data.

The issue is with how you query NULLable columns, even when no NULLs are present.

As a card-carrying member of the League of No-Null Workers, I am happy to tell you all about how much I hate NULL in databases. And this isn’t even the big reason.
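
Erik has his own demo, so treat this as a generic sketch of the kind of thing that can go sideways: the classic case is NOT IN against a NULLable column, where the optimizer has to guard against NULLs that never actually show up.

    -- Hypothetical temp tables: the columns allow NULLs, but no NULLs are ever inserted.
    CREATE TABLE #orders  (customer_id INT NULL);
    CREATE TABLE #blocked (customer_id INT NULL);

    INSERT #orders  (customer_id) VALUES (1), (2), (3);
    INSERT #blocked (customer_id) VALUES (2);

    -- Because #blocked.customer_id is NULLable, the NOT IN form has to account for
    -- NULLs that will never appear, which typically yields a more expensive plan than...
    SELECT o.customer_id
    FROM #orders AS o
    WHERE o.customer_id NOT IN (SELECT b.customer_id FROM #blocked AS b);

    -- ...the NOT EXISTS form, or the same query against NOT NULL columns.
    SELECT o.customer_id
    FROM #orders AS o
    WHERE NOT EXISTS (SELECT 1 FROM #blocked AS b WHERE b.customer_id = o.customer_id);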


Dynamic SQL and String Data Types

Erik Darling theorizes about strings:

If you write the good kind of dynamic SQL, that is:

1. Parameterized

2. Executed with sp_executesql

You’ll probably have run into some silly-ish errors in practice. Namely, that sp_executesql expects your SQL string and your parameter string to be NVARCHAR(…).

Read on for some thoughts on data types, max-ness, and the like.
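
For reference, the pattern Erik is describing looks roughly like this; dbo.Users and its reputation column are stand-ins, not anything from his post:

    -- Both the statement and the parameter-definition string must be NVARCHAR
    -- (hence the N prefixes); handing sp_executesql a VARCHAR is what triggers
    -- the silly-ish errors.
    DECLARE @sql      NVARCHAR(MAX)  = N'SELECT COUNT(*) FROM dbo.Users WHERE reputation >= @rep;';
    DECLARE @paramdef NVARCHAR(4000) = N'@rep INT';

    EXEC sys.sp_executesql @sql, @paramdef, @rep = 10000;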


Conversion Precedence in SQL Server

Kenneth Fisher sorts out data types:

With the expectation that you would get back a decimal. Did you wonder why it converted the integers to decimals rather than the other way around? If not, you probably should have. I’ll give you a hint: it’s not because it’s the first value in the equation, although that does make a difference.

Thought about it? In case you didn’t figure it out, T-SQL (and again, I’m going to say all programming languages where implicit conversions are allowed) has a conversion precedence order. In other words, in any calculation where an implicit conversion is required, the data type with lower precedence is converted to the data type with higher precedence.

The parenthetical statement above is correct. This makes things really convenient until you have an implicit conversion you weren’t expecting. There’s an advantage to explicit-only languages, though those do involve more typing.
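
A quick way to see the rule in action (my example, not necessarily Kenneth’s) is to check the type of a mixed-type expression:

    -- INT has lower precedence than DECIMAL/NUMERIC, so the 1 is implicitly
    -- converted and the result is a decimal, not an integer.
    SELECT 1 + 2.5 AS result;    -- 3.5

    -- SQL_VARIANT_PROPERTY confirms the base type of the result.
    SELECT SQL_VARIANT_PROPERTY(1 + 2.5, 'BaseType') AS result_type;    -- numeric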


Data Type Conversions in SQL Server

Kenneth Fisher talks types:

Did you know that 1/2=0? If you didn’t, you could be in for some interesting surprises.

1/2 returns 0 because 1 and 2 are both integers. When you divide two integers you get back an integer. Now, we, as human beings, know that what you probably wanted was to get back 0.5. Unfortunately, computers really aren’t all that smart and we have to be very explicit with them. 

Read on for more details, as well as a chart of which conversions may be implicit and which must be explicit.
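
A compact illustration of the behavior and the usual ways around it (my sketch, not Kenneth’s):

    SELECT 1 / 2                          AS int_division,       -- 0: both operands are INT
           1.0 / 2                        AS implicit_decimal,   -- 0.5: the decimal literal forces an implicit conversion
           CAST(1 AS DECIMAL(10, 4)) / 2  AS explicit_decimal;   -- 0.5: explicit conversion of one operand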
