Category: Data Types

Choosing between Data Types

Published 2024-09-03 by Kevin Feasel

An entire post on choosing the correct data types either seems like overkill or much needed and overdue. The perspective might vary based on the databases you’ve worked with recently. I decided to write this after seeing some code with data type decisions that I would classify as questionable. There are many decisions in technology that can be ambiguous, but the correct date type should be based on business rules and a set of technical guidelines. I am going to share my thought process for deciphering the correct type here.

Selecting a data type is an important part of database and table design. The column represents an actual business attribute, is used to support data integrity, or is used for performance considerations. Care should be used when selecting the definition for each column. Choosing the wrong type can impact each of these areas, makes the system difficult to work with, and makes integrations harder than necessary.

Read on for Ben’s selection criteria and further thoughts. Most of it I find quite sensible.

I do, however, strongly disagree on part of Ben’s Unicode character strings recommendation, as I am an NVARCHAR Everywhere kind of guy. My counter-recommendation is always to use NVARCHAR over VARCHAR (I tend to be a bit more flexible about NCHAR vs CHAR, as those are typically for flags versus user-relevant data), and use row-level or page-level compression on indexes wherever it makes sense. If you don’t have any characters outside of your codepage in any row of that data, the size will be the same as with VARCHAR. If you do have the need for special characters, you don’t need to rebuild everything from scratch as part of a half-year (or longer) internationalization program. And if you’re consistent about always using NVARCHAR, you also eliminate implicit conversion risk.

Comments closed

Power BI Data Type Optimization

Published 2024-08-20 by Kevin Feasel

Nikola Ilic shows how important it can be to choose the right data types:

For demo purposes, I’ll be using a fact table that contains the data about chats performed by a customer support department of the fictitious company Customer First. This table includes approximately 9 million rows, which is not considered a large table in the context of Power BI and analytical workloads. For the sake of simplicity, let’s pretend that our model consists of only this single table. Finally, a semantic model is configured as an Import mode model. If you want to learn how your data is stored in Power BI, I suggest you start by reading this article first.

Data was loaded into Power BI from the underlying data source (SQL Server database) as-is, without any additional optimizations applied.

Nikola walks through the process of finding the most expensive columns in terms of data size and using the least precise acceptable value. One other thing that I commonly see is identity columns or other keys on fact tables. Those are very rarely necessary, because the point of a fact table is typically to aggregate it in some fashion. And these keys are unique (by design), meaning they won’t compress very well and will take up a lot of space. Looking at Nikola’s example, my next question would be, knowing that the name of the table is factChat, does chatID tie to some chat dimension? If not, is it actually necessary for reporting? Again, if not, that could shave off another 60 MB or so from the data model.

Comments closed

Changing the Data Type of a PostgreSQL Column Used in a View

Published 2024-08-16 by Kevin Feasel

Andrea Gnemmi makes a change:

We have all encountered the need to change a column data type in a table. Unfortunately, in PostgreSQL, you cannot change it in a single step if the column is used in a view. In this article, we cover the steps to change the data type for a column in a PostgreSQL table.

Read on for those steps. This is a bit more complex than SQL Server, at least for non-materialized views. In that case, the view is little more than a saved T-SQL query, so you would want to run sp_refresh [view_schema].[view_name] to ensure that the view has the latest metadata and doesn’t fail when you run the query.

Comments closed

Avoid Mixing DATETIME with other Date Types

Published 2024-08-12 by Kevin Feasel

Paul White shares some advice:

Microsoft encourages us not to use the datetime data type:

Avoid using datetime for new work. Instead, use the time, date, datetime2, and datetimeoffset data types. These types align with the SQL Standard, and are more portable. time, datetime2 and datetimeoffset provide more seconds precision. datetimeoffset provides time zone support for globally deployed applications.

Well, ok. Sensible and well-informed people might still choose to use datetime for performance reasons. Common date and time functions have optimised implementations in the SQL Server expression service for the datetime and smalldatetime data types.

Paul has posted the full article on X.

Comments closed

The Internals of DATETIME2

Published 2024-08-09 by Kevin Feasel

Chad Baldwin digs in:

I noticed in sys.column_store_segments the min_data_id and max_data_id columns store very large bigint values in the segments for datetime2 columns. After doing a bit more googling and tinkering, I found for bit/tinyint/smallint/int/bigint it stores the min/max of the actual values rather than dictionary lookup values. So I assume it’s likely doing the same for date/time/datetime/datetime2 and storing some sort of bigint representation of the actual value.

This post is going to focus on datetime2(7) datatypes mainly because that’s what I was dealing with. Though I’m sure it wouldn’t be much work to figure out the other types.

Click through to learn more about the datatype and see how this wraps into a discussion of temporal table cleanup and columnstore indexes.

Comments closed

Reading Pipe-Delimited Files in SSIS

Published 2024-07-19 by Kevin Feasel

Greg Low does a bit of parsing:

There was a question on the Q&A forums today, asking how to read data using SSIS, when it’s in this format:

|Col1| |Col2|Col3|Col| |101| |A|21|DC|

One of the concerns was that there was a leading pipe. This is not a problem. When you have data like that, and you set | as the delimiter, because there are 6 delimiters, then there are 7 columns output.

Read on for Greg’s answer. This is a big part of why I recommend people to use Parquet (or ORC, if you’re an iconoclast like I am): not needing to deal with whatever silly, arbitrary file formats people come up with.

Comments closed

Removing Leading Zeroes from a String in T-SQL

Published 2024-07-12 by Kevin Feasel

Steve Stedman gets rid of leading zeroes:

When working with data in SQL Server, there may be times when you need to remove leading zeros from a string. This task can be particularly common when dealing with numerical data stored as strings, such as ZIP codes, product codes, or other formatted numbers. In this blog post, we’ll explore several methods to remove leading zeros in SQL Server.

I’m not sure I see the reason to use anything other than CAST() (or, better yet, TRY_CAST()), but Steve does show two other methods.

2 Comments

Bit Column Order and Data Length

Published 2024-07-03 by Kevin Feasel

Brent Ozar performs an experiment:

At the PGConf.dev, where Postgres developers get together and strategize the work they wanna do for the next version, I attended a session where Matthias van de Meent talked about changing the way Postgres stores columns. As of right now (Postgres 17), columns are aligned in 8-bit intervals, so if you create a table with alternating columns:

Read on to see an example, and then Brent performs a test to see how SQL Server handles this scenario. The comments also mention that at least older versions of Oracle behaved like Postgres.

Comments closed

Working with XML in SQL Server

Published 2024-06-05 by Kevin Feasel

Ed Pollack talks XML:

XML is a common storage format for data, metadata, parameters, or other semi-structured data. Because of this, it often finds its way into SQL Server databases and needs to be managed alongside other data types.

Even though a relational database is not the optimal place to store and manage XML data, it is often needed due to application requirements, convenience, or a need to maintain this information in close proximity to other app data.

This article dives into a variety of common XML challenges and the functionality included in SQL Server to help make managing them as simple as possible.

Ed does a good job of walking through what you can do. My general philosophy on XML and JSON in the database is simple: if you simply want a place to store some JSON or XML outputs and retrieve the results exactly as they are without performing any searches or transformations, write as JSON/XML. If you want to use the database to search through JSON/XML records for particular attributes and values, or if you want to reshape the JSON/XML data within the database, create a proper data model for this input.

Comments closed

Uniqueidentifier Ordering in SQL Server

Published 2024-05-31 by Kevin Feasel

Jose Manuel Jurado Diaz clears up the mystery:

Today, I worked on a service request that our customer asked about how SQL Server sorts the uniqueidentifier data type. We know that uniqueidentifier store globally unique identifiers (GUIDs). GUIDs are widely used for unique keys due to their extremely low probability of duplication. One common method to generate a GUID in SQL Server is by using the NEWID() function. However, the ordering of GUIDs, especially those generated by NEWID(), can appear non-intuitive. I would like to share my lessons learned how to determine the ordering method using uniqueidentifier and NEWID().

Yeah, things get pretty weird because x86-x64 processors (Intel and AMD both) are Little Endian, and only the fourth chunk of a GUID is Big Endian.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31