Brent Ozar lays out an argument:
You’re looking at a wide table with 100-200 columns.
Years ago, it started as a “normal” table with maybe 10-20, but over the years, people kept gradually adding one column after another. Now, this behemoth is causing you problems because:
- The table’s size on disk is huge
- Queries are slow, especially when they do table scans
- People are still asking for more columns, and you feel guilty saying yes
You’ve started to think about vertical partitioning: splitting the table up into one table with the commonly used columns, and another table with the rarely used columns. You figure you’ll only join to the rarely-used table when you need data from it.
Read on to understand why this is rarely a good idea and what you can do instead.
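In schema terms, the split Brent describes looks something like the sketch below. The table and column names are hypothetical, purely for illustration:

```sql
-- Hypothetical vertical partition of a wide table.
-- The commonly used ("hot") columns stay in the main table...
CREATE TABLE dbo.Customers
(
    CustomerID INT NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    CreatedDate DATETIME2(0) NOT NULL
    -- ...plus the other commonly queried columns.
);

-- ...and the rarely used columns move to a 1:1 side table.
CREATE TABLE dbo.CustomersExtended
(
    CustomerID INT NOT NULL PRIMARY KEY
        REFERENCES dbo.Customers (CustomerID),
    LegacyNotes NVARCHAR(MAX) NULL,
    ImportedXml XML NULL
    -- ...plus the rest of the rarely used columns.
);

-- Queries that need the extra columns join back to the side table.
SELECT c.CustomerID, c.CustomerName, ce.LegacyNotes
FROM dbo.Customers AS c
    LEFT OUTER JOIN dbo.CustomersExtended AS ce
        ON ce.CustomerID = c.CustomerID
WHERE c.CustomerID = 42;
```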
I will say that I’ve had success with vertical partitioning in very specific circumstances:
- There are large columns, like blobs of JSON, binary data, or very large text strings.
- There exists a subset of columns the application (or caller) rarely needs.
- Those large columns fall in the subset of columns the caller rarely needs, or are ones the caller can retrieve via a point lookup.
For a concrete example, my team at a prior company worked on a product that performed demand forecasting on approximately 10 million products across the customer base. For each product, we chose between a common model (if its sales fit a common pattern) and a custom-generated model. Because we were using SQL Server Machine Learning Services, we needed to store those custom models in the database, and each serialized model, even compressed, could run anywhere from kilobytes to megabytes in size. We only needed to retrieve the model binary during training or inference, never for reporting; the reports we did have tracked only what kind of model each product used, whether standard or custom (and if custom, which algorithm). Thus, we stored the model binary in its own table, separate from the rest of the model metadata.
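The layout was roughly the shape of the following sketch. These table and column names are illustrative rather than the actual schema we used:

```sql
-- Illustrative sketch of the metadata/binary split (not the actual schema).
-- Reports only ever touch the metadata table.
CREATE TABLE dbo.ProductModel
(
    ProductID BIGINT NOT NULL PRIMARY KEY,
    ModelType VARCHAR(20) NOT NULL,    -- e.g., 'Standard' or 'Custom'
    Algorithm VARCHAR(50) NULL,        -- populated only for custom models
    LastTrainedDate DATETIME2(0) NULL
);

-- The compressed model binary lives in its own table and is retrieved
-- only during training or inference, via a point lookup on ProductID.
CREATE TABLE dbo.ProductModelBinary
(
    ProductID BIGINT NOT NULL PRIMARY KEY
        REFERENCES dbo.ProductModel (ProductID),
    ModelObject VARBINARY(MAX) NOT NULL
);

-- Reporting query: never reads the large binary column.
SELECT ModelType, Algorithm, COUNT(*) AS NumberOfProducts
FROM dbo.ProductModel
GROUP BY ModelType, Algorithm;

-- Training/inference retrieval: a single-row point lookup.
DECLARE @ProductID BIGINT = 12345;

SELECT ModelObject
FROM dbo.ProductModelBinary
WHERE ProductID = @ProductID;
```

The point of the split was that the large VARBINARY column never appeared in the scan-heavy reporting workload; it only ever showed up in single-row lookups during training and inference.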