Press "Enter" to skip to content

Stack Overflow DB, Columnstore Edition

Erik Darling has started a new series. Part one is the intro:

I really wanted a version of the Stack Overflow data dump that was all clustered column store. In SQL Server 2016, that didn’t really work because of restrictions around MAX data types. In 2017 it did, but… If your data warehouse has a bunch of max data type columns, that’s bad and you should feel bad.

The problem here is that once you drop out the “big” columns (AboutMe from Users, Text from Comments, Body from Posts), the entire ~300GB database compressed down to about 6GB. That means if we want a realistically sized data warehouse, we’d need a “Make Big” script, like people used to use for Adventure Works before it went out of business.

Part 2, like a noble spirit, embiggens the smallest man:

One thing I’d love feedback on is advancing dates. Right now, the script doesn’t do that at all. I thought of different ways to handle it, but didn’t like any of them, mostly because of how it might mess with the partitioning function I’m using. I felt like I was overthinking it quite a bit, and decided to leave dates as-is, and only increment User and Post Ids.

A quick note: This script assumes that a database called StackOverflow will be the source of the loads. If you need to use a different version, that’s a manual change. I didn’t want to go down the dynamic SQL route here until I gauged popularity.

If you want to play along at home, you can grab the data dump from archive.org.