Press "Enter" to skip to content

String Regularization and Tokenization in SQL Server

Aaron Bertrand saves some space:

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.