R Internals: Data Sizes With Nullable Columns

Niels Berglund digs into the Binary Exchange Langage (BXL) and notices something weird about data sizes:

When looking at the data sent, the size of the packages and “drilling” into the TCP packets we could deduct that: :

  • Each column has an over-head of 32 bytes (at least for non nullable data)

  • The size of the column in one row is the size of the data type for numeric types.

  • For decimal and numeric an extra byte is added to each column, where this byte indicates the precision.

  • Columns of alpha numeric type all had 2 bytes pre-pended to the bytes, except max types.

  • For char and nchar the storage size was 2 bytes plus the size the column was defined as.

  • For varchar and nvarchar the storage size was 2 bytes plus the size of the data stored.

  • For the varmax data types the number of bytes that were pre-pended varied dependent on the data size.

Read the whole thing.

Related Posts

Diving Into Clustered Index Seeks

Hugo Kornelis looks in detail at the clustered index seek: The Clustered Index Seek operator uses the structure of a clustered index to efficiently find either single rows (singleton seek) or specific subsets of rows (range seek). Because a clustered index always contains all columns in a table, a Clustered Index Seek is one of the most […]

Read More

Biases in Tree-Based Models

Nina Zumel looks at tree-based ensembling models like random forest and gradient boost and shows that they can be biased: In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data […]

Read More

Categories

November 2017
MTWTFSS
« Oct Dec »
 12345
6789101112
13141516171819
20212223242526
27282930