Creating Identity Columns in Databricks

Franco Patano generates some identity integers:

Identity columns solve the issues mentioned above and provide a simple, performant solution for generating surrogate keys. Delta Lake is the first data lake protocol to enable identity columns for surrogate key generation.

Delta Lake now supports creating IDENTITY columns that can automatically generate unique, auto-incrementing ID numbers when new rows are loaded. While these ID numbers may not be consecutive, Delta makes the best effort to keep the gap as small as possible. You can use this feature to create surrogate keys for your data warehousing workloads easily.
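For a sense of what this looks like in practice, here is a minimal sketch that assumes a Databricks runtime with identity column support and an existing SparkSession passed in as `spark`. The table name, the column names, and the `load_customers` helper are hypothetical, made up for illustration; only the `GENERATED ALWAYS AS IDENTITY` clause comes from the feature being described.

```python
# Hypothetical table and column names, for illustration only.
CREATE_CUSTOMER_DIM = """
CREATE TABLE IF NOT EXISTS customer_dim (
  customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id   STRING,
  customer_name STRING
) USING DELTA
"""

# The identity column is left out of the column list so that Delta
# generates its value for each inserted row.
INSERT_CUSTOMERS = """
INSERT INTO customer_dim (customer_id, customer_name)
VALUES ('C-001', 'Ada'), ('C-002', 'Grace')
"""


def load_customers(spark) -> None:
    """Create the table and insert two rows.

    `spark` is assumed to be an existing SparkSession on a runtime
    that supports Delta identity columns.
    """
    spark.sql(CREATE_CUSTOMER_DIM)
    spark.sql(INSERT_CUSTOMERS)
```

After running this, `customer_sk` should hold a unique value per row, though per the quote above the values are not guaranteed to be consecutive.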

This is a bit light on explanation, unfortunately. Generating identities in distributed systems has historically been tricky, especially when several independent nodes generate values, so I’d be curious to see how it works. Do they allocate blocks of IDs to worker nodes, or do something else? Are the IDs guaranteed to be monotonically increasing? Or is there some other service that “labels” the data upon insert and provides those IDs?
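To make the first of those guesses concrete, here is a small sketch of block allocation in plain Python. This is not a description of how Delta Lake works; it only illustrates one strategy a distributed system could use, and the `BlockAllocator` and `Worker` classes are hypothetical names invented for the example.

```python
import threading


class BlockAllocator:
    """Central coordinator that hands out disjoint, contiguous ID ranges."""

    def __init__(self, start: int = 1, block_size: int = 1000) -> None:
        self._next_start = start
        self._block_size = block_size
        self._lock = threading.Lock()

    def reserve(self) -> range:
        # Only reserving a block needs coordination; IDs inside a block
        # are handed out locally by the worker that owns it.
        with self._lock:
            block = range(self._next_start, self._next_start + self._block_size)
            self._next_start += self._block_size
        return block


class Worker:
    """Draws IDs from its current block and reserves a new one when empty."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self._allocator = allocator
        self._ids = iter(())  # no block reserved yet

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._allocator.reserve())
            return next(self._ids)


allocator = BlockAllocator(start=1, block_size=3)
a, b = Worker(allocator), Worker(allocator)

# Interleave requests from two workers, as concurrent writers might.
ids = [a.next_id(), b.next_id(), a.next_id(), b.next_id()]
print(ids)  # [1, 4, 2, 5]
```

Under this scheme every ID is unique, but the values are not monotonically increasing in arrival order, and any IDs left unused in a worker's block (3 and 6 here) become gaps. That would be consistent with the "may not be consecutive" caveat in the quote, though it does not confirm the mechanism.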