If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:
Why are you using XML to transmit this data?
Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:
- XML is most useful with an XSLT, a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
- Speaking of which, CSVs run a high risk of needing to use the separator as a native character. The problem is that there is no single right way to indicate that “That comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully indicates that you have a quoted delimiter here and helpfully removes it before barfing on the commas inside quotations. There is actually an ANSI standard character for separator which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process.
- That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values works well.
- And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
- Basically, this is why we can’t have nice things.