For file types that don’t contain there own metadata (CSV, Text etc) we typically have to go and figure out there structure including; attributes and data types before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools I’ve discovered in most case the Data Frame infer schema option basically does a better job here.
Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone
Read on for a comparison of the two techniques.