Spark Schema Inference in Production

Miles Cole shares some advice:

To show the impact I want to highlight a benchmark that included Fabric Spark on a single 19GB CSV input file (100M Contoso dataset, sales table) for the benchmark. While there were a number of issue with this benchmark that inadvertently make Spark appear to be slow, this is only focused on the impact of inferring schema and practical recommendations.

Read on to see a performance problem that schema inference brings up. I’d also want to mention the risk of data updates blowing up your well-laid plans as a risk. Schema inference is a double-edged sword: it can be convenient and open up new approaches to development, but can just as easily cause unexpected failures.

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28