Chen Hirsh deals with truly big data:
Excel is one of the most common data file formats, and, as data engineers, we are required to read data from it on almost every project. Excel is easy to use, and you can customize it quickly, like adding a column and changing data. But the same things that made it the go-to format for users, make it hard to read by Data platforms. Adding a column might break a pipeline, and changing datatypes, for example, adding text to a column that only held numeric data before, might cause a nasty error downstream.
Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started, working with Excel files on Databricks!
Click through for a way to do this using PySpark. H/T Madeira Data Solutions blog.
Comments closed