Press "Enter" to skip to content

Tips For Processing Large Data Sets With Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decide to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily a good idea to take a small value. The time spent across iterations can grow too long with a small chunksize. To find the best trade-off between memory usage and time, you can try different chunksize values and select the one that uses the least memory while being the fastest.
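As a rough sketch of the idea, here is how chunked aggregation with read_csv typically looks; the file path, column names, and chunk size below are illustrative assumptions, not taken from the original post:

```python
import pandas as pd

# Hypothetical file, columns, and chunk size for illustration only.
CSV_PATH = "events.csv"
CHUNK_SIZE = 40_000_000  # e.g. ~10% of a 400-million-row file

partial_sums = []

# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE):
    # Aggregate each chunk independently...
    partial_sums.append(chunk.groupby("user_id")["amount"].sum())

# ...then combine the per-chunk results into the final aggregate.
result = pd.concat(partial_sums).groupby(level=0).sum()
print(result.head())
```

Timing a few candidate chunk sizes with this pattern is a simple way to find the memory/time trade-off the author describes.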

Click through for more tips.