Tips For Processing Large Data Sets With Python

Kevin Feasel

2018-03-30

Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decided to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily a good idea to take a small value. The time between each iteration can be too long with a small chunksize. In order to find the best trade-off between memory usage and time, you can try different chunksize values and select the one that consumes the least memory and runs the fastest.
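The tip boils down to iterating over the file in pieces and combining partial aggregates rather than loading everything at once. Here is a minimal sketch of that pattern, assuming a hypothetical events.csv with category and amount columns; the file name, column names, and chunk size are all illustrative:

import pandas as pd

CSV_PATH = "events.csv"      # hypothetical large file
CHUNK_SIZE = 1_000_000       # tune this: too small means many slow iterations

totals = {}
# Passing chunksize to read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE):
    # Aggregate each chunk, then fold the partial result into the running totals.
    partial = chunk.groupby("category")["amount"].sum()
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + value

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())

Timing a few candidate chunk sizes against a sample of the file is the quickest way to find the memory-versus-time trade-off the author describes.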

Click through for more tips.

Related Posts

Building TensorFlow Neural Networks On Spark With Keras

Jules Damji has an example of using the PyCharm IDE and Keras to build TensorFlow neural network models with the Databricks MLflow library: Our example in the video is a simple Keras network, modified from Keras Model Examples, that creates a simple multi-layer binary classification model with a couple of hidden and dropout layers and […]

Read More

Executing ML Services Scripts From Jupyter Notebooks

Kyle Weller has an inception moment with Python and SQL Server Machine Learning Services: While this example is trivial with the Iris dataset, imagine the additional scale, performance, and security capabilities that you now unlocked. You can use any of the latest open source R/Python packages to build Deep Learning and AI applications on large […]

Read More
