Tips For Processing Large Data Sets With Python

Kevin Feasel

2018-03-30

Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decided to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily better to pick a small value. The time between iterations can become too long with a small chunksize. To find the best memory usage vs. time trade-off, you can try different chunksize values and select the one which consumes the least memory while being the fastest.
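
As a rough illustration of the idea, here is a minimal sketch of chunked aggregation with Pandas. The file name and the key_column / value_column names are placeholders, and the 40-million-row chunksize simply mirrors the figure from the post:

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
CSV_PATH = "large_file.csv"
CHUNKSIZE = 40_000_000  # roughly 10% of a 400-million-row file, as in the post

totals = {}

# Passing chunksize to read_csv returns an iterator of DataFrames
# instead of loading the entire file into memory at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNKSIZE):
    # Aggregate each chunk, then fold the partial results together.
    partial = chunk.groupby("key_column")["value_column"].sum()
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + value

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```

Smaller chunks lower peak memory but add per-iteration overhead, which is the memory-versus-time trade-off described above.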

Click through for more tips.

Related Posts

A Quick Keras Example

Shubham Dangare takes us through a quick example using Keras and TensorFlow in Python: Keras is a high-level neural networks API, written in Python and capable of running on top of Tensorflow, CNTK  or Theano. It was developed with a focus on enabling fast experimentation. In this blog, we are going to cover one small […]


ML Services and Injectable Code

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities: The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there. […]

