When to Start Using a Database with R or Python

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you, I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, at least I know how long it will take!), your boundary may lay at some other point. When you reach that point of annoyance or point of no longer being able to do your work. You should improve your workflow.

I will show you how to do some speedups by using other R packages, in python moving from pandas to polars, or leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons, one: it is super simple, two it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data and both languages offer plenty of ways to connect to one.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
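To make that concrete, here is a minimal sketch of writing the SQL yourself and letting the database do the aggregation, so only the summary rows come back to Python. It uses the standard library's `sqlite3` module with an in-memory database; the `sales` table, its columns, the sample rows, and the `total_sales_by_region` helper are all made up for illustration and are not from Roel's post.

```python
import sqlite3


def total_sales_by_region(conn):
    """Aggregate inside the database and return only the summary rows."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory database standing in for a real analytical store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5), ("south", 1.5)],
)
conn.commit()

totals = total_sales_by_region(conn)
print(totals)  # [('north', 12.5), ('south', 6.5)]
```

The same query text would work largely unchanged against most other SQL databases; in this sketch only the connection setup is specific to SQLite.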