Generating Fake Data

Rich Benner shows us how to use the Faker library in Python to generate test data:

There are far more options when using Faker. Looking at the official documentation you’ll see the list of different data types you can generate as well as options such as region specific data.

Go have fun trying this, it’s a small setup for a large amount of time saved.

These types of tools can be great for generating a bunch of data but come with a couple of risks. One is that in the fake addresses Rich shows, ZIP codes don’t match their states at all, so if your application needs valid combos, it can cause issues. The other problem comes from distributions: generated data often gets created off of a uniform distribution, so you might not find skewness-related problems (e.g., parameter sniffing issues) strictly in your test data.

That said, easily generating test data is powerful and I don’t want to let the good be the enemy of the great.

Related Posts

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions: One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how […]

Read More

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More


February 2019
« Jan Mar »