Ivan Palomares Carrascosa makes some data:
This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records or data instances, full datasets in one go, and export them into different formats. The code walkthrough adopts a twofold perspective:
- Learning: We will gain a basic understanding of several data types that can be generated and how to get them ready for further processing, aided by popular data-intensive libraries like Pandas.
- Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.
Click through for the article. I’m not intimately familiar with Faker, so I’m not sure how easy it is to change dataset distributions. That’s one of the challenges I tend to have with automated data generators: generating a simulated dataset is fine if you just need X number of rows, but if the synthetic data in development is distributed nothing like the real data in production, you may get a false sense of security about things like report response times.
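As a rough illustration of the distribution problem, here is a sketch that uses only Python's standard `random` module rather than any Faker feature. The customer count, the Zipf-like weights, and the log-normal parameters are all made-up assumptions standing in for whatever your production profile really looks like. It compares how much of the data lands on the ten busiest customers under uniform versus skewed generation.

```python
import random
from collections import Counter

rng = random.Random(7)  # local, seeded generator so the comparison is repeatable

# Hypothetical profile: 100 customers, with activity falling off as 1/rank.
CUSTOMER_IDS = list(range(1, 101))
WEIGHTS = [1 / rank for rank in CUSTOMER_IDS]


def uniform_rows(n: int) -> list[dict]:
    """Every customer equally likely, amounts spread evenly: the naive default."""
    return [
        {"customer_id": rng.choice(CUSTOMER_IDS), "amount": round(rng.uniform(1, 500), 2)}
        for _ in range(n)
    ]


def skewed_rows(n: int) -> list[dict]:
    """A few heavy customers and a long tail of mostly small, occasionally large amounts."""
    customers = rng.choices(CUSTOMER_IDS, weights=WEIGHTS, k=n)
    return [
        {"customer_id": c, "amount": round(rng.lognormvariate(3.0, 1.0), 2)}
        for c in customers
    ]


def top_share(rows: list[dict], top: int = 10) -> float:
    """Fraction of all rows that belong to the `top` most frequent customers."""
    counts = Counter(row["customer_id"] for row in rows)
    return sum(count for _, count in counts.most_common(top)) / len(rows)


uniform = uniform_rows(10_000)
skewed = skewed_rows(10_000)
print(f"top-10 customer share, uniform: {top_share(uniform):.2f}")
print(f"top-10 customer share, skewed:  {top_share(skewed):.2f}")
```

A query filtering on one of the heavy customers touches far more rows in the skewed set than in the uniform one, which is exactly the kind of difference that makes development timings misleading.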