Press "Enter" to skip to content

Category: Data

Interesting Data is Usually Wrong

Mike Cisneros breaks the bad news:

Tony Twyman made his name as a pioneer in the field of audience research for television and radio in the UK. For our discussion today, though, he’s best remembered for a single, enduring quotation, which is now known as Twyman’s Law:

“Any figure that looks interesting or different is usually wrong.”

Read on for a good example of how the hunt for an interesting story turned into something resolutely normal after fixing a pair of data issues.


Thoughts on “Real-Time Decisions”

Steve Jones is skeptical:

To be fair, humans might do the same thing and over-react, but mostly we become hesitant with unexpected news. That slowness can be an asset. We often need time to think and come to a decision. Lots of our decisions aren’t always based on hard facts, and a lot of business isn’t necessarily fact driven either. We often put our thumb on the scales when making decisions because there isn’t a clear path based on just data.

Steve’s thrust is about AI but I want to riff on “real-time” in general. First, my standard rant: “real-time” has a specific meaning that people have abused over the years. Fighter pilots need real-time systems. The rest of it is “online.” For a hint as to the difference: if you’re okay waiting 100ms for a response due to network delays or whatever else, that’s not real-time.

Standard rant aside, "I need to see real-time data" is a common demand in data warehousing projects. I worked on a warehouse once where the business wanted up-to-the-minute data, but our incoming cost and revenue data sources refreshed once a day per customer, and intraday information was sketchy enough that it didn't make sense to store in a warehouse. And when you probe people on how often they'll actually look at the data, it turns out that hourly or daily loads make more sense given the review cadence.

The question to ask is: how big is your OODA loop, and is additional information really the limiting factor? Sometimes the answer is yes, but usually there are other factors preventing action.


Generating Exponential Random Numbers in T-SQL

Sebastiao Pereira generates more artificial data:

Generating random numbers from an exponential distribution is essential for queuing theory, reliability engineering, physics, finance modeling, failure analysis, Poisson processes, simulation and Monte Carlo methods, computer graphics, and games. Is it possible to have a random exponential number generator in SQL Server without the use of external tools?

As always, I love this series because these examples are complex enough to be non-trivial, yet they perform well enough to work in real-world environments.
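
The standard trick behind this kind of generator is inverse transform sampling: if U is uniform on (0, 1), then -ln(U) / lambda follows an exponential distribution with rate lambda. I won't reproduce Sebastiao's T-SQL here, but as a minimal sketch of the underlying idea (in Python, with an arbitrary rate and sample count):

    import math
    import random

    def exponential_sample(lam: float, rng: random.Random) -> float:
        """Draw one Exponential(rate=lam) variate via inverse transform sampling."""
        u = rng.random()           # uniform in [0.0, 1.0)
        while u == 0.0:            # avoid log(0); vanishingly rare but possible
            u = rng.random()
        return -math.log(u) / lam  # if U ~ Uniform(0,1), then -ln(U)/lam ~ Exponential(lam)

    rng = random.Random(42)        # fixed seed for repeatability
    lam = 0.5                      # rate parameter; the mean is 1 / lam = 2.0
    samples = [exponential_sample(lam, rng) for _ in range(10_000)]
    print(sum(samples) / len(samples))   # should land close to 2.0

The same two steps (draw a uniform value, apply the inverse CDF) are what a T-SQL implementation has to express with RAND or CRYPT_GEN_RANDOM plus LOG.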


DBAs and Data Access

Brent Ozar wraps up a survey:

Last week, I asked if your database administrators could read all of the data in all databases. The results (which may be different from this post, because I’m writing the post ahead of time and the poll is still open):

In a lot of cases, this doesn’t really matter much. In places where it does matter (for example, reading protected health information or critical financial data), there should be controls in place. I’ve always been on the side of this issue that says yes, at the end of the day you do need to be able to trust your administrators, because somebody will need a way to get to that data in a company emergency. But as a company grows and there are more opportunities for division of labor and specialization, you open up the possibility of stronger controls, proper auditing, limiting certain data access to privileged accounts, and consequences for violating the rules.


Generating Synthetic Data in Python

Ivan Palomares Carrascosa makes some data:

This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records or data instances, full datasets in one go, and export them into different formats. The code walkthrough adopts a twofold perspective:

  1. Learning: We will gain a basic understanding of several data types that can be generated and how to get them ready for further processing, aided by popular data-intensive libraries like Pandas
  2. Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.

Click through for the article. I’m not intimately familiar with Faker, so I’m not sure how easy it is to change dataset distributions. That’s one of the challenges I tend to have with automated data generators: generating a simulated dataset is fine if you just need X number of rows, but if the distribution of the synthetic data in development is nowhere near the distribution of the real data in production, you may get a false sense of security in things like report response times.
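
For what it's worth, a basic Faker loop looks something like the sketch below. The column choices, and the idea of layering a skewed amount distribution on top with random.expovariate rather than relying on Faker's roughly uniform numeric helpers, are my own illustration rather than anything from the article:

    import random

    import pandas as pd
    from faker import Faker    # pip install faker

    Faker.seed(42)             # make the generated values repeatable
    random.seed(42)
    fake = Faker()

    rows = []
    for _ in range(1_000):
        rows.append({
            "customer": fake.name(),
            "email": fake.email(),
            "purchase_date": fake.date_this_year(),
            # Heavy-tailed amounts (mean ~50) so the shape is closer to
            # real transactional data than a uniform draw would be.
            "amount": round(random.expovariate(1 / 50.0), 2),
        })

    df = pd.DataFrame(rows)
    print(df["amount"].describe())    # sanity-check the simulated distribution

Swapping in a distribution that resembles production is usually the cheap part; knowing what production actually looks like is the hard part.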


Comparing Storage Options in PostgreSQL

Hans-Jürgen Schönig compares data sizes:

In this case study, we’ll delve into each of PostgreSQL’s main storage options, their characteristics, and the factors that influence their choice, enabling you to make informed decisions about your database’s storage strategy. You will also learn how you can archive data in a hybrid environment for long term storage. 

Click through for a comparison of two common file formats, plus two PostgreSQL-specific mechanisms for data storage. The comparison here is mostly one of final file size, though common query performance would be interesting to include as well, especially because the columnar options (Parquet and Citus-based columnstore) have a very different performance profile from row-store data.
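
The article measures this from inside PostgreSQL; purely as a rough way to see the same size effect outside the database, here is a small Python sketch that writes one synthetic, low-cardinality table as CSV and as Parquet and compares the on-disk sizes (pandas plus pyarrow assumed; the table shape is an arbitrary choice):

    import os

    import numpy as np
    import pandas as pd       # pip install pandas pyarrow

    # Low-cardinality columns compress very well in a columnar format.
    rng = np.random.default_rng(42)
    n = 1_000_000
    df = pd.DataFrame({
        "region": rng.choice(["north", "south", "east", "west"], size=n),
        "status": rng.choice(["open", "closed"], size=n),
        "amount": rng.exponential(scale=50.0, size=n).round(2),
    })

    df.to_csv("sample.csv", index=False)
    df.to_parquet("sample.parquet", index=False)    # uses pyarrow under the hood

    for path in ("sample.csv", "sample.parquet"):
        print(path, os.path.getsize(path) // 1024, "KiB")

The gap you see will depend heavily on cardinality and repetition in the data, which is exactly why a size-only comparison deserves a query-performance follow-up.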


Costs of Over-Instrumentation

Lenard Lim shares a warning:

If you’ve ever opened a product analytics dashboard and scrolled past dozens of unlabeled metrics, charts with no viewers, and events no one can explain—welcome to the world of metric sprawl.

In my roles at a MAANG company and a remittance fintech, I’ve seen product teams obsessed with instrumenting everything: every click, every scroll, every hover, every field. The thinking is, “Better to have it and not need it than need it and not have it.”

But there’s a hidden cost to this mindset. And it’s time we talk about it.

I personally tend toward wanting as much information as possible, though Lenard makes good points about the friction that adds, as well as the potential degradation of the user experience.
