Press "Enter" to skip to content

Category: Data

DBAs and Data Access

Brent Ozar wraps up a survey:

Last week, I asked if your database administrators could read all of the data in all databases. The results (which may be different from this post, because I’m writing the post ahead of time and the poll is still open):

In a lot of cases, this doesn't really matter much. In places where it does matter (for example, reading protected health information or critical financial data), there should be controls in place. I've always been on the side of this issue that says yes, at the end of the day you do need to be able to trust your administrators, because somebody will need a way to get to that data in a company emergency. But as a company grows and there are additional opportunities for division of labor and specialization, you open up the possibility of stronger controls, proper auditing, limiting certain data access to privileged accounts, and consequences for violating the rules.

Generating Synthetic Data in Python

Ivan Palomares Carrascosa makes some data:

This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records or data instances, full datasets in one go, and export them into different formats. The code walkthrough adopts a twofold perspective:

  1. Learning: We will gain a basic understanding of several data types that can be generated and how to get them ready for further processing, aided by popular data-intensive libraries like Pandas.
  2. Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.

Click through for the article. I'm not intimately familiar with Faker, so I'm not sure how easy it is to change dataset distributions. That's one of the challenges I tend to have with automated data generators: generating a simulated dataset is fine if you just need X number of rows, but if the distribution of your synthetic data in development is nowhere near the real data's distribution in production, you may get a false sense of security in things like report response times.
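
If you want to kick the tires before clicking through, here is a minimal sketch of the basic Faker pattern. This is my own example rather than Ivan's code, the column names are made up, and the last step shows one way to tackle the distribution concern: overwrite a numeric column with draws from a distribution you choose instead of Faker's roughly uniform values.

```python
from faker import Faker
import numpy as np
import pandas as pd

fake = Faker()
Faker.seed(42)  # make the synthetic data reproducible

# Generate a batch of fake transaction records (column names are invented).
rows = [
    {
        "customer": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "txn_time": fake.date_time_between(start_date="-1y", end_date="now"),
        "amount": fake.pyfloat(right_digits=2, min_value=1, max_value=500),
    }
    for _ in range(10_000)
]
df = pd.DataFrame(rows)

# Faker's numbers are roughly uniform; to mimic a production-like skew,
# replace the column with draws from a distribution of your choosing.
rng = np.random.default_rng(42)
df["amount"] = rng.lognormal(mean=3.0, sigma=1.0, size=len(df)).round(2)
print(df["amount"].describe())
```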

Comparing Storage Options in PostgreSQL

Hans-Jürgen Schönig compares data sizes:

In this case study, we’ll delve into each of PostgreSQL’s main storage options, their characteristics, and the factors that influence their choice, enabling you to make informed decisions about your database’s storage strategy. You will also learn how you can archive data in a hybrid environment for long-term storage.

Click through for a comparison of two common file formats, plus two PostgreSQL-specific mechanisms for data storage. The comparison here is mostly one of final file size, though common query performance would be interesting to include as well, especially because the columnar data file types (Parquet and Citus-based columnstore) have a very different performance profile versus row-store data.
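
If you want to replicate the file-size half of that comparison on your own data, pandas plus pyarrow makes it a few lines. This is a rough sketch of mine, not the article's benchmark; the sample dataset and file names are made up.

```python
import os

import numpy as np
import pandas as pd

# Build a sample dataset: a million rows of timestamped sensor-style readings.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=1_000_000, freq="s"),
    "sensor_id": rng.integers(0, 100, size=1_000_000),
    "reading": rng.normal(20.0, 5.0, size=1_000_000),
})

# Write the same data row-oriented (CSV) and columnar (Parquet via pyarrow).
df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", compression="zstd")

for path in ("sample.csv", "sample.parquet"):
    print(f"{path}: {os.path.getsize(path) / 1024**2:.1f} MiB")
```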

Costs of Over-Instrumentation

Lenard Lim shares a warning:

If you’ve ever opened a product analytics dashboard and scrolled past dozens of unlabeled metrics, charts with no viewers, and events no one can explain—welcome to the world of metric sprawl.

In my roles at a MAANG company and a remittance fintech, I’ve seen product teams obsessed with instrumenting everything: every click, every scroll, every hover, every field. The thinking is, “Better to have it and not need it than need it and not have it.”

But there’s a hidden cost to this mindset. And it’s time we talk about it.

I personally tend toward wanting as much information as possible, though Lenard makes good points about the friction this adds, as well as the potential degradation of the user experience.

Handling Large Delete Operations in TimescaleDB

Semab Tariq deletes a significant amount of data:

In today’s blog, we will discuss another crucial aspect of time-series data management: massive delete operations.

As your data grows over time, older records often lose their relevance but continue to occupy valuable disk space, potentially increasing storage costs and degrading performance if not managed well.

Let’s walk through some strategies to clean up or downsample aged data in TimescaleDB, helping you maintain a lean, efficient, and cost-effective database.

The “or downsample” is huge, by the way: as a simple example, suppose you collect one record every millisecond, or 1000 per second. Say that we have a date+time and a few floating point numbers that add up to 40 bytes per record. If we have a year of data at that grain, we have 40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 365.25 days/year, or 1,262,304,000,000 bytes/year. That’s ~1.15 terabytes of data per year, assuming no compression (which there actually is, but whatever). By contrast, if you keep millisecond-level data for a week, second-level for 3 weeks, and minute-level for the remaining year, you have:

40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 1 week = 22.53 gigabytes
40 bytes/record * 1 record/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 3 weeks = 69 megabytes
40 bytes/record * 1 record/minute * 60 minutes/hour * 24 hours/day * 337.25 days = 18.5 megabytes

And for most cases, we only need the lowest level of granularity for a relatively short amount of time. After that, we typically care more about how the current data looks versus older data, for the purposes of trending.
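
The arithmetic is easy to replay, and a few lines of Python confirm the figures above (reported in binary units, which is where the ~1.15 terabyte number comes from):

```python
BYTES_PER_RECORD = 40
DAY = 86_400  # seconds per day

def tier(records_per_second: float, days: float) -> float:
    """Bytes needed to keep data at a given grain for a given span."""
    return BYTES_PER_RECORD * records_per_second * DAY * days

raw_year = tier(1000, 365.25)   # millisecond grain kept for a full year
downsampled = (
    tier(1000, 7)               # millisecond grain for 1 week   (~22.53 GiB)
    + tier(1, 21)               # second grain for 3 weeks       (~69 MiB)
    + tier(1 / 60, 337.25)      # minute grain for the rest      (~18.5 MiB)
)
print(f"raw:         {raw_year / 1024**4:.2f} TiB")     # ~1.15 TiB
print(f"downsampled: {downsampled / 1024**3:.2f} GiB")  # ~22.6 GiB
```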

Backfilling Data in TimescaleDB

Semab Tariq takes us through a problem:

Backfilling data into a TimescaleDB hypertable in production can be very tricky, especially when automated processes like compression policies are involved. From past experience, we have seen that if backfill operations aren’t handled properly, they can interfere with these automated tasks, sometimes causing them to stop working altogether. 

This blog covers a safer and more reliable approach to backfilling hypertables, along with best practices to prevent disruptions to compression and other background processes.

Read on for several tips. Backfills can be challenging in any database, but time-series databases like TimescaleDB introduce their own unique issues.
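
For a sense of the moving parts, here is a rough sketch of one common pattern: pause the compression policy, decompress the chunks that overlap the backfill window, insert, and resume. This is not necessarily Semab's exact procedure, and the connection string, hypertable name, and date window are placeholders.

```python
import psycopg2

conn = psycopg2.connect("dbname=metrics_db")  # placeholder connection
conn.autocommit = True
cur = conn.cursor()

# 1. Pause the compression policy so it does not fight the backfill.
cur.execute("""
    SELECT alter_job(job_id, scheduled => false)
    FROM timescaledb_information.jobs
    WHERE proc_name = 'policy_compression'
      AND hypertable_name = 'metrics';
""")

# 2. Decompress any chunks that overlap the backfill window.
cur.execute("""
    SELECT decompress_chunk(c, if_compressed => true)
    FROM show_chunks('metrics',
                     newer_than => '2024-01-01'::timestamptz,
                     older_than => '2024-02-01'::timestamptz) AS c;
""")

# 3. Run the backfill INSERTs for that window here.

# 4. Resume the compression policy so the chunks get recompressed.
cur.execute("""
    SELECT alter_job(job_id, scheduled => true)
    FROM timescaledb_information.jobs
    WHERE proc_name = 'policy_compression'
      AND hypertable_name = 'metrics';
""")
```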

Data Recovery in SQL Server without a Backup

Rodrigo Ribeiro Gomes digs in:

In more than 10 years of experience, I have dealt with cases where someone has performed incorrect operations on a table, such as updating or deleting the wrong rows, in a SQL Server database that does not have full backups available. There are multiple reasons for having no full backup: corrupted backups, restores taking too much time, etc.

In this post, I want to show an alternative for these cases, an ace up one’s sleeve, that you can use to recover data. This method also provides a deep understanding of the internal workings of how your SQL Server stores data. So, in addition to learning how to recover data, you will gain more insights into the internals of SQL.

Read on to see how. Rodrigo also points out some limitations or things that would need to change if you have index compression. I consider this a very neat thing you might need to know but never want to use.

Counting NULLs in SQL Server Tables

Vlad Drumea counts that which does not exist:

I’ve seen variations of the question “how to count all NULLs in all columns of a table” pop up on reddit once every couple of months, and I figured I’d give it a shot and post here in case anyone else might need it in the future.

Click through for the script. It does require M*N scans, where M represents the average number of columns in each table and N the number of tables in the database. In other words, don’t expect an immediate response.
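
Vlad's script is T-SQL behind the link; as an illustration of the general shape (not his code), here is a Python sketch that builds the counting queries from the catalog views. It folds all of a table's nullable columns into a single scan, so this variant costs one scan per table; the connection string is a placeholder.

```python
from collections import defaultdict

import pyodbc

conn = pyodbc.connect(  # placeholder connection string
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=.;DATABASE=MyDb;"
    "Trusted_Connection=yes;TrustServerCertificate=yes;"
)
cur = conn.cursor()

# Enumerate every nullable column in every user table.
cur.execute("""
    SELECT s.name, t.name, c.name
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    JOIN sys.columns c ON c.object_id = t.object_id
    WHERE c.is_nullable = 1
    ORDER BY s.name, t.name, c.column_id;
""")
columns_by_table = defaultdict(list)
for schema, table, column in cur.fetchall():
    columns_by_table[(schema, table)].append(column)

# One scan per table: SUM(CASE WHEN col IS NULL ...) for all columns at once.
for (schema, table), cols in columns_by_table.items():
    sums = ", ".join(
        f"SUM(CASE WHEN [{c}] IS NULL THEN 1 ELSE 0 END)" for c in cols
    )
    row = cur.execute(f"SELECT {sums} FROM [{schema}].[{table}]").fetchone()
    for col, null_count in zip(cols, row):
        print(f"{schema}.{table}.{col}: {null_count} NULLs")
```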

Data Conversion via Generative AI

Grant Fritchey rearranges some data:

The DM-32 is a Digital Mobile Radio (DMR) as well as an analog radio. You can follow the link to understand all that DMR represents when talking radios. I want to focus on the fact that you have to program the behaviors into a DMR radio. While the end result is identical for every DMR radio, how you get there, the programming software, is radically different for every single radio (unless you get a radio that supports open source OpenGD77, yeah, playing radio involves open source as well).

Which means, if I have more than one DMR radio (I’m currently at 7, and no, I don’t have a problem, shut up) I have more than one Customer Programming Software (CPS) that is completely different from other CPS formats. Now, I like to set up my radios similarly. After all, the local repeaters, my hotspot, and the Talkgroups I want to use are all common. Since every CPS is different, you can’t just export from one and import to the next. However, I had the idea of using AI for data conversion. Let’s see how that works.

Click through for the scenario as well as Grant’s results, which were pretty successful for a data mapping operation. The choice of model and the simplicity of the input and output examples matter a great deal in generating working Python code.
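
For a flavor of what such a generated converter tends to look like, here is a minimal sketch. The column names and file names are invented for illustration; Grant's actual CPS formats will differ, but the heart of it is a field-mapping table plus a CSV rewrite.

```python
import csv

# Hypothetical mapping from one CPS export's column names to another's.
FIELD_MAP = {
    "Channel Name": "Alias",
    "RX Frequency": "Receive Freq",
    "TX Frequency": "Transmit Freq",
    "Talkgroup": "Contact",
    "Color Code": "CC",
}

def convert(src_path: str, dst_path: str) -> None:
    """Read one CPS's CSV export and write it in the other CPS's layout."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(FIELD_MAP.values()))
        writer.writeheader()
        for row in reader:
            writer.writerow({dst_col: row.get(src_col, "")
                             for src_col, dst_col in FIELD_MAP.items()})

convert("dm32_export.csv", "other_radio_import.csv")
```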
