Press "Enter" to skip to content

Category: Data

Handling Large Delete Operations in TimescaleDB

Semab Tariq deletes a significant amount of data:

In today’s blog, we will discuss another crucial aspect of time-series data management: massive delete operations.

As your data grows over time, older records often lose their relevance but continue to occupy valuable disk space, potentially increasing storage costs and degrading performance if not managed well.

Let’s walk through some strategies to clean up or downsample aged data in TimescaleDB, helping you maintain a lean, efficient, and cost-effective database.

The “or downsample” is huge, by the way: as a simple example, suppose you collect one record every millisecond, or 1000 per second. Say that we have a date+time and a few floating point numbers that add up to 40 bytes per record. If we have a year of data at that grain, we have 40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 365.25 days/year, or 1,262,304,000,000 bytes/year. That’s ~1.15 terabytes of data per year, assuming no compression (which there actually is, but whatever). By contrast, if you keep millisecond-level data for a week, second-level for 3 weeks, and minute-level for the remaining year, you have:

40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 1 week = 22.53 gigabytes
40 bytes/record * 1 record/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 3 weeks = 69 megabytes
40 bytes/record * 1 record/minute * 60 minutes/hour * 24 hours/day * 337.25 days = 18.5 megabytes

And in most cases, we only need the finest level of granularity for a relatively short amount of time. After that, we typically care more about how current data compares to older data for the purposes of trending.
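
If you want to check the arithmetic yourself, here is a quick back-of-the-envelope sketch in Python (assuming 40-byte records and binary, 1024-based units, as above):

RECORD_BYTES = 40
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4
DAY = 24 * 3600  # seconds per day

def raw_bytes(records_per_second, seconds):
    # Raw storage estimate: record size times record count, ignoring compression.
    return RECORD_BYTES * records_per_second * seconds

ms_week = raw_bytes(1000, 7 * DAY)          # millisecond grain, 1 week
sec_three_weeks = raw_bytes(1, 21 * DAY)    # second grain, 3 weeks
min_rest = raw_bytes(1 / 60, 337.25 * DAY)  # minute grain, remaining 337.25 days
full_year = raw_bytes(1000, 365.25 * DAY)   # millisecond grain, full year

print(f"millisecond grain, 1 week:  {ms_week / GIB:.2f} GiB")          # ~22.53 GiB
print(f"second grain, 3 weeks:      {sec_three_weeks / MIB:.0f} MiB")  # ~69 MiB
print(f"minute grain, 337.25 days:  {min_rest / MIB:.1f} MiB")         # ~18.5 MiB
print(f"millisecond grain, 1 year:  {full_year / TIB:.2f} TiB")        # ~1.15 TiB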

Backfilling Data in TimescaleDB

Semab Tariq takes us through a problem:

Backfilling data into a TimescaleDB hypertable in production can be very tricky, especially when automated processes like compression policies are involved. From past experience, we have seen that if backfill operations aren’t handled properly, they can interfere with these automated tasks, sometimes causing them to stop working altogether. 

This blog covers a safer and more reliable approach to backfilling hypertables, along with best practices to prevent disruptions to compression and other background processes.

Read on for several tips. Backfills can be challenging in any database, but time-series databases like TimescaleDB introduce their own unique issues.
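
Semab's post has the full procedure; purely as a rough sketch of the moving parts involved (not necessarily the exact steps from the post), something like the following pauses the compression policy, decompresses the chunks that overlap the backfill window, inserts the rows, and then recompresses. The hypertable name (sensor_data), column names, dates, and connection string are all hypothetical:

import psycopg2

conn = psycopg2.connect("dbname=metrics")  # hypothetical connection string
conn.autocommit = True
cur = conn.cursor()

# 1. Pause the compression policy job for the hypertable.
cur.execute("""
    SELECT job_id FROM timescaledb_information.jobs
    WHERE proc_name = 'policy_compression' AND hypertable_name = 'sensor_data'
""")
job_id = cur.fetchone()[0]
cur.execute("SELECT alter_job(%s, scheduled => false)", (job_id,))

# 2. Decompress the chunks that overlap the backfill window.
start, end = "2024-01-01", "2024-01-08"
cur.execute("""
    SELECT decompress_chunk(c, if_compressed => true)
    FROM show_chunks('sensor_data',
                     newer_than => %s::timestamptz,
                     older_than => %s::timestamptz) AS c
""", (start, end))

# 3. Insert the backfill rows (COPY or batched INSERTs in practice).
cur.execute("""
    INSERT INTO sensor_data (ts, device_id, reading)
    VALUES (%s, %s, %s)
""", ("2024-01-03 12:00:00+00", 42, 18.7))

# 4. Recompress those chunks and turn the policy back on.
cur.execute("""
    SELECT compress_chunk(c, if_not_compressed => true)
    FROM show_chunks('sensor_data',
                     newer_than => %s::timestamptz,
                     older_than => %s::timestamptz) AS c
""", (start, end))
cur.execute("SELECT alter_job(%s, scheduled => true)", (job_id,))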

Data Recovery in SQL Server without a Backup

Rodrigo Riberio Gomes digs in:

In more than 10 years of experience, I have dealt with cases where someone has performed incorrect operations on a table, such as updating or deleting wrong rows, in a SQL Server database that does not have full backups available. There are multiple reasons for no full backup: corrupted backups, taking too much time to restore, etc.

In this post, I want to show an alternative for these cases, an ace up one’s sleeve, that you can use to recover data. This method also provides a deep understanding of the internal workings of how your SQL Server stores data. So, in addition to learning how to recover data, you will gain more insights into the internals of SQL.

Read on to see how. Rodrigo also points out some limitations or things that would need to change if you have index compression. I consider this a very neat thing you might need to know but never want to use.

Counting NULLs in SQL Server Tables

Vlad Drumea counts that which does not exist:

I’ve seen variations of the question “how to count all NULLs in all columns of a table” pop up on reddit once every couple of months, and I figured I’d give it a shot and post here in case anyone else might need it in the future.

Click through for the script. It does require M*N scans, where M represents the average number of columns in each table and N the number of tables in the database. In other words, don’t expect an immediate response.
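
This isn't Vlad's script, but to make the M*N point concrete, a sketch along these lines issues one COUNT per nullable column, which is why the work scales with both the number of tables and the number of columns. The connection string and database name are hypothetical:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDatabase;Trusted_Connection=yes;TrustServerCertificate=yes"
)
cur = conn.cursor()

# Enumerate every nullable column in every user table.
cur.execute("""
    SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS AS c
    JOIN INFORMATION_SCHEMA.TABLES AS t
      ON t.TABLE_SCHEMA = c.TABLE_SCHEMA AND t.TABLE_NAME = c.TABLE_NAME
    WHERE t.TABLE_TYPE = 'BASE TABLE' AND c.IS_NULLABLE = 'YES'
""")
columns = cur.fetchall()

for schema, table, column in columns:
    # One pass over the table per column -- hence roughly M*N scans overall.
    cur.execute(
        f"SELECT COUNT(*) FROM [{schema}].[{table}] WHERE [{column}] IS NULL"
    )
    null_count = cur.fetchone()[0]
    if null_count:
        print(f"{schema}.{table}.{column}: {null_count} NULLs")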

Data Conversion via Generative AI

Grant Fritchey rearranges some data:

The DM-32 is a Digital Mobile Radio (DMR) as well as an analog radio. You can follow the link to understand all that DMR represents when talking radios. I want to focus on the fact that you have to program the behaviors into a DMR radio. While the end result is identical for every DMR radio, how you get there, the programming software, is radically different for every single radio (unless you get a radio that supports open source OpenGD77, yeah, playing radio involves open source as well). Which means, if I have more than one DMR radio (I’m currently at 7, and no, I don’t have a problem, shut up) I have more than one Customer Programming Software (CPS) that is completely different from other CPS formats. Now, I like to set up my radios similarly. After all, the local repeaters, my hotspot, and the Talkgroups I want to use are all common. Since every CPS is different, you can’t just export from one and import to the next. However, I had the idea of using AI for data conversion. Let’s see how that works.

Click through for the scenario as well as Grant’s results, which were pretty successful for a data mapping operation, though the choice of model and the simplicity of the input and output examples matter when it comes to generating the Python code.

What’s New with OneLake Shortcuts

Miquella de Boer gives us an update:

Microsoft Fabric shortcuts enable organizations to unify their data across various domains and clouds by creating a single virtual data lake. These shortcuts act as symbolic links to data in different storage locations, simplifying access and reducing the need for multiple copies.

OneLake serves as the central hub for all analytics data. By using OneLake shortcuts, organizations can connect to existing data sources like Azure and AWS through a unified namespace, streamlining workflows and enhancing collaboration.

Click through for several feature improvements for shortcuts.

Time Range Generation in Data Diluvium

Adron Hall extends Data Diluvium:

Following up on my previous posts about adding humidity and temperature data generation to Data Diluvium, I’m now adding a Time Range generator. I decided this would be a nice addition to give any graphing of the data a good look. This will complete the trio of generators I needed for my TimeScale DB setup. While humidity and temperature provide the environmental data, the Time Range generator ensures we have properly spaced time points for our time-series analysis.

Click through to see how it works.
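
Data Diluvium's own generator lives in the linked post; as a minimal illustration of the underlying idea (evenly spaced time points across a range), a sketch like this will do, with the interval and date range made up:

from datetime import datetime, timedelta

def time_range(start: datetime, end: datetime, step: timedelta):
    # Yield timestamps from start (inclusive) to end (exclusive), step apart.
    current = start
    while current < end:
        yield current
        current += step

# Example: one point per minute across a single day.
points = list(time_range(datetime(2025, 1, 1), datetime(2025, 1, 2), timedelta(minutes=1)))
print(len(points), points[0], points[-1])  # 1440 evenly spaced points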

Generating Realistic Data with Data Diluvium

Adron Hall wants to generate some realistic time-series data. First up is humidity data:

When working with TimeScale DB for time-series data, having realistic environmental data is crucial. I’ve found that humidity is a particularly important parameter that affects everything from agriculture to HVAC systems. 

Then comes temperature data:

Following up on my previous post about adding humidity data generation to Data Diluvium, I’m now adding temperature data generation. This completes the pair of environmental data generators I needed for my TimeScale DB setup. Temperature data is crucial for time-series analysis and works perfectly alongside the humidity data we just implemented.

Check out both of these for detailed coverage of how to generate realistic-looking simulated data.
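
Neither post's exact code is reproduced here, but one common way to fake plausible environmental readings is a random walk clamped to a sensible physical range. A minimal sketch, with the starting values, bounds, and step sizes made up:

import random

def bounded_walk(start, low, high, step, n):
    # Generate n values that drift randomly but stay within [low, high].
    values, current = [], start
    for _ in range(n):
        current = min(high, max(low, current + random.uniform(-step, step)))
        values.append(round(current, 2))
    return values

humidity = bounded_walk(start=55.0, low=0.0, high=100.0, step=1.5, n=10)      # percent RH
temperature = bounded_walk(start=21.0, low=-10.0, high=40.0, step=0.4, n=10)  # degrees C
print(humidity)
print(temperature)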

The Design of DataDiluvium

Adron Hall wraps up a series on DataDiluvium. First up is some commentary on data generation:

In Parts 1 and 2, I set up the development environment and implemented the schema parsing functionality. Now, I’ll explore the data generation system and final implementation details that make DataDiluvium a complete solution.

And finally we have some additional notes and wrapup:

In my previous three posts, I covered the core functionality of DataDiluvium. In this follow-up post, I’ll explore the additional features, utilities, and implementation details that I’ve added to enhance the application’s functionality and developer experience.

Check out both posts. I do enjoy seeing people walk through and explain some of the key concepts and decisions they’ve made when developing solutions.

An Overview of DataDiluvium

Adron Hall has a new tool and a new blog series. The first post is a product overview:

DataDiluvium is a web-based tool available at datadiluvium.com that helps developers, database administrators, and data engineers generate realistic test data from SQL schema definitions. Whether you’re setting up a development environment, creating test scenarios, or preparing data for demonstrations, DataDiluvium streamlines the process of data generation.

The second covers some of the development precepts Adron used:

DataDiluvium is a web-based tool I’ve built designed to help developers, database administrators, and data engineers generate realistic test data based on SQL schema definitions. The tool takes SQL table definitions as input and produces sample data in various formats, making it easier to populate development and testing environments with meaningful data.

The tool is free, so if you’re looking for a sample data generator, check it out.
