Press "Enter" to skip to content

Category: Dates and Numbers

Power BI Data Type Optimization

Nikola Ilic shows how important it can be to choose the right data types:

For demo purposes, I’ll be using a fact table that contains the data about chats performed by a customer support department of the fictitious company Customer First. This table includes approximately 9 million rows, which is not considered a large table in the context of Power BI and analytical workloads. For the sake of simplicity, let’s pretend that our model consists of only this single table. Finally, a semantic model is configured as an Import mode model. If you want to learn how your data is stored in Power BI, I suggest you start by reading this article first.

Data was loaded into Power BI from the underlying data source (SQL Server database) as-is, without any additional optimizations applied.

Nikola walks through the process of finding the most expensive columns in terms of data size and using the least precise acceptable value. One other thing that I commonly see is identity columns or other keys on fact tables. Those are very rarely necessary, because the point of a fact table is typically to aggregate it in some fashion. And these keys are unique (by design), meaning they won’t compress very well and will take up a lot of space. Looking at Nikola’s example, my next question would be, knowing that the name of the table is factChat, does chatID tie to some chat dimension? If not, is it actually necessary for reporting? Again, if not, that could shave off another 60 MB or so from the data model.

Comments closed

Avoid Mixing DATETIME with other Date Types

Paul White shares some advice:

Microsoft encourages us not to use the datetime data type: 

Avoid using datetime for new work. Instead, use the time, date, datetime2, and datetimeoffset data types. These types align with the SQL Standard, and are more portable. time, datetime2 and datetimeoffset provide more seconds precision. datetimeoffset provides time zone support for globally deployed applications.

Well, ok. Sensible and well-informed people might still choose to use datetime for performance reasons. Common date and time functions have optimised implementations in the SQL Server expression service for the datetime and smalldatetime data types.

Paul has posted the full article on X.

Comments closed

Rounding Options in T-SQL

Rick Dobson talks rounding:

Please compare the SQL Server round function to banker’s rounding in T-SQL for converting decimal values to integer values. I seek a framework for assessing how closely banker’s rounding results versus SQL Server Round function results match the underlying decimal values. Please provide a couple of empirical comparisons with the framework to indicate which set of rounded values are closer to the underlying decimals and by how much.

Rick talks about what banker’s rounding is and shows how its results adhere more closely to the underlying distribution. Rick does show a user-defined function that generates a rounded number, but if you’re doing this with large enough amounts of data, using CLR and the System.Math.Round() function will likely give you better performance. Incidentally, this is also why if you write T-SQL code to round decimal numbers and .NET code to round numbers, your results may be a little different: T-SQL rounds to the nearest integer, whereas .NET uses banker’s rounding by default.

Comments closed

Translating Numbers to Words in T-SQL

Sebastiao Pereira needs no number:

Sometimes, it is necessary to have numbers spelled out in words, like when writing a sentence (“Two hundred sixty-one victims were hospitalized.”). How do you express numbers in words with T-SQL code?

I could see a bit more recursion in the solution making it tidier, but in fairness, a bit more recursion would make it a lot slower in T-SQL.

For moderate difficulty mode, try it in German. For hard mode, switch to French.

Comments closed

Removing Leading Zeroes from a String in T-SQL

Steve Stedman gets rid of leading zeroes:

When working with data in SQL Server, there may be times when you need to remove leading zeros from a string. This task can be particularly common when dealing with numerical data stored as strings, such as ZIP codes, product codes, or other formatted numbers. In this blog post, we’ll explore several methods to remove leading zeros in SQL Server.

I’m not sure I see the reason to use anything other than CAST() (or, better yet, TRY_CAST()), but Steve does show two other methods.

2 Comments

Random Date Generation in Python

Chris LaGreca spits out some dates:

I often work with time series data and find it useful to have a variety of ways to randomly generate dates. This particular example is great for evenly distributed date partitions. Running the script below with the default arguments will output a list of random dates, one for each month of the year.

It looks like this is generating based off of a uniform distribution, which probably makes the most sense for “give me a day of the month” data generation.

Comments closed

An Introduction to pg_timeseries

Samay Sharma and Jason Petersen have an announcement:

We are excited to launch pg_timeseries: a PostgreSQL extension focused on creating a cohesive user experience around the creation, maintenance, and use of time-series tables. You can now use pg_timeseries to create time-series tables, configure the compression and retention of older data, monitor time-series partitions, and run complex time-series analytics functions with a user-friendly syntax. pg_timeseries is open-sourced under the PostgreSQL license and can be added to your existing PostgreSQL installation or tried as a part of the Timeseries Stack on Tembo Cloud.

Read on to learn more about how it works. The syntax and concepts do remind me a good bit of InfluxDB, as well.

Comments closed

Arbitrary Intervals for Partitioning in Postgres

Keith Fiske does a bit of interval math:

Whether you are managing a large table or setting up automatic archiving, time based partitioning in Postgres is incredibly powerful. pg_partman’s newest versions support a huge variety of custom time internals. Marco just published a post on using pg_partman with our new database product for doing analytics with PostgresCrunchy Bridge for Analytics. So I thought this would be a great time to review the basic and complex options for the time based partitioning.

Read on for a note of how pg_partman works and interval management, especially for versions earlier than 5.0.

Comments closed

Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.

Comments closed

Migrating DATETIME Data to DATETIMEOFFSET

William Assaf adds some time zones:

I recently reviewed, worked on, and added a similar example to the DATETIMEOFFSET Microsoft Learn Docs article at the recommendation of my colleague Randolph West, who guessed (accurately) I would enjoy such a task. It was a nice pre-Build diversion. 

This topic is one that I have co-presented on in the past and hounded project capstone review presentations about. If you’re not storing time zone offset in your date/time data, you’re setting yourself up for future pain. That future pain is not what this blog post is about.

My preference is not to store time zone offset but instead store everything in UTC and perform any time zone switcharoos in the UI. But if you are storing local dates and times, I completely agree that you should keep track of the time zone. I worked for an east coast US company that bought a west coast US company, and both stored local dates and times in their SQL Server databases, making data consolidation a real challenge.

Comments closed