
Day: March 4, 2025

Bass Product Diffusion and Data Science

John Mount does a fun analysis:

This is a graph of the percentage of Stack Overflow questions tagged with data science terms such as R, Pandas, and so on. It seems to show exploding interest in R and Pandas, and maybe even Tensorflow. Pandas was likely chosen as a proxy for interest in Python for data science (versus a general interest in Python). I’d prefer view counts over question percentages as a proxy of interest, but it is what it is.

Then I thought, let’s see if they have newer data. They do, and it is horrifying (though not unexpected to those of us in the industry).

Click through for the analysis, as well as an important note in the comments.


Self-Hosted Integration Runtime Reconnecting to Cloud Service

Nivritti Suste handles an error:

In our organization, most data is stored on-premises, with a limited set of less critical data in the cloud. We use Azure to benefit from the cloud environment and Azure Data Factory (ADF) to move data.

With ADF, there are many components that need to integrate within the environment. The data on our on-premises servers needs to be shifted to the cloud periodically and we use Self-hosted Integration Runtime.

Our developers complain an ADF pipeline is failing with error: ‘The Self-hosted Integration Runtime is offline…’ What does this mean?

Click through for the answer.


A Mistake of “Normalization”

Hans-Jürgen Schönig makes an argument:

The concept of “normalization” is often the first thing people who are new to databases are going to learn. We are talking about one of the fundamental principles in the realm of databases. But what is the use of normalization in the first place? Well, we want to avoid redundancies in the data and make sure that information is stored in a way that helps reduce mistakes and inconsistencies. Ultimately, that is all there is to it: No redundancies, no mistakes, no inconsistencies.

There’s an example in this of “too much normalization” but I’m going to push back because this is a common misunderstanding of the idea of normalization.

The example covers removing price from an invoice table and having people look up the price from the product table, as having each price in an invoice is duplication, and we’re trying to eliminate duplication.

This argument is wrong, because it conflates two concepts. The listing price of an item is its current price. This is the thing you will see on a products table. The sale price of an item on the invoice table is a historical artifact and is not the same as the listing price, even if the dollar amounts match. Hans-Jürgen points out the consequence of making this mistake, and is correct in pointing this out. But it’s not “too much normalization”: the mistake is a misunderstanding of the domain model, and eliminating sale price from the invoice table would remove information. Properly following the rules of normalization means you cannot lose information, because every decomposition into a higher normal form must be lossless. In this case, we remove an attribute based on a faulty assumption that there is a functional dependency from product ID to sale price (that is, every time you see a specific product ID, you will always see the same sale price). That’s the crux of the issue in this example, but the concept of normalization catches strays as a result of the faulty assumed functional dependency.
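To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (product, invoice_line, list_price_cents, sale_price_cents) are hypothetical and are not taken from Hans-Jürgen's post. It records a sale, changes the listing price, records another sale, and then compares invoice totals computed from the stored sale price against totals computed by looking up the current listing price.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical product and invoice tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product (
            product_id       INTEGER PRIMARY KEY,
            name             TEXT    NOT NULL,
            list_price_cents INTEGER NOT NULL   -- current listing price
        );
        CREATE TABLE invoice_line (
            invoice_id       INTEGER NOT NULL,
            product_id       INTEGER NOT NULL REFERENCES product(product_id),
            quantity         INTEGER NOT NULL,
            sale_price_cents INTEGER NOT NULL,  -- price actually charged
            PRIMARY KEY (invoice_id, product_id)
        );
        """
    )
    conn.execute("INSERT INTO product VALUES (1, 'Widget', 1000)")
    # First sale happens at the original listing price.
    conn.execute("INSERT INTO invoice_line VALUES (100, 1, 2, 1000)")
    # The listing price changes; a later sale uses the new price.
    conn.execute("UPDATE product SET list_price_cents = 1200 WHERE product_id = 1")
    conn.execute("INSERT INTO invoice_line VALUES (101, 1, 1, 1200)")
    return conn


def totals_from_sale_price(conn: sqlite3.Connection) -> dict:
    """Invoice totals based on the price stored on each invoice line."""
    rows = conn.execute(
        "SELECT invoice_id, SUM(quantity * sale_price_cents) "
        "FROM invoice_line GROUP BY invoice_id"
    )
    return dict(rows.fetchall())


def totals_from_product_lookup(conn: sqlite3.Connection) -> dict:
    """Invoice totals if sale price were dropped and looked up from product."""
    rows = conn.execute(
        "SELECT il.invoice_id, SUM(il.quantity * p.list_price_cents) "
        "FROM invoice_line il JOIN product p USING (product_id) "
        "GROUP BY il.invoice_id"
    )
    return dict(rows.fetchall())


def distinct_sale_prices(conn: sqlite3.Connection, product_id: int) -> int:
    """More than one distinct value means product_id does not determine sale price."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT sale_price_cents) FROM invoice_line WHERE product_id = ?",
        (product_id,),
    ).fetchone()
    return row[0]
```

In this sketch the lookup-based total for the first invoice silently changes after the price update, and the same product ID appears with two different sale prices, so the assumed functional dependency does not hold in the data.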


Dealing with Optional Carriage Returns in SSIS

Andy Brownsword has fun with file formats:

When ingesting files in SSIS via Flat File Connections, a consistent format is key. Sometimes that isn’t the case. Here we’ll look at an example where the carriage return (CR, \r) may or may not be included in the file.

Pepperidge Farm remembers back in the day when Windows, classic Mac OS, and Linux (or any flavor of UNIX for that matter) each had a different way of ending a line: carriage return plus line feed, carriage return alone, and line feed alone, respectively. And of course most tools weren’t smart enough to figure out which convention your particular text file followed and display it correctly.
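For readers who want to see the three conventions side by side, here is a small Python sketch that counts each kind of line ending in a byte string and rewrites them all to one convention. This is not Andy's SSIS approach; it is a hypothetical pre-processing step, and the function names are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs, and lone LFs in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF pair
        "lf": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF pair
    }


def normalize_line_endings(data: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite every CRLF, lone CR, or lone LF as the requested newline."""
    # Collapse CRLF first so the pair is not treated as two separate endings.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


mixed = b"a\r\nb\nc\rd"
print(count_line_endings(mixed))
print(normalize_line_endings(mixed))
```

Working on bytes rather than text avoids Python's own newline translation when a file is opened in text mode, which would otherwise hide the inconsistency being measured.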
