Press "Enter" to skip to content

Day: December 9, 2025

Using Haskell for Data Science

Jonathan Carroll has my attention:

I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed from R until I found the dataHaskell (www.datahaskell.org) project.

There have been several attempts recently to enhance R with some strong types, e.g.  vapour (vapour.run), typr (github.com), using {rlang}’s checks (josiahparry.com), and even discussions about implementations at the core level e.g.  in September 2025 (stat.ethz.ch) continued in November 2025 (stat.ethz.ch). While these try to bend R towards types, perhaps an all-in solution makes more sense.

In this post I’ll demonstrate some of the features and explain why I think it makes for a good (great?) data science language.

I’ve been a big fan of F# for data science work as well for similar reasons, so it was interesting to read this article on Haskell. H/T R-Bloggers.

Leave a Comment

IBM Acquires Confluent

Confluent has an announcement:

We are excited to announce that Confluent has entered into a definitive agreement to be acquired by IBM. After the transaction is closed (subject to customary closing conditions and regulatory approvals), together, IBM and Confluent will aim to provide a platform that unifies the world’s largest enterprises, unlocking data for cloud/microservices, accelerating time-to-value, and building the real-time data foundation required to scale AI across every organization. 

Whelp. I suppose it was bound to happen at some point, but I definitely can’t say this news pleases me.

Leave a Comment

Generating Shape-Bound Random Points in SQL Server

Sebastiao Pereira generates some numbers:

Random number generation is vital in computer science, supporting fields like optimization, simulation, robotics, and gaming. The quality, speed, and sometimes security of the generator can directly affect an algorithm’s correctness, performance, and competitiveness. In Python, random number generation is well-supported and widely used. In this article, we will look how to we can use SQL to do this.

Click through for several examples.

Leave a Comment

SQL Managed Instance Memory vs Cores

Kendra Little hits a pain point:

Microsoft recently announced that Azure SQL Managed Instance Next-gen General Purpose (GPv2) is now generally available. GPv2 brings significant storage performance improvements over GPv1, and if you’re using GPv1, you should plan to upgrade.

But GPv2 still has the same memory-to-core ratio problem that makes Managed Instance a rough deal for running SQL Server. SQL Server is engineered to use lots of memory—it’s a rare OLTP or mixed-OLTP workload that doesn’t need significant cache for reliable performance. We’ll have a look at the pricing math.

Read on for Kendra’s detailed thoughts on GPv2 versus GPv1 and also how GPv2 still has its warts.

Leave a Comment

Accessing Excel Files from OneDrive via Power BI

Kristyna Ferris is happy:

I can’t believe it’s finally here! A way to have Excel live in OneDrive and access it from Power BI nearly live! We can officially short cut files to our OneLake from both SharePoint and OneDrive! I am super excited about this feature, and I hope you are too. This feature plus User Data Functions allows us to not only have data from Excel in our reports but keep it as fresh as needed. Imagine having budget allocations that you want to adjust right before or during a meeting. Now you can! You can edit a file in Excel and hit one button to see the new numbers in your report. In the past, we relied on 3rd party services or Power Apps licensing to accomplish this sort of experience. Now we can just use Excel, an old data friend.

Kristyna does note that this is in preview, so take it with that caveat in mind and read on to see how it all works.

Leave a Comment

UUIDv4 and UUIDv7 in PostgreSQL 18

Josef Machytka notes a change:

In the past there have been many discussions about using UUID as a primary key in PostgreSQL. For some applications, even a BIGINT column does not have sufficient range: it is a signed 8‑byte integer with range −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807. Although these values look big enough, if we think about web services that collect billions or more records daily, this number becomes less impressive. Simple integer values can also cause conflicts of values in distributed system, in Data Lakehouses when combining data from multiple source databases etc.

However, the main practical problem with UUIDv4 as a primary key in PostgreSQL was not lack of range, but the complete randomness of the values. This randomness causes frequent B‑tree page splits, a highly fragmented primary key index, and therefore a lot of random disk I/O. There have already been many articles and conference talks describing this problem. What many of these resources did not do, however, was dive deep into the on‑disk structures. That’s what I wanted to explore here.

In fairness to BIGINT, if you collect 10 billion records a day, that’s still about 180 million days of activity before you need to reset the ID, if I got my math right. But read on to see what UUIDv7 does for GUIDs.

Leave a Comment

Troubleshooting a Distributed Availability Group Failure

Jordan Boich digs in:

To give some background on this topology, they have a DAG comprised of two individual AGs. Their Global Primary AG (we’ll call this AG1) has three replicas, while their Forwarder AG (we’ll call this AG2) has two replicas. The replicas in AG1 are all in the same subnet and all Azure VMs. The replicas in AG2 are all in their own same subnet and all Azure VMs.

By the time we got logged in and connected, the Global Primary Replica was online and was able to receive connections. The secondary replicas in the Global Primary AG however, were unable to communicate with the Global Primary Replica. This is problem 1. The other secondary problem is that several databases on the Forwarder Primary Replica were in crash recovery. This is problem 2. Considering problem 2 was happening in the DR environment, that was put aside for the time being.

Read on for the troubleshooting process and solution.

Leave a Comment