Month: March 2024

Using the map() Function in purrr

Steven Sanderson reads the map():

In the world of data manipulation and analysis with R, efficiency and simplicity are paramount. One function that epitomizes these qualities is map(). Whether you’re a novice or a seasoned R programmer, mastering map() can significantly streamline your workflow and enhance your code readability. In this guide, we’ll delve into the syntax, usage, and numerous examples to help you harness the full power of map().

Click through for examples of how this works in R.

A Bayesian Approach to CAPTCHAs

John Cook claims to be human:

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

Read on to see how a Bayesian approach to the problem could make things a bit less annoying.
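
To make the Bayesian idea concrete, here is a minimal sketch in Python. It is not John's actual calculation: it assumes a prior probability that the visitor is human and fixed per-challenge success rates for humans and bots. The function name `posterior_human` and all three numbers are hypothetical, chosen only for illustration.

```python
def posterior_human(prior, p_human, p_bot, outcomes):
    """Update P(human) after each challenge outcome (True = answered correctly)."""
    p = prior
    for correct in outcomes:
        # Likelihood of this outcome under each hypothesis
        like_h = p_human if correct else 1.0 - p_human
        like_b = p_bot if correct else 1.0 - p_bot
        # Bayes' rule: posterior is proportional to prior times likelihood
        p = p * like_h / (p * like_h + (1.0 - p) * like_b)
    return p

# Hypothetical inputs: even prior, humans right 90% of the time, bots 50%
all_right = posterior_human(0.5, 0.9, 0.5, [True] * 5)
one_miss = posterior_human(0.5, 0.9, 0.5, [True] * 4 + [False])
print(f"five correct: {all_right:.3f}, four correct and one miss: {one_miss:.3f}")
```

Under a model like this, a single miss lowers the estimate instead of resetting it, so a test could in principle stop as soon as the posterior crosses a threshold rather than demanding 20 perfect answers.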

Announcements from the Microsoft Fabric Community Conference

James Serra gives us the round-up:

A ton of new features for Microsoft Fabric were announced at the Microsoft Fabric Community Conference. Here are all the new features I am aware of, with some released now and others coming soon:

  • Mirroring is now in public preview for Cosmos DB, Azure SQL DB and Snowflake. See Announcing the Public Preview of Database Mirroring in Microsoft Fabric
  • You get a free terabyte of Mirroring storage for replicas for every capacity unit (CU) you have purchased and provisioned. For example, if you purchase F64, you will get sixty-four free terabytes worth of storage for your mirrored replicas

Click through for a couple dozen more announcements. They’ve been quite busy on Microsoft Fabric.

Mirroring an Azure SQL Database in Microsoft Fabric

Gilbert Quevauvilliers holds up a mirror:

Creating a Mirrored Azure SQL Database in Fabric

This week they announced the Public Preview of Database Mirroring in Microsoft Fabric (Microsoft Power BI Blog).

I decided to see how easy it was to create a mirrored database in Fabric and below are my findings (PS it is AMAZING)

Click through for the demo. Though it does look like Gilbert has mirrored the contents of the blog post a few times as well, at least as of the time of my post here.

SSMS 20 and Mandatory Connection Security

Chad Callihan hits an annoyance:

I tried to run a new query for a CMS but the query window opened as disconnected. If I selected one server out of the group and tried to open a new query, I received an error that “A connection was successfully established with the server, but then an error occurred during the login process.”

That can get really annoying if you have a few hundred instances in your Central Management Server. The errors would all go away if you set up certificates for the servers, but until then, it would be a major hassle.

Parallel Vector Index Rebuild in Postgres

Semab Tariq takes a look at parallel index building in pgvector:

Parallel Index Build refers to the capability to build indexes using parallel processing. In simpler terms, it means that multiple workers or threads can be utilized simultaneously to create an index, which can significantly speed up the index creation process.

When performing an index build operation, PostgreSQL can divide the work among multiple parallel workers, each responsible for building a portion of the index.

Read on to learn more about this bit of functionality in pgvector 0.6 and the performance gains you can get from it.

Processing GitHub Data with Kafka Streams

Lucia Cerchie hits the GitHub API:

GitHub’s data sources (REST + GraphQL APIs) are not only developer-friendly, but a goldmine of interesting statistics on the health of developer communities. Companies like OpenSauced, linearb, and TideLift can measure the impact of developers and their projects using statistics gleaned from GitHub’s APIs. The results of GitHub analysis can change both day-to-day and over time.

Apache Kafka is a large and active open source project with nearly a million lines of code. It also happens to be an event streaming platform. So why not use Apache Kafka to, well, monitor itself? And learn a bit about Kafka Streams along the way?  

Click through for the full article, including a demonstration.

Working with INTERSECT and EXCEPT

Erik Darling wounds me:

I have never once seen anyone use these. The most glaring issue with them is that unlike a lot of other directives in SQL, these ones just don’t do a good job of telling you what they do, and their behavior is sort of weird.

Unlike EXISTS and NOT EXISTS, which state their case very plainly, as do UNION and UNION ALL, figuring these out is not the most straightforward thing. Especially since INTERSECT has operator precedence rules that many other directives do not.

I’ve used EXCEPT to check if two datasets are equivalent for testing purposes: A EXCEPT B should be zero rows, and B EXCEPT A should be zero rows. It has built-in handling of any NULL madness. Set intersections have their uses as well.
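
As a rough illustration of that two-way check, here is a minimal sketch using SQLite through Python's built-in sqlite3 module as a stand-in for SQL Server. The tables `a`, `b`, and `c` and the helper `same_rows` are hypothetical names made up for this example.

```python
import sqlite3

def same_rows(conn, left, right):
    """True when left EXCEPT right and right EXCEPT left both return zero rows."""
    # Table names are interpolated directly, so pass trusted names only
    left_only = conn.execute(
        f"SELECT * FROM {left} EXCEPT SELECT * FROM {right}").fetchall()
    right_only = conn.execute(
        f"SELECT * FROM {right} EXCEPT SELECT * FROM {left}").fetchall()
    return not left_only and not right_only

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, val TEXT);
    CREATE TABLE c (id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, NULL);
    INSERT INTO b VALUES (2, NULL), (1, 'x');  -- same rows as a, different order
    INSERT INTO c VALUES (1, 'x'), (2, 'y');   -- differs from a in the second row
""")

a_matches_b = same_rows(conn, "a", "b")
a_matches_c = same_rows(conn, "a", "c")
print(a_matches_b, a_matches_c)
```

The NULL in row 2 matches its counterpart because set operators compare NULLs as equal, which an ordinary join on `val` would not do. One caveat: EXCEPT returns distinct rows, so this check will not notice when the two tables differ only in how many times a row is duplicated.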

Working with TRY-CATCH in SQL Server

Steve Jones gives it the ol’ college try:

This is a common error handling technique in other languages. C# uses it, as does Java, while Python has TRY EXCEPT. There are other examples, but these are good habits to get into when you don’t know how code will behave or if there is something in your data or environment that could cause an issue.

In SQL, I think many of us get used to writing one statement in a query and forget to do error handling, or transactions. However, this can be a good habit as your code might grow and people might add more statements that should execute.

Read on for a few examples of how to use SQL Server’s TRY-CATCH functionality. It’s not perfect, but as Steve shows, there are definitely good uses for it.
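
For a sense of the pattern outside of T-SQL, here is a minimal sketch of the Python analogue Steve mentions, wrapping a two-statement transaction in try/except. It uses SQLite through the built-in sqlite3 module rather than SQL Server, and the `accounts` table and `transfer` function are hypothetical, so treat it as an illustration of the idea and not of SQL Server's own TRY-CATCH syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds in one transaction; undo both updates if either one fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; nothing was committed
        return False

overdraft_ok = transfer(conn, 1, 2, 500)
after_overdraft = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id").fetchall()
small_ok = transfer(conn, 1, 2, 40)
after_small = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(overdraft_ok, after_overdraft, small_ok, after_small)
```

The point matches the one in the excerpt: once a second statement joins the first, error handling and a transaction together keep a failure in one statement from leaving the other's change behind.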

Comparing pg_basebackup Compression Settings

Kaarel Moppel puts on the lab coat and safety glasses:

In my last post I did a quick check on the performance of the newer (lz4, zstd) pg_dump compression options, which included setting up a small framework to download some openly available “real life”-ish sample datasets. And the general result was that, indeed – the new algos in lower levels provide the best value, especially zstd.

But pg_dump is about compressing essentially text based data…but how about binary Postgres data? Thus the tool to test here additionally is pg_basebackup, with its newer (v15+) compression options. So let’s see if something stands out consistently again.

Click through for the test results.
