
June 3, 2024

Reviewing Experimental Results in the Process

John Cook talks philosophy of statistics:

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

John also has a follow-up article:

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Read on for a very interesting discussion of the topic. I’m definitely in the Bayesian camp: learn quickly, update frequently, particularly early on, when you have little information on the topic and the marginal value of learning one additional piece of information is so high.
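To make the Bayesian bookkeeping concrete, here is a minimal sketch in my own notation (not John's): give each variant's conversion rate $\theta$ a Beta(1, 1) prior, and after observing $s$ conversions among $n$ visitors the posterior is

$$
\theta \mid s, n \;\sim\; \mathrm{Beta}(1 + s,\ 1 + n - s).
$$

The binomial likelihood $\theta^{s}(1-\theta)^{n-s}$ doesn't depend on when or how often you peek, so the same posterior is valid after 500 visitors or 1,000. The frequentist $\alpha$, by contrast, is a property of the sampling procedure, which is precisely what an unplanned peek alters; that asymmetry is the heart of the debate in both posts.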


Generating Data in SQL Server based on Distributions

Rick Dobson builds some data:

I support a data science team that often asks for datasets with different distribution values in uniform, normal, or lognormal shapes. Please present and demonstrate the T-SQL code for populating datasets with random values from each distribution type. I also seek graphical and statistical techniques for assessing how a random sample corresponds to a distribution type.

This is an interesting article, though if you want a set-based version of generating data according to a normal distribution, I have a blog post where I translated the RBAR version into something that performs a bit better. Converting to log-normal form also makes a lot of intuitive sense.
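If you just want the gist of the set-based approach, here is a minimal Box-Muller sketch (my illustration here, not the code from Rick's article or my older post); sys.all_objects is only a convenient row source, and the mean of 0 and standard deviation of 1 are arbitrary:

```sql
/* Set-based Box-Muller sketch: two independent uniforms per row
   become one standard normal draw. RAND(CHECKSUM(NEWID())) forces
   a fresh uniform per row; bare RAND() evaluates once per query. */
WITH uniforms AS (
    SELECT TOP (1000)
           RAND(CHECKSUM(NEWID())) AS u1,
           RAND(CHECKSUM(NEWID())) AS u2
    FROM sys.all_objects AS o1
         CROSS JOIN sys.all_objects AS o2    -- cheap row source
)
SELECT bm.z                  AS normal_value,     -- approximately N(0, 1)
       EXP(0.0 + 1.0 * bm.z) AS lognormal_value   -- EXP(mu + sigma * z)
FROM uniforms
     CROSS APPLY (
         -- LOG(1 - u1) avoids LOG(0), since RAND() returns values in [0, 1)
         SELECT SQRT(-2.0 * LOG(1.0 - u1)) * COS(2.0 * PI() * u2) AS z
     ) AS bm;
```

Swap in a proper tally table if sys.all_objects offends; the point is simply that one set-based statement replaces the WHILE loop.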


Comparing Microsoft Fabric Warehouse and Lakehouse Performance

Reitse Eskens busts out the stopwatch:

I just can’t seem to stop doing this, checking the limits of Microsoft Fabric. In this instalment I’ll try and find some limits on the data warehouse experience and compare them with the Lakehouse experience. The data warehouse is a bit different compared to the Lakehouse, so I’ll be digging into that one first. Then I’m going to load data into the warehouse with a copy data pipeline followed by some big queries to test performance. The Fabric Capacity App will be used to check out the capacity necessary (or used for that matter).

As usual, I’m using the F2 capacity as it’s the one that should break the easiest. It’s also the cheapest one to run tests against and, as the capacity calculation isn’t dependent on the SKU (Stock Keeping Unit), you can easily translate to find out which capacity SKU will fit the workload. Remember that your workload will differ from the one shown in this blog. These tests are a comparison between the different offerings, something you could do for yourself. These blogs are a bit of a happy place as every option will get a good chance. In your work, your skills (and those of your co-workers) will be a major driver towards an option. Even if this offers the chance to learn something new!

Reitse focuses on ingesting and transforming data, and the results are quite interesting.


Invoking a Fabric Data Factory Pipeline via REST API

Andy Leonard makes a call:

This post is current as of 30 May 2024. There are other posts by fantastic bloggers about how to use the Fabric REST API. Fabric development is progressing so fast, some of those posts are less up-to-date. Make no mistake, this post will most likely not age well, and for the very same reason. That’s ok. We bloggers live to serve. I, like all the rest, will endeavor to persevere – and we will all write more posts, Lord willing.

In this post, I share one way to invoke Fabric Data Factory pipelines using the REST API.
I will be using the web version of Postman to call REST API methods.
You can sign up for a free Postman account. Since it’s free, I encourage you to check the box to receive news and offers from them. As I mentioned in an earlier post, you can always unsubscribe if the messages are unhelpful or if they get too “chatty.”

Read on for that way.
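If you'd like to see the shape of the call before clicking through, here is a hypothetical T-SQL rendering of it (not Andy's Postman walkthrough) using sp_invoke_external_rest_endpoint, which is available in Azure SQL Database. The GUIDs and token are placeholders, and the URL follows the Fabric job scheduler endpoint as documented at the time of writing:

```sql
/* Hypothetical sketch: run a Fabric Data Factory pipeline on demand.
   <workspace-id>, <pipeline-id>, and the bearer token are placeholders. */
DECLARE @token    nvarchar(max) = N'<your Entra bearer token>';
DECLARE @headers  nvarchar(max) = N'{"Authorization":"Bearer ' + @token + N'"}';
DECLARE @response nvarchar(max);

EXEC sp_invoke_external_rest_endpoint
     @url      = N'https://api.fabric.microsoft.com/v1/workspaces/<workspace-id>/items/<pipeline-id>/jobs/instances?jobType=Pipeline',
     @method   = N'POST',
     @headers  = @headers,
     @response = @response OUTPUT;

-- On success the API answers 202 Accepted; the job instance URL
-- comes back in the response headers.
SELECT @response AS raw_api_response;
```

The same POST works from Postman, curl, or anything else that can attach a bearer token.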


Using PostGIS in the Terminal

Dian M. Fay talks turkey about terminals:

Of late, I’ve been falling down a bunch of geospatial rabbit holes. One thing has remained true in each of them: it’s really hard to debug what you can’t see.

There are ways to visualize these. Some more-integrated SQL development environments like pgAdmin recognize and plot columns of geometry type. There’s also the option of standing up a webserver to render out raster and/or vector tiles with something like Leaflet. Unfortunately, I don’t love either solution. I like psql, vim, and the shell, and I don’t want to do some query testing here and copy others into and out of pgAdmin over and over; I’m actually using Leaflet and vector tiles already, but restarting the whole server just to start debugging a modified query is a bit much in feedback loop time.

Read on for Dian’s recommendations.
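Before the fancier options, it's worth remembering how far the text-only accessors get you inside psql; here is a minimal sketch with invented table and column names, and not necessarily Dian's technique:

```sql
-- Quick sanity check on a geometry column without leaving psql.
SELECT id,
       ST_SRID(geom)    AS srid,      -- is the coordinate system what you expect?
       ST_IsValid(geom) AS is_valid,  -- catches self-intersections and the like
       ST_AsText(geom)  AS wkt        -- human-readable WKT
FROM   parcels
LIMIT  5;
```

A one-liner like SELECT ST_Extent(geom) FROM parcels; is likewise handy for spotting data that landed in the wrong SRID.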


Impossible Execution Plan Timings

Paul White puts up an article:

I showed a hidden option to make all operators report only their individual times in More Consistent Execution Plan Timings in SQL Server 2022. That feature isn’t complete yet, so the results aren’t perfect, and it’s not documented or supported.

I mention all that in case you are interested in the background. None of the foregoing explains what we see in this mixed mode plan. The row mode Gather Streams elapsed time ought to include its children. The batch mode Sort should just be reporting its own elapsed time.

With that understanding in mind, there’s no way the Sort could run for longer than the Gather Streams. What’s going on here?

Read on for a Paul White-level discussion of the topic, including a demo from Erik Darling.
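To see the impossibility with invented numbers: if the batch mode Sort truthfully reports 900 ms of its own time, then the row mode Gather Streams above it, whose elapsed time is supposed to include all of its children, can't truthfully report anything less than 900 ms. A smaller value means one of those reporting rules isn't holding, and that's the thread Paul pulls on.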
