Press "Enter" to skip to content

Month: November 2021

Replacing p-values with Bootstrapped Confidence Intervals

Florent Buisson has an interesting post on avoiding p-value calculations:

And indeed, I worked with highly-skilled data scientists who had a very sharp understanding of statistics. But after years of designing and analyzing experiments, I grew dissatisfied with the way we communicated results to decision-makers. I felt that the over-reliance on p-values led to sub-optimal decisions. After talking to colleagues in other companies, I realized that this was a broader problem, and I set out to write a guide to better data analysis. In this article, I’ll present one of the biggest recommendations of the book, which is to ditch p-values and use Bootstrap confidence intervals instead.

I’m a committed Bayesian (or at least a Bayesian who should be committed—depends on who you ask), so I’d consider this a big step forward.
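If you want a feel for the mechanics, here is a minimal sketch in Python (my illustration, not Florent's code), using made-up A/B test data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up A/B test data: revenue per user for control and treatment
control = rng.normal(10.0, 3.0, size=500)
treatment = rng.normal(10.5, 3.0, size=500)

# Resample each group with replacement and record the difference in means
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = t.mean() - c.mean()

# The 2.5th and 97.5th percentiles give a 95% bootstrap confidence interval
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the lift: [{lo:.2f}, {hi:.2f}]")
```

A decision-maker can read that interval directly: if the whole interval sits above zero, the treatment looks like a winner, and the width of the interval tells you how much uncertainty remains.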


Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the models might expect a certain degree of consistency when “playing” with input values.

A classic example is the statistical house appraisal model. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithm make this easy, but random forest? Not so much. That’s where Michael steps in.
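Click through for Michael's R and Python implementations. As a rough sketch of the concept (not his code), LightGBM's random forest mode accepts monotone constraints directly; everything here assumes LightGBM's scikit-learn API and fabricated data:

```python
import numpy as np
import lightgbm as lgb

# Made-up appraisal data: price rises with square footage, falls with age
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=2000)
age = rng.uniform(0, 100, size=2000)
X = np.column_stack([sqft, age])
y = 120 * sqft - 500 * age + rng.normal(0, 20_000, size=2000)

model = lgb.LGBMRegressor(
    boosting_type="rf",            # random forest rather than boosting
    n_estimators=300,
    bagging_freq=1,                # rf mode requires row subsampling
    bagging_fraction=0.7,
    monotone_constraints=[1, -1],  # +1: increasing in sqft; -1: decreasing in age
)
model.fit(X, y)

# Ceteris paribus check: sweeping sqft with age held fixed never lowers the prediction
grid = np.column_stack([np.linspace(500, 4000, 50), np.full(50, 30.0)])
assert np.all(np.diff(model.predict(grid)) >= 0)
```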


Creating a Clock with PowerShell

Jeffery Hicks knows what time it is (it’s clobberin’ time):

I’ve published a new project to the PowerShell Gallery. This is something that I needed, and maybe you do as well. Even though I have the typical clock running in the Windows taskbar, I have an ultrawide monitor, so it isn’t always easy to read. I had been running the xclock app from WSL, which was nice. But I realized what I really wanted was a WPF-based clock that would display a formatted date-time string on my desktop.

Now there’s PSClock, which is similar to xclock, and you can check it out.
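The tool itself is PowerShell and WPF, but the core idea, a small always-on-top window that re-renders a formatted date-time string every second, translates to just about any GUI toolkit. Here is a rough Python/tkinter sketch of the same concept (my analogue, not Jeffery's code):

```python
import time
import tkinter as tk

# A small always-on-top window showing a formatted date-time string
root = tk.Tk()
root.title("Clock")
root.attributes("-topmost", True)

label = tk.Label(root, font=("Segoe UI", 36))
label.pack(padx=20, pady=10)

def tick():
    label.config(text=time.strftime("%A, %B %d  %H:%M:%S"))
    root.after(1000, tick)  # re-render once per second

tick()
root.mainloop()
```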


Using ConcatenateX in Power BI

Reza Rad describes a DAX function:

It happens often in Power BI calculations and reports that you need to concatenate a list of values from a column. You can do this concatenation in Power Query or DAX. However, if the concatenation needs to be done dynamically, ConcatenateX is a very helpful DAX function to achieve such results. It is also very helpful for understanding what happens in virtual tables in DAX. In this article and video, I’ll explain what ConcatenateX is and how it works in Power BI and DAX.

Click through for a video, as well as a detailed explanation in blog post format.
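If you think in Python terms, ConcatenateX behaves much like a grouped string join: iterate a table, evaluate an expression per row, and glue the results together with a delimiter. A rough pandas analogue (not DAX, and not Reza's example):

```python
import pandas as pd

# Made-up sales table
sales = pd.DataFrame({
    "Category": ["Bikes", "Bikes", "Accessories", "Accessories"],
    "Product":  ["Road-150", "Mountain-200", "Helmet", "Gloves"],
})

# For each category, iterate the rows and join the product names with a
# delimiter, much as CONCATENATEX iterates a (virtual) table in DAX
result = sales.groupby("Category")["Product"].agg(", ".join)
print(result)
# Accessories -> Helmet, Gloves
# Bikes       -> Road-150, Mountain-200
```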


Pokey Performance with EXISTS

Erik Darling reminds us that even good things can go bad:

Look, I really like EXISTS and NOT EXISTS. I do. They solve a lot of problems.

This post isn’t a criticism of them at all, nor do I want you to stop using them. I would encourage you to use them more, probably.

But there’s some stuff you need to be aware of when you use them, whether it’s in control-flow logic, or in queries.

But do read on to see a specific type of issue you can run into with a left semi join.


Dynamic Column Rename in Power BI with XMLA and TOM

Kristyna Hughes solves a problem:

For the TOM and XMLA experts, imagine this. Your customer wants to dynamically rename columns without using the Power BI Desktop and would prefer all existing report visuals not get broken by the new name. Impossible? Not with TOM, XMLA, and translations within Power BI.

If you’ve ever tried to change a column name in a Power BI source, you’ve likely run into this error on any visuals that contained the renamed column. And when you hit that “See Details”, it will tell you the column that you simply renamed is no longer available for your visual.

So how do we get around that?

Read on to see how.


When to Start Using a Database with R or Python

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you; I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, at least I know how long it will take!), but your boundary may lie at some other point. When you reach that point of annoyance, or the point of no longer being able to do your work, you should improve your workflow.

I will show you how to get some speedups by using other R packages, by moving from pandas to polars in Python, or by leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons: one, it is super simple; two, it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data, and both languages have plenty of options for database access.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
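As a taste of how low the barrier is, here is a minimal sketch, assuming the duckdb Python package: it runs SQL straight against an in-memory pandas DataFrame, with no server and no load step.

```python
import duckdb
import numpy as np
import pandas as pd

# A dataframe that has grown past the point of comfortable pandas work
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "store": rng.choice(["A", "B", "C"], size=2_000_000),
    "sales": rng.exponential(100.0, size=2_000_000),
})

# DuckDB discovers the local dataframe by name and queries it with SQL
result = duckdb.query(
    "SELECT store, SUM(sales) AS total_sales FROM df GROUP BY store ORDER BY store"
).to_df()
print(result)
```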


SQL Saturday Orlando Notes

Andy Warren reflects on hosting the only in-person SQL Saturday in the United States this year:

We held an in-person SQLSaturday here in Orlando last weekend (Oct 30th). We didn’t organize one last year, there was just too much risk and too much uncertainty, so it felt good to return to something close to normal this year, even in scaled back fashion. I’ve got a lot of notes to share about how we ran the event this year!

The journey started at the end of 2020. We wrote up our plan for 2021 knowing there were a lot of unknowns, but hoping things would improve enough to resume doing the things we used to do as a local group, and that included organizing a SQLSaturday. As this year has progressed, attendance at our virtual meetings dropped, as did our enthusiasm for having them. Enthusiasm matters a lot when it comes to volunteer work, and while I know many of you like the virtual format, it’s just not what I want to do. That narrowed the option list to having an in-person SQLSaturday or not doing one at all, not a great range of choices.

Read on for a lot of details. I appreciate how transparent Andy has always been with respect to running events like this, and if you’re thinking about a SQL Saturday in 2022, definitely read Andy’s post.

Also, the event was small, but it was really nice to get to see people I hadn’t seen in years, so thank you, Andy, for putting on the show.


Updates to Azure Synapse Link

Aria Jelinek outlines the value of Azure Synapse Link:

New as of Ignite 2021, customers can optimize queries by setting custom partitions for their Azure Cosmos DB analytical store using keys that are commonly used as query filters. This compacts and optimizes the analytical data written to the partitioned store, resulting in better query performance even when working with a high volume of update or delete operations.

Azure Synapse Link is also now available for Azure Cosmos DB serverless accounts, expanding the integration to cover data from workloads with bursts of traffic or uncertain traffic patterns.

This post mostly covers the Dataverse and Cosmos DB integrations rather than the integration with SQL Server 2022.

On the whole, I like Azure Synapse Link for Cosmos DB and will probably like it for SQL Server 2022, maybe even a bit more. It does simplify the ELT process by taking care of the E and handling the first half of the L (landing data into a staging table). Though if data’s going into a dedicated SQL pool, I do hope the people doing this understand that dedicated SQL pools are intended for Kimball-style data warehousing scenarios, and there can be a considerable performance (and therefore price) hit if you simply replicate a bunch of stuff without subsequent transformation.


Optimization Tips with Inline Table-Valued Functions

Itzik Ben-Gan continues a series on table expressions:

This is the thirteenth and last installment in a series about table expressions. This month I continue the discussion I started last month about inline table-valued functions (iTVFs).

Last month I explained that when SQL Server inlines iTVFs that are queried with constants as inputs, it applies parameter embedding optimization by default. Parameter embedding means that SQL Server replaces parameter references in the query with the literal constant values from the current execution, and then the code with the constants gets optimized. This process enables simplifications that can result in more optimal query plans. This month I elaborate on the topic, covering specific cases for such simplifications such as constant folding and dynamic filtering and ordering. If you need a refresher on parameter embedding optimization, go over last month’s article as well as Paul White’s excellent article Parameter Sniffing, Embedding, and the RECOMPILE Options.

This was a really good series. If you haven’t seen the entries, set aside some time and check it out.
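Parameter embedding is a T-SQL optimizer behavior, but the intuition carries over to other languages. A loose Python analogue (my illustration, not Itzik's example): once a parameter is known to be a constant, the catch-all branch can be simplified away before any per-row work happens.

```python
def filter_orders(orders, customer_id=None):
    # General form: the catch-all test runs for every row,
    # like a plan compiled for an unknown parameter value
    return [o for o in orders
            if customer_id is None or o["customer_id"] == customer_id]

def make_filter(customer_id):
    # "Embedded" form: with the constant known up front, the None
    # branch folds away and the per-row predicate gets simpler
    if customer_id is None:
        return lambda orders: list(orders)
    return lambda orders: [o for o in orders
                           if o["customer_id"] == customer_id]

orders = [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 20}]
assert filter_orders(orders, 2) == make_filter(2)(orders)
```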
