

Dealing with Collinearity using Lasso Regression

Vinod Chugani always moves in the same direction:

One of the significant challenges statisticians and data scientists face is multicollinearity, particularly its most severe form, perfect multicollinearity. This issue often lurks undetected in large datasets with many features, potentially disguising itself and skewing the results of statistical models.

In this post, we explore the methods for detecting, addressing, and refining models affected by perfect multicollinearity. Through practical analysis and examples, we aim to equip you with the tools necessary to enhance your models’ robustness and interpretability, ensuring that they deliver reliable insights and accurate predictions.

Read on to learn a bit more about how collinearity works and how you can use lasso regression (instead of ridge regression) to deal with the problem.
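To make that contrast concrete, here’s a minimal sketch in Python with scikit-learn, using synthetic data and an arbitrary penalty strength: lasso’s L1 penalty can drive the coefficient of a redundant, collinear feature all the way to zero, whereas ridge would only shrink both correlated coefficients.

```python
# Minimal sketch: lasso zeroing out one of a pair of collinear features.
# Synthetic data; alpha is illustrative, not tuned.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=1e-6, size=n)  # near-perfect copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # expect one of the collinear pair at (or near) zero
```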


Point-in-Time Recovery with Postgres

Grant Fritchey restores some backups:

PostgreSQL has the capability to support backups, as I already described in my first article on the topic. PostgreSQL can also restore to a point in time. However, that does require you to change the way you’re performing your backups. This article advances our understanding of how to better protect your PostgreSQL databases by expanding database backups and restores into a fuller disaster recovery process built on point-in-time restores.

While the important part is the restore, in a classic chicken-or-egg conundrum, we can’t talk about restoring until we first have a backup, so I’ll start with how you need to back up your databases in preparation for a point-in-time restore.

Click through for the process and to see it in action.
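For a rough sense of the moving parts, here’s a sketch in Python; the paths, target timestamp, and archive location are placeholders, it assumes PostgreSQL 12 or later with WAL archiving already enabled, and Grant’s article remains the walkthrough to follow.

```python
# Sketch of point-in-time recovery prerequisites and knobs (PostgreSQL 12+).
# Paths, the timestamp, and the archive location are placeholders.
import subprocess
from pathlib import Path

DATA_DIR = Path("/var/lib/postgresql/restored")

# 1. Take a base backup -- the prerequisite for any point-in-time restore.
subprocess.run(
    ["pg_basebackup", "-D", "/backups/base", "-Ft", "-z", "-X", "stream"],
    check=True,
)

# 2. After restoring the base backup into DATA_DIR, tell the server where
#    archived WAL lives and when to stop replaying it.
with open(DATA_DIR / "postgresql.auto.conf", "a") as conf:
    conf.write("restore_command = 'cp /backups/wal/%f %p'\n")
    conf.write("recovery_target_time = '2024-09-15 12:00:00'\n")

# 3. A recovery.signal file puts the server into targeted recovery at startup.
(DATA_DIR / "recovery.signal").touch()
```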


Discerning a Star Schema from an Existing Report

Kelly Broekstra describes a common flow for business intelligence projects:

I have worked as a business intelligence developer for several years, and I’m always asked: “How do you convert user requirements to a functioning data model?”

I follow the Kimball methodology. For more information, check out the official pages.

But here are some specific tips on what works for me.

Click through for those tips.


Microsoft Fabric Direct Lake and Reframing Operations

Reza Rad changes the frame:

Power BI offers a new type of connection to a Microsoft Fabric Lakehouse or Warehouse, called Direct Lake. A Direct Lake connection acts like DirectQuery and doesn’t need the data to be refreshed. However, the Power BI semantic model has refresh settings that can be turned on or off. In this article and video, you will learn about the refresh settings for a Power BI semantic model connected using Direct Lake: what they do and why the operation is called reframing.

Read on to learn more, or to check out the video.


mssql-tools 18 and Two Common Errors

Vlad Drumea covers a pair of errors you might run into with mssql-tools version 18:

In this post I cover the 0A000086 and “command not found” errors that you might encounter with the new version of SQL Server command-line tools, namely sqlcmd and bcp, for Linux.

While the latest version of SQL Server command-line tools, based on Microsoft ODBC 18, brings improvements, it also brings some gotchas that can break your automations.

Read on to learn more about each.
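For context: tools 18 are built on ODBC Driver 18, which defaults to encrypted connections with certificate validation, and 0A000086 is OpenSSL rejecting a certificate it can’t verify (the “command not found” error typically means the new install location, /opt/mssql-tools18/bin, isn’t on your PATH). The same encryption default affects other clients built on that driver; here’s a small illustration in Python with pyodbc, with server and credentials as placeholders:

```python
# ODBC Driver 18 defaults to Encrypt=yes, so a server with a self-signed
# certificate fails certificate validation (OpenSSL error 0A000086) unless
# the cert is trusted or validation is relaxed. Placeholder credentials.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=master;UID=myuser;PWD=mypassword;"
    "TrustServerCertificate=yes;"  # dev/test workaround; use a trusted cert in prod
)
print(conn.execute("SELECT @@VERSION").fetchone()[0])
```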


The Brokenness of TABLESAMPLE

Paul White walks us through some issues:

Initial testing went well, which was a pleasant surprise. Soon enough though, errors started to appear in the tool’s output. That’s not entirely unexpected since ensuring consistent results under high concurrency tends to expose all sorts of niggly edge cases. It’s still an annoyance because debugging edge cases in trigger code can be tricky and laborious.

What was a surprise though was the nature of the error messages.

Read on for the full story. Paul has also created a feedback issue covering a problem with the function.


Microsoft Fabric Capacities and Reserved Instances

Marc Lelijveld shares an experience:

Last week, I had a situation in which a client wanted to purchase a reserved instance Fabric capacity. Me being me, I assumed it would be super straightforward to purchase through Azure. However, at some point I got lost in the process, and the official documentation confused matters even more. In the end, I figured it out and managed to get a capacity running based on the Reserved Instance pricing. I didn’t find any other blogs or articles describing this confusion or this specific case, so I decided to write down my thoughts and findings in a blog.

This blog is relevant not only if you work with Microsoft Fabric, but also for anyone currently working with Power BI Premium. Given the deprecation of Power BI Premium capacities, you will have to switch to Fabric capacities sooner or later.

Read on to learn more about the differences between pay-as-you-go and reserved instance capacities, the process to make a reservation, and what comes after that before you have a Microsoft Fabric capacity ready to go.


From Pandas to Polars

Ari Lamstein explains why it might be worth a switch:

I recently decided to switch from Pandas to Polars for my Python projects that use dataframes. I came to this decision while taking a workshop on Polars last week: I found its syntax to be so intuitive that I couldn’t justify continuing to try to get “better” at Pandas, despite Pandas being the more established library. The fact that Polars is faster (its main selling point) was, surprisingly, not a factor in my decision.

A similar transformation recently happened in R. For most of the history of R there was only one way to interact with dataframes: Base R. Then the Tidyverse came along, and offered both performance improvements and easier syntax. Eventually the Tidyverse became the primary way that many people interact with dataframes. I believe that the Tidyverse’s easier syntax is what led to its widespread adoption, and I think that something similar is likely to happen with Polars.

Click through for Ari’s thoughts on the matter. H/T R-Bloggers.
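For a taste of the syntax difference, here’s the same aggregation written both ways, on a toy DataFrame (method names follow recent pandas and Polars releases):

```python
# The same group-and-aggregate in pandas and in Polars.
import pandas as pd
import polars as pl

data = {"city": ["NYC", "NYC", "LA"], "sales": [100, 150, 200]}

# pandas: groupby, with as_index=False to keep "city" as a column
pd_result = pd.DataFrame(data).groupby("city", as_index=False)["sales"].sum()

# Polars: an expression pipeline that reads top to bottom
pl_result = pl.DataFrame(data).group_by("city").agg(pl.col("sales").sum())

print(pd_result)
print(pl_result)
```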


Indexing Vector Databases

Brendan Tierney continues a series on vector databases:

In this post on vector databases, I’ll explore some of the commonly used indexing techniques available in databases. I’ll also explore the vector indexes available in Oracle 23c. Be sure to check that section towards the end of the post, where I’ll also include links to other articles in this series.

As with most data in a database, indexes are used for fast access to data. They create an organised structure (typically a B+ tree) for storing the locations of certain values within a table. When searching for data, if an index exists on that data, the index will be used for matching, and the locations of the matching records are used to quickly retrieve the data.

Read on to get an idea of what kinds of indexing techniques are useful in that space.
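For a point of reference on why those indexes matter, here’s the brute-force baseline they replace, sketched with NumPy (sizes are arbitrary):

```python
# Brute-force nearest-neighbor search: the scan a vector index avoids.
# Real vector indexes (IVF, HNSW) trade a little accuracy for far less work,
# much as a B+ tree avoids a full table scan.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128))  # 10k stored embeddings
query = rng.normal(size=128)

# Cosine similarity against every stored vector -- O(n * d) per query.
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top5 = np.argsort(scores)[-5:][::-1]
print(top5, scores[top5])
```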


Data Compression and CPU Utilization

Kendra Little shares some advice:

Every time I share a recommendation to use data compression in SQL Server to reduce physical IO and keep frequently accessed data pages in memory, I hear the same concern from multiple people: won’t this increase CPU usage for inserts, updates, and deletes?

DBAs have been trained to ask this question by many trainings and a lot of online content – I used to mention this as a tradeoff to think about myself – but I’ve found this is simply the wrong question to ask.

In this post I’ll share the two questions that are valuable to ask for your workload.

Kendra’s advice is very good, and to add my own two cents to the mix: the last place I was at did, in fact, see a reasonable reduction in CPU utilization by applying page-level compression to any index where it made sense, and this was a very busy OLTP environment. The exceptions would be indexes making prominent use of things like GUIDs or chunks of binary data, which don’t compress well at all. In all my FTE and consulting years, I’ve never run into a circumstance in which compression caused a significant increase in CPU utilization.
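The GUID point is easy to demonstrate outside the database: compression depends on repeated patterns, and random identifiers have almost none. A quick illustration in Python, with zlib standing in for the engine’s page compression (which uses prefix and dictionary encoding instead, but depends on repetition just the same):

```python
# Random identifiers (GUIDs) barely compress; repetitive values compress well.
import uuid
import zlib

guids = b"".join(uuid.uuid4().bytes for _ in range(2000))  # ~32 KB, random
statuses = b"ACTIVE|" * 4700                               # ~33 KB, repetitive

for name, blob in (("guids", guids), ("statuses", statuses)):
    ratio = len(zlib.compress(blob)) / len(blob)
    print(f"{name}: {len(blob):,} bytes -> {ratio:.0%} of original size")
```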
