
November 27, 2023

Local Regression (LOESS) in R

Steven Sanderson takes us through a powerful regression technique:

LOESS, which stands for LOcal regrESSion, is a versatile and powerful technique for fitting a curve to a set of data points. Unlike traditional linear regression, LOESS adapts to the local behavior of the data, making it perfect for capturing intricate patterns in noisy datasets.

Click through for examples. LOESS works best with quadratic data, like in Steven's final example image. The downside of the technique is that it can show spurious movement that looks interesting but is merely following the noise.


Data Warehouse ETL Patterns

Ben Johnston starts a new series:

No matter the ETL tool used, there are some basic patterns to follow when transferring data between systems. There are many data tools and platforms, but the basic patterns remain the same. This focuses on SQL Server, but most of these methods work in any data platform. Even if you are using a virtualization layer, you likely need to prepare the data before exposing it to that engine, which means ETL and data transfers.

Warehouse here is used very loosely to mean a data warehouse, but the same process applies to other systems. This includes virtualization layers and, to a smaller degree, bulk transfers between transactional systems.

Read on for a few things Ben recommends you have in place before beginning the project, as well as several warehouse loading patterns.
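
To make the idea concrete, here is a minimal sketch of one of the most common warehouse loading patterns: an incremental upsert from a staging table into a dimension. The table and column names are hypothetical, and Ben's series covers several variants of this pattern.

-- Hypothetical incremental load: staging table -> warehouse dimension
-- Update rows whose attributes changed, keyed on the business key
UPDATE dim
SET    dim.CustomerName = stg.CustomerName,
       dim.UpdatedDate  = SYSUTCDATETIME()
FROM   dbo.DimCustomer AS dim
       INNER JOIN stage.Customer AS stg
           ON stg.CustomerKey = dim.CustomerKey
WHERE  dim.CustomerName <> stg.CustomerName;

-- Insert rows that do not exist in the dimension yet
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName, UpdatedDate)
SELECT stg.CustomerKey, stg.CustomerName, SYSUTCDATETIME()
FROM   stage.Customer AS stg
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimCustomer AS dim
                   WHERE  dim.CustomerKey = stg.CustomerKey);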


Tracking Inaccessible Azure SQL DB Databases and Customer Key Cycling

Rod Edwards is watching:

This is the first follow-up post from Azure SQL TDE and Customer Keys (BYOK). Microsoft?…your name isn’t down, so you aren’t coming in. (sqlrod.com), which explained how to use Customer Keys with Azure SQL DB (and Managed Instance), and some of the dangerous pitfalls that you can face. We need to know when there may be trouble on the horizon, so key (pun fully intended) to this is monitoring.

Yes, I’m rambling on again about monitoring…but I like monitoring.

Be seeing you, Number Six.
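
For a taste of the kind of check involved: when a customer-managed TDE key becomes unavailable, the affected database stops being online, and you can poll for that. A minimal sketch, run against the logical server's master database (the exact state text a revoked key produces is worth verifying in your own environment):

-- Databases whose TDE protector is unavailable will not report ONLINE
SELECT name, state_desc
FROM   sys.databases
WHERE  state_desc <> N'ONLINE';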


Self-Join Optimizations and Index Intersection

Daniel Hutmacher shows off a possibility:

This blog post started as a “what if” contemplation in my head: Suppose you have a reasonably large table with a clustered index and a number of non-clustered indexes. If your WHERE clause filters by multiple columns covered by those non-clustered indexes, could it potentially be faster to rewrite that WHERE clause to use those non-clustered indexes?

The answer might surprise you.

To be honest, when I first learned about the concept, I expected that there would be a great number of use cases for it. But SQL Server rarely comes up with this answer on its own, and I think that’s because in most scenarios, we’d need to do additional sorting or other expensive operations to get the multiple indexes aligned just right to make this the faster option.
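
For a flavor of the rewrite in question: instead of one index seek plus a residual filter, you can join the table to itself on the clustered key so that each non-clustered index serves exactly one predicate. A minimal sketch with hypothetical names, assuming non-clustered indexes on ColA and ColB and a clustered primary key on Id:

-- Typical form: the optimizer usually seeks one index and filters the rest
SELECT t.Id
FROM   dbo.BigTable AS t
WHERE  t.ColA = @a
  AND  t.ColB = @b;

-- Self-join form: one index seek per predicate, intersected on the clustered key
SELECT a.Id
FROM   dbo.BigTable AS a
       INNER JOIN dbo.BigTable AS b
           ON b.Id = a.Id
WHERE  a.ColA = @a
  AND  b.ColB = @b;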


Methods for Authenticating to PostgreSQL

Samab Tariq builds a list:

Authentication is the process of verifying the identity of a user or system attempting to access a database. In the realm of PostgreSQL, authentication serves as the first line of defense, ensuring that only authorized individuals or applications gain entry. As the gateway to sensitive data, robust authentication is imperative, safeguarding against unauthorized access and fortifying the foundation of data protection. In this blog, we delve into the significance of authentication in PostgreSQL, unraveling its critical role in securing valuable information.

PostgreSQL supports various authentication methods to secure access to its database. The exact methods available may depend on the version of PostgreSQL you are using. In this blog, we have mentioned a few of the most-used authentication methods in PostgreSQL.

Read on for the listing and some ideas on how to use the various options.
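
For reference, most of these methods get wired up in pg_hba.conf, one rule per line, with PostgreSQL applying the first matching entry. A small illustrative sample (the addresses, database names, and method choices here are hypothetical):

# TYPE  DATABASE  USER      ADDRESS        METHOD
local   all       postgres                 peer
host    all       all       127.0.0.1/32   scram-sha-256
host    sales     app_user  10.0.0.0/8     md5
hostssl all       all       0.0.0.0/0      cert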


Scheduling Fabric Capacity Pause/Resume with Azure Logic Apps

Soheil Bakhshi doesn’t want to forget to turn off the power at night:

In the previous blog post, I explained Microsoft Fabric capacities, shedding light on diverse capacity options and how they influence data projects. We delved into Capacity Units (CUs), pricing nuances, and practical cost control methods, including manually scaling and pausing Fabric capacity. Now, we’re taking the next step in our Microsoft Fabric journey by exploring the possibility of automating the pause and resume process. In this blog post, we’ll unlock the secrets to seamlessly managing your Fabric Capacity with automation that helps us save time and resources while optimising the usage of data and analytics workloads.

Right off the bat, this is a rather long blog, so I added a bonus section at the end for those who are reading from the beginning to the end. With that, let’s dive in!

To spoil the bonus a little bit, Soheil shows us not only how to turn things on and off on a schedule, but also how to ignore certain days of the week. Read the whole thing to get that.
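
Under the covers, pausing and resuming are just Azure Resource Manager calls that a Logic App's recurrence trigger can invoke on a schedule. Roughly, the two requests look like the following; the provider path and API version are my assumption based on the usual ARM pattern, so verify them against the current Microsoft.Fabric REST reference:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Fabric/capacities/{capacityName}/suspend?api-version=2022-07-01-preview

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Fabric/capacities/{capacityName}/resume?api-version=2022-07-01-preview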


Combining Window Functions and GROUP BY

Andy Brownsword aggregates some data:

We revisited window functions last week for T-SQL Tuesday. As we’re in that area, there’s another example I thought was worth exploring. Can we group data whilst applying window functions in the same query?

Andy comes up with a final query that works perfectly fine, but there’s actually an easier answer in terms of code readability: the DISTINCT operator.

SELECT DISTINCT
    FinancialQuarter,
    QuarterAvg = AVG(SalesValue) OVER (PARTITION BY FinancialQuarter),
    YearAvg = AVG(SalesValue) OVER (PARTITION BY FinancialYear)
FROM #MonthlySales;

Every row within a financial quarter produces the same FinancialQuarter, QuarterAvg, and YearAvg values, so the window functions give us the average sales value over the financial quarter and then over the financial year. Running DISTINCT removes the “duplicate” rows and we get the same results.

That said, the execution plan for this is a little more complex, as we have to go through a lazy spool on two separate occasions rather than the one that Andy’s solution comes up with. For sufficiently large datasets, that could make a difference, so as usual, choose the option that works better for your situation.
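
For comparison, the grouped-window pattern Andy's post explores looks roughly like the following. Window functions run logically after GROUP BY, so they have to operate over aggregate expressions. This is a sketch, not necessarily Andy's exact query, and the year average as written is only correct when every quarter contributes the same number of rows:

SELECT FinancialQuarter,
       QuarterAvg = AVG(SalesValue),
       -- Average of the quarterly averages; equals the yearly average of
       -- SalesValue only when each quarter has the same number of rows
       YearAvg = AVG(AVG(SalesValue)) OVER (PARTITION BY FinancialYear)
FROM   #MonthlySales
GROUP BY FinancialYear, FinancialQuarter;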
