Press "Enter" to skip to content

Month: September 2024

Simple Outlier Detection and Removal in R

Steven Sanderson looks for oddities:

Outliers can significantly skew your data analysis results, leading to inaccurate conclusions. For R programmers, effectively identifying and removing outliers is crucial for maintaining data integrity. This guide will walk you through various methods to handle outliers in R, focusing on multiple columns, using a synthetic dataset for demonstration.

The techniques Steven uses are perfectly reasonable (though I like to use MAD from the median rather than standard deviations from the mean because MAD from the median doesn’t suffer from the sorts of endogeneity problems standard deviation does in a dynamic process). My primary warning would be to keep outliers in a dataset unless you know why you’re removing them. If you know the values were impossible or wrong—for example, a person who works 500 hours a week—that’s one thing. But sometimes, you get exceptional values out of an ordinary process, and those values are just as real as any other. I might have had a sequence in which I flipped a fair coin and it landed on heads 10 times in a row. It’s statistically very uncommon, but that doesn’t mean you can ignore it as a possibility or pretend it didn’t happen.

Comments closed

Tracking Python Packages in Snowflake

Kevin Wilkie takes a peek:

When working with one of the many modern computer languages that use libraries, one of the many things to be aware of – as a developer – is the version of the libraries available for your usage.

Since there are multiple languages in Snowflake that use libraries, let’s go over how to check out the versions that come installed and how to install one yourself.

Read on for those answers. Well, one answer and one conundrum.

Comments closed

Data Ingestion with Microsoft Fabric Copy Jobs

Reitse Eskens spends a bunch of time at the copier:

The copy job is essentially an abstraction of a pipeline reading data from the source system and writing the data into either a Lakehouse or a Warehouse. It really is ingesting data and nothing else. In my opinion that what copy data flows are meant to do and are very good at too.

The big challenge we all keep facing is how to create incremental loads. We have to build some sort of metadata database where we keep the latest ID, data or other column we use to discern the increment on. In our flow, we need to get that value, compare it against the source system and get the differences. The biggest task is to find out if records are deleted.

With the Copy Job, a large part of this task is taken out of your hands. The Copy Job has a configuration GUI (or wizard) that helps you out quite quickly. So let’s not waste anymore characters and dig in!

Read on to see how it works and its capabilities and limitations. The key question, as always, is whether your workload fits into the wheelhouse. If so, this sounds really useful. If not, it’s a proper struggle.

Comments closed

TLS 1.2 (or Later) in Azure SQL

Sakshi Gupta provides a public service announcement:

From November 1st, any Azure SQL server left with the “Select an option” or “NONE” setting (where “NONE” means no enforced minimum TLS version) will only allow connections using TLS 1.2 and TLS 1.3. Connections using TLS 1.0 or TLS 1.1 will be rejected. It is critical for all customers to configure their servers correctly and ensure that their client applications can operate with TLS 1.2 or higher.

Pretty much any SQL Server client or driver that Microsoft released from 2016 forwards will support TLS 1.2, so for most organizations, this should be as simple as enabling the option in development and ensuring applications connect as expected.

Comments closed

Vector Search in Oracle

Brendan Tierney shows a few queries:

It can be very straightforward to use Vectors using SQL. It’s just a simple SQL query, with some additional Vector related requirements. The examples given below are a collection of some simple examples. These aren’t my examples, but they come from either documentation or from other examples people have come up with. I’ve tried to include references back to the original sources for these, and if I’ve missed any or referred to the wrong people, just let me know and I’ll correct the links.

In my next post on Vector Databases, I’ll explore a slightly more complex data set. I’ll use the Wine dataset used in a previous post and Vector Search to see if I can find a suitable wine. Some years ago, I had posts and presentations on machine learning to recommend wine. Using Vector Search should give us better recommendations (hopefully)!

Click through for those queries.

Comments closed

Tips for Navigating the Support Ticket Process

Kendra Little shares several tips:

Asking Microsoft for support for SQL Server or Azure SQL is a lousy experience these days. This is true whether you are using a cheaper service tier or the more expensive support tier formerly known as “Premiere Support.” Either way, I’ve found the same issues: as the person requesting support, I must know a whole lot about the root cause of my problem and how to solve it, or my request will be dismissed with misinformation. I need to have data and metrics that back up my claims in order to get the ticket escalated to someone who can help, and I will need to provide those receipts three or four times. Once something is escalated to the Product Group, I may get a helpful response, but it will generally take a while. If I’m not engaged directly with the Product Group and the answer is being relayed through a lower support tier, it often won’t make much sense.

These issues don’t happen due to bad work ethics or personal failings of support workers. These are good humans, who are trying their best! The problem is worse, because it’s systemic.

Kendra’s specific advice is around Microsoft and the Azure SQL family of products (SQL Server, Azure SQL DB, Azure SQL Managed Instance) but the advice is sound for much more than that. This advice will help you out when dealing with the support organization for pretty much any large company.

Comments closed

Creating Custom Visuals in Power BI with DAX

Kurt Buhler shows one way to customize Power BI visuals:

When creating reports in Power BI, you regularly come across scenarios where you want to show data in a certain way that is not straightforward in the standard “core visuals”. Sometimes, the visualization needs to reflect particularities about an organization’s business processes or its data. Other times, you simply want a more creative design to visualize the data in the most effective way. In this article, we discuss what to do in these scenarios, describing one method where you can make custom visualizations by using a single DAX measure, and the caveats and limitations of this particular approach.

In this article, our objective is to produce a matrix visual with the Timeline SVG, like in the following diagram.

Read on for several methods to handle when Power BI doesn’t do something out of the box, and a focus on creating a custom visual via DAX measure with an SVG. But do read the whole thing, as Kurt explains why this isn’t necessarily a great method.

Comments closed

Tracking Procedure Execution Paths with sp_statement_completed

Grant Fritchey shows off an interesting use case for Extended Events:

Another discussion in Gothenburg (such a great group of people) around Extended Events lead to a (admittedly, tiny) idea for an additional use for sp_statement_completed.

The basics for sp_statement_completed are pretty straight forward. If you want to capture a procedure’s behavior, you use rpc_completed. If you want to know about the individual statements within the procedure, you use sp_statement_completed. Now, I don’t recommend turning this on across the board or without some good filtering in place because, it’s likely to generate quite a bit of data. However, it can be useful, including in the following manner.

Click through for a silly but clarifying example from Grant.

Comments closed