2024-09-25 – Curated SQL

Simple Outlier Detection and Removal in R

Published 2024-09-25 by Kevin Feasel

Outliers can significantly skew your data analysis results, leading to inaccurate conclusions. For R programmers, effectively identifying and removing outliers is crucial for maintaining data integrity. This guide will walk you through various methods to handle outliers in R, focusing on multiple columns, using a synthetic dataset for demonstration.

The techniques Steven uses are perfectly reasonable (though I like to use MAD from the median rather than standard deviations from the mean because MAD from the median doesn’t suffer from the sorts of endogeneity problems standard deviation does in a dynamic process). My primary warning would be to keep outliers in a dataset unless you know why you’re removing them. If you know the values were impossible or wrong—for example, a person who works 500 hours a week—that’s one thing. But sometimes, you get exceptional values out of an ordinary process, and those values are just as real as any other. I might have had a sequence in which I flipped a fair coin and it landed on heads 10 times in a row. It’s statistically very uncommon, but that doesn’t mean you can ignore it as a possibility or pretend it didn’t happen.
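To make the MAD-versus-standard-deviation point concrete, here is a minimal Python sketch. It is not Steven's code (his post works in R). The function names, the sample data, and the thresholds (3 standard deviations; 3.5 on the modified z-score) are assumptions chosen for illustration. It compares a mean-and-standard-deviation rule with a rule based on median absolute deviation (MAD) from the median, using a dataset that contains the 500-hour work week from the example above.

```python
from statistics import mean, median, stdev


def sd_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mu) / sd > threshold]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (MAD from the median) exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # at least half the values equal the median; the score is undefined
    # 0.6745 rescales MAD so the score is comparable to a z-score for normal data.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical weekly hours, including one impossible value.
weekly_hours = [38, 40, 41, 39, 42, 40, 37, 43, 40, 500]

print(sd_outliers(weekly_hours))   # the 500 inflates the standard deviation itself
print(mad_outliers(weekly_hours))
```

On this sample, the standard deviation rule flags nothing: the 500 drags the mean up to 86 and the standard deviation up to roughly 145, so its own z-score is only about 2.8. The MAD rule flags 500, because the median (40) and the MAD (1.5) barely move. Either way, the output is a list of candidates to investigate, not a list of rows to delete.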

Tracking Python Packages in Snowflake

Published 2024-09-25 by Kevin Feasel

Kevin Wilkie takes a peek:

When working with one of the many modern computer languages that use libraries, one of the many things to be aware of – as a developer – is the version of the libraries available for your usage.

Since there are multiple languages in Snowflake that use libraries, let’s go over how to check out the versions that come installed and how to install one yourself.

Read on for those answers. Well, one answer and one conundrum.

Data Ingestion with Microsoft Fabric Copy Jobs

Published 2024-09-25 by Kevin Feasel

Reitse Eskens spends a bunch of time at the copier:

The copy job is essentially an abstraction of a pipeline reading data from the source system and writing the data into either a Lakehouse or a Warehouse. It really is ingesting data and nothing else. In my opinion, that’s what copy data flows are meant to do, and they are very good at it too.

The big challenge we all keep facing is how to create incremental loads. We have to build some sort of metadata database where we keep the latest ID, date, or other column value we use to identify the increment. In our flow, we need to get that value, compare it against the source system, and get the differences. The biggest task is to find out whether records have been deleted.

With the Copy Job, a large part of this task is taken out of your hands. The Copy Job has a configuration GUI (or wizard) that helps you out quite quickly. So let’s not waste any more characters and dig in!

Read on to see how it works and its capabilities and limitations. The key question, as always, is whether your workload fits into the wheelhouse. If so, this sounds really useful. If not, it’s a proper struggle.
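For context on what the Copy Job abstracts away, here is a minimal sketch of the hand-built pattern the excerpt describes: a stored high-water mark, an incremental pull, and a separate check for deleted records. This is not how the Copy Job works internally. The table names (`orders`, `watermark`), the column names, and the use of in-memory SQLite in place of a real source system and Lakehouse or Warehouse are all assumptions for illustration.

```python
import sqlite3


def incremental_load(source, target):
    """Copy source rows above the stored watermark, then advance the watermark."""
    (last_id,) = target.execute(
        "SELECT last_id FROM watermark WHERE table_name = 'orders'"
    ).fetchone()
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    if rows:
        target.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        target.execute(
            "UPDATE watermark SET last_id = ? WHERE table_name = 'orders'",
            (rows[-1][0],),
        )
    target.commit()
    return len(rows)


def find_deleted_ids(source, target):
    """Return ids present in the target but no longer in the source."""
    source_ids = {row[0] for row in source.execute("SELECT id FROM orders")}
    target_ids = {row[0] for row in target.execute("SELECT id FROM orders")}
    return sorted(target_ids - source_ids)


# Hypothetical source system and destination, both in memory.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
target.execute("CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_id INTEGER)")
target.execute("INSERT INTO watermark VALUES ('orders', 0)")

source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
first_run = incremental_load(source, target)    # picks up ids 1 through 3

source.executemany("INSERT INTO orders VALUES (?, ?)", [(4, 40.0), (5, 50.0)])
second_run = incremental_load(source, target)   # picks up only ids 4 and 5

source.execute("DELETE FROM orders WHERE id = 2")
deleted = find_deleted_ids(source, target)      # the watermark alone cannot see this

print(first_run, second_run, deleted)
```

The watermark query never notices the deleted row, which is why deletion detection needs its own comparison. Here that comparison reads every key on both sides, which is the expensive part in a real system.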

The Limitations of TRY-CATCH in SQL Server

Published 2024-09-25 by Kevin Feasel

Brent Ozar tries to catch but lets it slip through his fingers:

If you’re using TRY/CATCH to do exception handling in T-SQL, you need to be aware that there are a lot of things it doesn’t catch. Here’s a quick example.

Let’s set up two tables – bookmarks, and a process log to track whether our stored proc is working or not:

Read on for the example.

Comparing Pandas to Koalas in Microsoft Fabric

Published 2024-09-25 by Kevin Feasel

Tomaz Kastrun does some testing:

Data engineering and even simple data wrangling functions in Fabric can make several tasks faster when you know which package (language) to choose. By comparing Python Pandas with PySpark Pandas (Koalas), we will see that there are huge performance gains when using the correct language.

Click through for the demo.

TLS 1.2 (or Later) in Azure SQL

Published 2024-09-25 by Kevin Feasel

Sakshi Gupta provides a public service announcement:

From November 1st, any Azure SQL server left with the “Select an option” or “NONE” setting (where “NONE” means no enforced minimum TLS version) will only allow connections using TLS 1.2 and TLS 1.3. Connections using TLS 1.0 or TLS 1.1 will be rejected. It is critical for all customers to configure their servers correctly and ensure that their client applications can operate with TLS 1.2 or higher.

Pretty much any SQL Server client or driver that Microsoft released from 2016 onward will support TLS 1.2, so for most organizations, this should be as simple as enabling the option in development and ensuring applications connect as expected.
