Press "Enter" to skip to content

Day: October 3, 2022

The Basics of Automating Data Cleaning

Vincent Granville provides some guidance:

To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you cannot automate data cleaning. To decision makers and stakeholders, this problem is so remote that they don’t even know how many resources are wasted on it. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.

Yet, there are at most a few dozen issues that come with data cleaning. Not a few thousand, not a few hundred. You can catalog them and address all of them at once with a piece of code, one that you can reuse each time you face a new data set. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time and eliminate boring, repetitive tasks, making your data scientists happier.

Click through for Vincent’s thoughts and recommendations.
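Vincent’s post catalogs the issues themselves; purely as an illustration of the “reusable piece of code” idea in T-SQL terms (the table, columns, and rules here are hypothetical, not from his post), a standard cleaning pass might trim whitespace, convert empty strings to NULL, normalize casing, and drop exact duplicates before the data goes anywhere else:

    -- Hypothetical staging table; these rules are generic examples of a reusable
    -- cleaning step, not Vincent's catalog of issues.
    WITH Cleaned AS
    (
        SELECT
            CustomerName = NULLIF(LTRIM(RTRIM(CustomerName)), N''),      -- trim, blank-to-NULL
            Email        = LOWER(NULLIF(LTRIM(RTRIM(Email)), N'')),      -- normalize casing
            rn           = ROW_NUMBER() OVER (PARTITION BY LOWER(Email)
                                              ORDER BY LoadDate DESC)    -- keep the latest row per e-mail
        FROM dbo.StagingCustomers
    )
    SELECT CustomerName, Email
    INTO dbo.CleanCustomers
    FROM Cleaned
    WHERE rn = 1;

Because the rules live in one script rather than in each analyst’s head, the same pass can run against every new data set that shares the layout.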


Strategies for De-Functionizing Your T-SQL

Tom Zika continues the 12-step program for scalar function abusers:

In the previous posts, we have learned why Scalar Functions (UDFs) are bad for parallelism and performance and what the options are for their removal.

The only remaining question is where to start. You’ve probably guessed there isn’t one true way to approach this. So instead, I offer several strategies that you can combine.

Plus Tom has an offer for you at the bottom of the post: conversion from Scalar UDFs to Inline Table-Valued Functions, which tend to perform much better.
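To make the conversion concrete (the function, table, and column names here are invented for illustration, not taken from Tom’s series), a scalar UDF called once per row can usually be rewritten as an inline table-valued function and consumed with CROSS APPLY:

    -- Scalar UDF: invoked row by row and a blocker for parallel plans
    CREATE OR ALTER FUNCTION dbo.GetFullName (@FirstName nvarchar(50), @LastName nvarchar(50))
    RETURNS nvarchar(101)
    AS
    BEGIN
        RETURN CONCAT(@FirstName, N' ', @LastName);
    END;
    GO

    -- Inline table-valued function: a single RETURN (SELECT ...) the optimizer can fold into the outer query
    CREATE OR ALTER FUNCTION dbo.GetFullName_TVF (@FirstName nvarchar(50), @LastName nvarchar(50))
    RETURNS TABLE
    AS
    RETURN (SELECT FullName = CONCAT(@FirstName, N' ', @LastName));
    GO

    -- Before: scalar UDF in the SELECT list
    SELECT p.PersonID, dbo.GetFullName(p.FirstName, p.LastName) AS FullName
    FROM dbo.Person AS p;

    -- After: inline TVF via CROSS APPLY
    SELECT p.PersonID, fn.FullName
    FROM dbo.Person AS p
        CROSS APPLY dbo.GetFullName_TVF(p.FirstName, p.LastName) AS fn;

The rewritten query treats the function body as ordinary relational logic, which is where the performance gain comes from.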


Premium Azure SQL DB Performance

Reitse Eskens is moving on up:

The premium tier starts at 125 DTUs and goes up to 4000. DTUs are made up of a magic mix of CPU, memory, read IOPS and write IOPS. An IOP (Input/Output oPeration) should be a 4 KB (disk cluster size) read or write. 125 DTUs translates to 500 KB/sec, rising to 32,000 KB/sec at the top of the range. As we’re used to data pages, which are 8 KB in size, you could say these databases are able to pull 62 to 4,000 pages per second from disk. When there are simultaneous writes, you’ll share that performance. At least, that’s my interpretation of the IOP. For the DTU part, I’m still struggling to get a good grip on what it exactly is, beyond the magic mix.

It’s also a good idea to compare this to what the Standard tier has to offer. The general data patterns look similar with respect to elbows but the magnitudes are quite different, with Premium P1 starting out around Standard S4 in the test for insertion but more like S3 for selects.
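If you want to see which part of that magic mix a test is actually exhausting, one option (a generic sketch, not taken from Reitse’s post) is to sample sys.dm_db_resource_stats during the load; it reports CPU, data IO, and log write consumption as a percentage of the tier’s limit in 15-second slices:

    -- Recent resource use of the current Azure SQL Database, relative to its service tier limits
    -- (sys.dm_db_resource_stats retains roughly the last hour in 15-second intervals)
    SELECT TOP (20)
        end_time,
        avg_cpu_percent,
        avg_data_io_percent,
        avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC;

Whichever column pins near 100 percent during the test is the component of the DTU you’re really buying.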


Interacting with Microsoft Graph API via Synapse

Paul Hernandez starts a new series:

In this and the next post I want to show you how to connect to the Microsoft Graph API, request some data, process it and store it in a database using Synapse Analytics. 

This first post presents a sample use case, briefly introduces the Graph API, and shows how to create a linked service to it and how to start querying data. In the next post, a sample Synapse pipeline will be described. The pipeline grabs some data and copies it into some target tables. Finally, I will create a sample query to showcase the newly imported data.

Because it’s a point of potential confusion for people: the Graph API is completely different from the idea of graph databases.
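The sample query itself arrives in Paul’s second post; as a hedged placeholder for what it might look like (the table and columns below are hypothetical, not Paul’s), once the pipeline has landed Graph API results in a Synapse table, it is queried like any other table:

    -- Hypothetical target table populated from the Microsoft Graph /users endpoint
    SELECT TOP (50)
        DisplayName,
        Mail,
        JobTitle
    FROM dbo.GraphUsers
    WHERE Mail IS NOT NULL
    ORDER BY DisplayName;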


Creating Repeatable Test Data

Louis Davidson repeats himself:

In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it, one, controllable in size and, two, random, I created this random dataset. I loaded a set of values for accounts into a table, along with a set of interests. I then wanted to be able to load a set of random data into edges, relating one account to another (one follows the other) and then an account to a random set of interests.

In this article, I will discuss a few techniques I used, starting with the simplest method, ordering by NEWID(), and then using RAND() to allow the code to generate a fixed set of data.

There’s a lot of code needed to do it, but if it’s something you’ve got to do, that’s the cost of doing business.
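As a small sketch of the two techniques Louis names (the table is hypothetical, and this is not his code): ORDER BY NEWID() gives a different random sample on every run, while seeding RAND() once lets the same “random” sequence come back on every execution:

    -- Technique 1: random sampling; a different TOP (5) every time the query runs
    SELECT TOP (5) AccountId
    FROM dbo.Account
    ORDER BY NEWID();

    -- Technique 2: repeatable pseudo-random values by seeding RAND() once per batch
    DECLARE @Random table (RandomValue int);
    DECLARE @i int = 1;
    DECLARE @seedCall float = RAND(20221003);   -- seed once; later RAND() calls continue a fixed sequence

    WHILE @i <= 5
    BEGIN
        INSERT INTO @Random (RandomValue)
        VALUES (CAST(RAND() * 100 AS int));     -- deterministic, given the seed above
        SET @i += 1;
    END;

    SELECT RandomValue FROM @Random;

The seeded version is what makes the generated dataset reproducible from one run to the next.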


Troubleshooting High Non-SQL CPU Utilization

Ajay Dwivedi finds out it wasn’t the database:

SQL Server is not a cheap application and would cost a lot of money if we needed to scale it horizontally, so it is not a good practice to allow any other resource-consuming processes to run alongside SQL Server.

In this blog, we are going to troubleshoot a high CPU scenario where the CPU issue is caused by a non-SQL Server process. Let’s say I receive a call from the monitoring team because the application team is complaining about a “slow” SQL Server.

Ajay first uses SQLMonitor to perform this troubleshooting and then shows how to do this without SQLMonitor.
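One quick way to confirm that the pressure is coming from outside SQL Server (a generic check, not the specific steps from Ajay’s post) is the scheduler monitor ring buffer, which logs SQL Server’s own CPU use against everything else on the box roughly once a minute:

    -- CPU utilization as recorded by SQL Server's scheduler monitor (about one row per minute;
    -- [timestamp] is milliseconds since the server started)
    SELECT TOP (30)
        sql_cpu   = record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int'),
        idle_cpu  = record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int'),
        other_cpu = 100
                    - record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
                    - record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int'),
        [timestamp]
    FROM
    (
        SELECT [timestamp], CONVERT(xml, record) AS record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
          AND record LIKE N'%<SystemHealth>%'
    ) AS rb
    ORDER BY [timestamp] DESC;

A consistently high other_cpu alongside a modest sql_cpu points at a process outside SQL Server, which is exactly the scenario Ajay walks through.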
