Press "Enter" to skip to content

Reducing the Cost of Delete Operations in SQL Server

Ben Johnston eats the elephant:

One of the first things you learn when working with SQL Server, and other transaction-based SQL systems, is that set-based operations perform best. If you are querying data, a cursor pulling individual rows doesn’t perform as well as a single query. It doesn’t matter if that cursor is on the client side or the server side. A set-based operation is more efficient, runs faster, locks less, and is generally better than submitting multiple queries.

This is also generally true with delete statements. This post covers the exceptions to that rule. Large delete statements impacting many rows and large amounts of data (millions of rows and many gigs of data) can actually have decreased performance. With transactional systems, such as SQL Server, each transaction follows the ACID standard. Part of that standard ensures that transactional statements either complete or roll back fully – partial transactions are not allowed. For a delete statement, that means that all of the rows specified by the delete are removed from the table, or none are removed and the data rolls back to the original state. The delete and rollback behavior must be predictable and consistent or the data could be left in a contaminated, unreliable state. Performing very large deletes can present some challenges and needs to be treated differently in production systems.

Read on for the reasoning behind this, as well as several techniques you can use and how they compare.
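To give a rough idea of the batching approach (this is a minimal sketch, not necessarily Ben's exact technique; the dbo.EventLog table and the cutoff date are made up for illustration):

```sql
-- Delete in modest batches instead of one enormous statement.
-- dbo.EventLog and EventDate are hypothetical; adjust to your own schema.
DECLARE @BatchSize int = 50000;
DECLARE @RowsAffected int = 1;

WHILE @RowsAffected > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.EventLog
    WHERE EventDate < '2020-01-01';

    SET @RowsAffected = @@ROWCOUNT;
END;
```

Each batch commits as its own transaction, so locks are held briefly and the transaction log gets a chance to clear between iterations instead of holding one multi-million-row rollback.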

Using the Cake Dataset

Rasmus Baath bakes a cake:

Now that I’ve got my hands on the source of the cake dataset I knew I had to attempt to bake the cake too. Here, the emphasis is on attempt, as there’s no way I would be able to actually replicate the elaborate and cake-scientifically rigorous recipe that Cook followed in her thesis. Skipping things like beating the eggs exactly “125 strokes with a rotary beater” or wrapping the grated chocolate “in waxed paper, while white wrapping paper was used for the other ingredients”, here’s my version of Cook’s Recipe C, the highest rated cake recipe in the thesis:

Click through for the ingredients and instructions, as well as Rasmus’s results in the test that counts the most: the taste test. H/T R-Bloggers.

Finding the Cake Dataset’s Original Source

Rasmus Baath has done a good deed for all:

In statistics, there are a number of classic datasets that pop up in examples, tutorials, etc. There’s the infamous iris dataset (just type iris in your nearest R prompt), the Palmer penguins (the modern iris replacement), the titanic dataset(s) (I hope you’re not a guy in 3rd class!), etc. While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938)”. For me, this raised a lot of questions: Why measure the breakage angle of chocolate cakes? Why was this data collected? And what were the recipes?

Read on as Rasmus unravels the mysteries of the cake dataset with the help of several others. H/T R-Bloggers.

Embracing the Boring Part of Data Governance

Nikki Kelly shares some thoughts on data governance:

Data Governance – you have heard the term a million times and not once has it driven excitement into your heart. I’d like to spend the next few minutes changing that.

Data Governance is formally defined as “… a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.”

Boring.

Nikki makes a great point that the process may feel boring but the net results are critical.

The Importance of Data Retention Policies

Ed Pollack shares some great advice:

It is always an afterthought. New objects are created that start off small and current. New feature development takes over and the recently architected data structures become old news. Over time, data grows and suddenly a previously small table contains millions or billions of rows.

Is all that data necessary? How long should it be retained for? If there is no answer to this question, then the actuality may be “Forever”, or more honestly “No one knows for sure.”

Retention takes on many forms and this article dives into ways in which data can be managed over time to ensure that it is fast, accurate, and readily available.

We don’t tend to think about data retention in the development phase, but it’s an important consideration and thinking about it up-front might save you disk space headaches later.
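As a minimal sketch of what a retention policy can look like in practice (the table, column, and 90-day window here are all hypothetical, not from Ed's article):

```sql
-- Purge rows older than the agreed retention window.
-- dbo.AuditLog, CreatedDate, and the 90-day window are placeholder assumptions.
DECLARE @RetentionDays int = 90;

DELETE FROM dbo.AuditLog
WHERE CreatedDate < DATEADD(DAY, -@RetentionDays, SYSUTCDATETIME());
```

On a table of any real size, you'd pair this with batched deletes or partitioning so the cleanup job doesn't become the very problem it's meant to prevent.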

Sample Data in Azure ML Designer

Tom LaRock shows us where the hidden data is:

Recently I was working inside of Azure ML Studio and wanted to browse the sample datasets provided. Except I could not find them. I *knew* they existed, having used them previously, but could not remember if that was in the original ML Studio (classic) or not.

After some trial and error, I found them and decided to write this post in case anyone else is wondering where to find the sample datasets. You’re welcome, future Tom!

Click through to see where those sample datasets are. And yeah, they don’t get updated that frequently. And that’s probably a good thing, as it means when you run the demo two years after someone created it, you’ll still get predictable results.

A Call for Quality

Kurt Buhler sounds the clarion call:

We have a quality problem, and it’s getting worse. It creates higher costs, hurts our productivity, and threatens our capability to achieve success. The problem: too often, we prioritize quicker results and newer features over lasting quality and consistency in the data and analytics solutions that we deliver. Too often, we don’t collect the right requirements, we don’t test, we don’t automate, and we rely on hope and heroism to save the day. The result: we’re besieged by issues, fighting constant battles against an avoidable enemy that we ourselves created.

This is a long article with a lot of depth to it. I think the topic is well worth thinking about, though it’s quite a challenge.

GPS Data in PostGIS

Ryan Lambert clues us in:

One of the key elements to using PostGIS is having spatial data to work with! Lucky for us, one big difference today compared to the not-so-distant past is that essentially everyone is carrying a GPS unit with them nearly everywhere. This makes it easy to create your own GPS data that you can then load into PostGIS! This post explores some basics of loading GPS data to PostGIS and cleaning it for use. It turns out, GPS data from nearly any GPS-enabled device comes with some… character. Getting from the raw input to usable spatial data takes a bit of effort.

This post starts with using ogr2ogr to load the .gpx data to PostGIS. Once the data is in PostGIS, we actually want to do something with it. Before the data is completely usable, though, we should spend some time cleaning it. Technically you can start querying the data right away; however, I have found there is always data cleanup and processing involved first to make the data truly useful.

Click through for an example of how it all fits together.
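As a rough sketch of the workflow (the connection string, file name, schema, and cleanup logic below are assumptions for illustration, not Ryan's exact steps):

```sql
-- Load step, run from a shell rather than SQL: the GDAL GPX driver exposes layers
-- such as track_points and tracks.
--   ogr2ogr -f PostgreSQL PG:"dbname=gis" my_ride.gpx track_points -nln gps.track_points
--
-- A first cleanup pass: ignore points without a timestamp and build an ordered
-- line per track. Column names match the GPX driver's defaults, but verify
-- against your own load.
CREATE TABLE gps.track_lines AS
SELECT track_fid,
       ST_MakeLine(wkb_geometry ORDER BY "time") AS geom
FROM gps.track_points
WHERE "time" IS NOT NULL
GROUP BY track_fid;
```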

Random Number Generation in T-SQL

Andy Yun has a method:

This is a quick blog to “document” a T-SQL technique for generating random numbers. I’ve been using this for years, but don’t use it frequently enough to have it fully memorized. So whenever I do need it, I constantly have to go look it up.

Click through for Andy’s method. This will generate random numbers based on a uniform distribution: the likelihood of getting any value in the range is equal. If you want to build out some data that approximates a normal distribution, I have a blog post for that.
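If you want the general shape of this kind of technique, a common pattern (not necessarily Andy's exact method) combines NEWID() with CHECKSUM(), since RAND() without a seed is evaluated once per statement rather than once per row:

```sql
-- A common pattern for a uniform random integer per row; @Min/@Max are arbitrary.
DECLARE @Min int = 1, @Max int = 100;

SELECT TOP (10)
       ABS(CHECKSUM(NEWID())) % (@Max - @Min + 1) + @Min AS RandomValue
FROM sys.objects;  -- any convenient table, used here only to produce rows
```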

Operating the Data Wrangler in Microsoft Fabric Notebooks

Gilbert Quevauvilliers rustles up some data:

In this blog post I am going to show you an easy way to clean your data (which often means fixing data issues or mis-spelt data) using the new Launch Data Wrangler feature with DataFrames.

I had previously blogged about using Pandas data frames, but this required extra steps and details. If you are interested in that blog post, you can find it here: Did you know that there is an easy way to shape your data in Fabric Notebooks using Data Wrangler?

In this blog post I am going to show you how I cleaned up the data in my location column.

Read on for a demonstration of what you can do.
