Data – Page 5 – Curated SQL

Finding the Cake Dataset’s Original Source

Published 2024-01-29 by Kevin Feasel

Rasmus Baath has done a good deed for all:

In statistics, there are a number of classic datasets that pop up in examples, tutorials, etc. There’s the infamous iris dataset (just type iris in your nearest R prompt), the Palmer penguins (the modern iris replacement), the titanic dataset(s) (I hope you’re not a guy in 3rd class!), etc. While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938)¹”. For me, this raised a lot of questions: Why measure the breakage angle of chocolate cakes? Why was this data collected? And what were the recipes?

Read on as Rasmus unravels the mysteries of the cake dataset with the help of several others. H/T R-Bloggers.

Comments closed

Embracing the Boring Part of Data Governance

Published 2024-01-26 by Kevin Feasel

Nikki Kelly shares some thoughts on data governance:

Data Governance – you have heard the term a million times and not once has it driven excitement in to your heart. I’d like to spend the next few minutes changing that.

Data Governance is formally defined as “… a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.”

Boring.

Nikki makes a great point that the process may feel boring but the net results are critical.

Comments closed

The Importance of Data Retention Policies

Published 2024-01-22 by Kevin Feasel

Ed Pollack shares some great advice:

It is always an afterthought. New objects are created that start off small and current. New feature development takes over and the recently architected data structures become old news. Over time, data grows and suddenly a previously small table contains millions or billions of rows.

Is all that data necessary? How long should it be retained for? If there is no answer to this question, then the actuality may be “Forever”, or more honestly “No one knows for sure.”

Retention takes on many forms and this article dives into ways in which data can be managed over time to ensure that it is fast, accurate, and readily available.

We don’t tend to think about data retention in the development phase, but it’s an important consideration and thinking about it up-front might save you disk space headaches later.

Comments closed

Sample Data in Azure ML Designer

Published 2024-01-15 by Kevin Feasel

Tom LaRock shows us where the hidden data is:

Recently I was working inside of Azure ML Studio and wanted to browse the sample datasets provided. Except I could not find them. I *knew* they existed, having used them previously, but could not remember if that was in the original ML Studio (classic) or not.

After some trial and error, I found them and decided to write this post in case anyone else is wondering where to find the sample datasets. You’re welcome, future Tom!

Click through to see where those sample datasets are. And yeah, they don’t get updated that frequently. And that’s probably a good thing, as it means when you run the demo two years after someone created it, you’ll still get predictable results.

Comments closed

A Call for Quality

Published 2024-01-04 by Kevin Feasel

Kurt Buhler sounds the clarion call:

We have a quality problem, and it’s getting worse. It creates higher costs, hurts our productivity, and threatens our capability to achieve success. The problem: too often, we prioritize quicker results and newer features over lasting quality and consistency in the data and analytics solutions that we deliver. Too often, we don’t collect the right requirements, we don’t test, we don’t automate, and we rely on hope and heroism to save the day. The result: we’re besieged by issues, fighting constant battles against an avoidable enemy that we ourselves created.

This is a long article with a lot of depth to it. I think the topic is well worth thinking about, though it’s quite a challenge.

Comments closed

GPS Data in PostGIS

Published 2023-12-19 by Kevin Feasel

Ryan Lambert clues us in:

One of the key elements to using PostGIS is having spatial data to work with! Lucky for us, one big difference today compared to the not-so-distant past is that essentially everyone is carrying a GPS unit with them nearly everywhere. This makes it easy to create your own GPS data that you can then load into PostGIS! This post explores some basics of loading GPS data to PostGIS and cleaning it for use. It turns out, GPS data fr om nearly any GPS-enabled device comes with some… character. Getting from the raw input to usable spatial data takes a bit of effort.

This post starts with using ogr2ogr to load the .gpx data to PostGIS. Once the data is in PostGIS then we actually want to do something with it. Before the data is completely usable, we should spend some time cleaning the data first. Technically you can start querying the data right away, however, I have found there is always data cleanup and processing involved first to make the data truly useful.

Click through for an example of how it all fits together.

Comments closed

Random Number Generation in T-SQL

Published 2023-12-11 by Kevin Feasel

Andy Yun has a method:

This is a quick blog to “document” a T-SQL technique for generating random numbers. I’ve been using this for years, but don’t use it frequently enough to have it fully memorized. So whenever I do need it, I must constantly have to go look up whenever I need to use it.

Click through for Andy’s method. This will generate random numbers based on a uniform distribution: the likelihood of getting any value in the range is equal. If you want to build out some data that approximates a normal distribution, I have a blog post for that.

Comments closed

Operating the Data Wrangler in Microsoft Fabric Notebooks

Published 2023-12-01 by Kevin Feasel

Gilbert Quevauvilliers rustles up some data:

In this blog post I am going to show you an easy way to clean your data (which is often fixing data issues or mis-spelt data) using the new feature Launch Data Wranger using DataFrames

I had previously blogged about using Pandas data frames but this required extra steps and details, if you are interested in that blog post you can find it here: Did you know that there is an easy way to shape your data in Fabric Notebooks using Data Wrangler?

In this blog post I am going to show you how I cleaned up the data in my location column.

Read on for a demonstration of what you can do.

Comments closed

Exploring a Dataset for Microsoft Fabric Suitability

Published 2023-11-29 by Kevin Feasel

Eugene Meidinger continues a series on learning Microsoft Fabric:

This is week 1 where I try to take Magic the Gathering draft data to learn Microsoft Fabric. Check out week 0 for some reasoning why.

So, before I do anything else, I want to get a sense of the data I’m looking at to see if it’s suitable for this project. I download the data, and because it’s gzipped, I use 7-zip to open it up on windows 10, or Windows explorer on Windows 11. In either case, the first thing I notice is the huge size disparity. When compressed, it is a quarter of a gigabyte. Uncompressed, it’s about 10 GB. This tells us something.

Read on to learn more about the dataset and how Eugene tackled some of the exploratory data analysis.

I also agree completely with Eugene’s point about serendipity. Keeping your metaphorical eyes open will increase the likelihood that you’ll just happen upon something that can help you later, or something that serves a need you didn’t know you had. I used to wander around the library back in my university days because I didn’t know what I didn’t know about topics (that is, the “unknown unknown” quadrant), so I’d just pick up some books that caught my eye. Not all of them are hits, though enough were to make the strategy worthwhile.

Comments closed

SQL Server Data Import and Export via File

Published 2023-10-24 by Kevin Feasel

Ed Pollack opens an import-export business:

For the purposes of this article, we will focus solely on the task of moving a data set from one server to another. Topics such as ETL, ELT, data warehousing, data lakes, etc…are important and relevant to data movement, but out of scope for a focused discussion such as this.

Ed touches on why you might want to use files and then shares his recommendations for generating files from SQL Server data as well as importing data from flat files into SQL Server.

Comments closed

Category: Data