Category: Data

String Regularization and Tokenization in SQL Server

Published 2023-10-04 by Kevin Feasel

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)
It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.

Comments closed

Data Temperature in Microsoft Fabric

Published 2023-09-29 by Kevin Feasel

Marc Lelijveld breaks out the thermometer:

As part of Microsoft Fabric, a new storage mode to connect from Power BI to data in OneLake has been introduced. Direct Lake makes it possible to use your data from OneLake in Power BI without taking an additional copy of the data. Where Direct Lake promises to deliver the performance of Import mode with the real-time capabilities of DirectQuery, it is time to take a closer look at how data gets loaded into memory and delve into the concept of data dictionary temperature.
In this blog I will explain when data gets loaded into memory, elaborate on how you can measure the dictionary temperature of your data and the effect of queries on the temperature.

Click through to see what affects this measure and how.


Built-In R Datasets

Published 2023-08-29 by Kevin Feasel

Adrian Tam continues a series on getting started in R:

The ecosystem in R contains not only the function libraries to help you perform statistical analysis but also the data library that gives you some famous datasets to test out your program. There are a lot of built-in datasets in R. In this post, you will:
Learn some of the built-in datasets
Know how to use these datasets
Let’s get started.

Most of these built-in sets are fairly small and able to help you illustrate a specific point.


Loading OpenStreetMap Data in Postgres

Published 2023-08-25 by Kevin Feasel

Ryan Lambert gets just the right amount of data:

Populating a PostGIS database with OpenStreetMap data is a favorite way to start a new geospatial project. Loading a region of OpenStreetMap data gives you data ranging from roads, buildings, water features, and amenities to so much more! The breadth and bulk of data is great, but it can turn into a hindrance, especially for projects focused on smaller regions. This post explores how to use PgOSM Flex with custom layersets, multiple schemas, and osmium. The goal is to load limited data for a larger region, while loading detailed data for a smaller, target region.
The larger region for this post will be the Colorado extract from Geofabrik. The smaller region will be the Fort Collins area, extracted from the Colorado file. The following image shows the data loaded in this post with two maps side-by-side. The minimal data loaded for all of Colorado is shown on the left and the full details of Fort Collins are on the right.

Click through for more details on these two examples.


Projecting Gas Bills with Excel

Published 2023-08-21 by Kevin Feasel

Jiri D. performs an estimate:

With rising prices of utilities like gas and power, and winter still in progress, it might be good to check your usage from time to time. With electricity and TOU meters, it is easy: the retailer does that for you, and you can check your consumption and projected bill at any time with a granularity of 30-minute intervals (sometimes even less).
With gas (and water) it is trickier: those are read manually every 2 – 3 months, so you may be in for a surprise when the bill arrives.

I had a co-worker who tracked this stuff in meticulous detail over a period of several years, to the point where he knew exactly how much propane to buy in August (when prices are lowest) to get through winter with minimum waste.


Managing Database Test Data

Published 2023-07-18 by Kevin Feasel

Phil Factor maintains some tests:

When learning about relational databases, we all tend to use ‘toy’ databases such as Pubs, AdventureWorks, NorthWind, or ClassicModels. This is fine, but it is too easy to assume that one can then do real-world database development in the same way. You have your database full of data and just cut code that you then test. From a distance, it all seems so easy.
In fact, rapid and effective database development usually requires a much more active approach to data. You need to work out how to test your work as you go, and to test continuously. For that, you need appropriate data with the right characteristics, in a suitable quantity. You also need to plan how to ensure that, when you make changes to the database, or even minor changes to its settings, all business processes continue to work correctly. In Agile terms you need a test-first methodology, a fast feedback loop, and iterative development. You should never cut some SQL code and only then think to yourself “I wonder how I’ll be able to test this?”.

This is something I’ve historically been pretty lazy about, to my detriment. Phil does an outstanding job of making the case for why generating and working with your own test data (versus live data) is important, as well as categorizing the purposes of this test data and the types of tests you’ll want to have.


Generating Random Datasets with R

Published 2023-07-14 by Kevin Feasel

Tomaz Kastrun has been busy:

datasetR helps you generate random datasets for your R project. It provides a preset random list of values with different data types (interval, ordinal, nominal values). It also includes a function for imputing NULL, NA, or missing values.
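
To make the idea concrete, here is a minimal sketch in Python rather than R. It does not mirror the datasetR API: the function names `make_dataset` and `inject_missing`, the column names, and the value lists are all hypothetical, and the sketch assumes that the missing-value step means scattering missing markers through otherwise complete data.

```python
import random


def make_dataset(n_rows, seed=None):
    """Build n_rows of random data with one interval, one ordinal, and one nominal column.

    Hypothetical helper for illustration; not part of datasetR.
    """
    rng = random.Random(seed)
    sizes = ["small", "medium", "large"]  # ordinal: the order carries meaning
    colors = ["red", "green", "blue"]     # nominal: labels with no order
    return [
        {
            "temperature": round(rng.uniform(-10.0, 35.0), 1),  # interval value
            "size": rng.choice(sizes),
            "color": rng.choice(colors),
        }
        for _ in range(n_rows)
    ]


def inject_missing(rows, fraction, seed=None):
    """Return a copy of rows where each value becomes None with probability `fraction`.

    Assumption: None stands in for R's NA here.
    """
    rng = random.Random(seed)
    return [
        {key: (None if rng.random() < fraction else value) for key, value in row.items()}
        for row in rows
    ]


if __name__ == "__main__":
    rows = make_dataset(5, seed=42)
    for row in inject_missing(rows, fraction=0.2, seed=42):
        print(row)
```

Passing a seed keeps the output repeatable, which is handy when the generated data feeds a test.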

Click through to see how it works, the types of random data you can generate, and more.


Data Governance and Microsoft Fabric

Published 2023-05-31 by Kevin Feasel

Matthew Roche digs deeper into data governance in Microsoft Fabric:

One of the most underappreciated benefits of Power BI as a managed SaaS data platform has been the “managed” part. When you create a report, dataset, dataflow, or other item in Power BI, the Power BI service knows everything about it. Power BI is the authoritative system for all items it contains, which means that Power BI can answer questions related to lineage (where does the data used by this report come from?) and impact analysis (where is the data in this dataset used?) and compliance (who has permissions to access this report?) and more.
If you’ve ever tried to authoritatively answer questions like these for a system of any non-trivial scope, you know how hard it is. Power BI has made this information increasingly available to administrators, through logs and APIs, and the community has built a wide range of free and paid solutions to help admins turn this information into insights and action. Even more excitingly, Power BI keeps getting better and better even as the newer parts of Fabric seem to be getting all of the attention.


Paper Review: Moving Fast with Broken Data

Published 2023-05-02 by Kevin Feasel

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. ~~Worst~~ Best decision of my life.


Postgres GIS Calculations by SRID

Published 2023-04-17 by Kevin Feasel

Ryan Lambert talks accuracy:

A common use case with PostGIS data is to calculate things, such as distances between points, lengths of lines, and the area of polygons. The topic of accuracy, or inaccuracy, with GEOMETRY data comes up often. The most frequent offenders are generic SRIDs such as 3857 and 4326. In some projects accuracy is paramount. Non-negotiable. On the other hand, plenty of projects do not need accurate calculations. Those projects often rely on relationships between calculations, not the actual values of the calculations themselves. If Coffee shop Y is 4 times farther away than Coffee shop Z, I’ll often go to Coffee shop Z just based on that.
In most cases, users should still understand how significant the errors are. This post explores one approach to determine how accurate (or not!) the calculations of a given SRID are in a particular region, based on latitude (North/South). The queries used in this post can be adjusted for your specific area.
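
As a rough illustration of why latitude matters, the sketch below compares a planar distance in Web Mercator (SRID 3857) against a great-circle distance for a short east-west segment. It is plain Python rather than the PostGIS queries from the post, it assumes a spherical Earth for the reference distance, and the function names, the default longitude, and the segment length are all made up for the example.

```python
import math

WGS84_SEMI_MAJOR_M = 6378137.0   # sphere radius used by the Web Mercator projection
MEAN_EARTH_RADIUS_M = 6371008.8  # assumption: spherical Earth for the reference distance


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * MEAN_EARTH_RADIUS_M * math.asin(math.sqrt(a))


def to_web_mercator(lon, lat):
    """Project lon/lat in degrees to Web Mercator x/y in meters."""
    x = WGS84_SEMI_MAJOR_M * math.radians(lon)
    y = WGS84_SEMI_MAJOR_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def planar_3857_m(lon1, lat1, lon2, lat2):
    """Straight-line distance between the two points after projecting to Web Mercator."""
    x1, y1 = to_web_mercator(lon1, lat1)
    x2, y2 = to_web_mercator(lon2, lat2)
    return math.hypot(x2 - x1, y2 - y1)


def overstatement_ratio(lat, lon=-105.0, dlon=0.01):
    """Ratio of planar to great-circle length for a short east-west segment at `lat`."""
    planar = planar_3857_m(lon, lat, lon + dlon, lat)
    reference = haversine_m(lon, lat, lon + dlon, lat)
    return planar / reference


if __name__ == "__main__":
    for lat in (0, 20, 40, 60):
        print(f"latitude {lat:>2}: planar / great-circle = {overstatement_ratio(lat):.3f}")
```

The ratio grows roughly as 1 / cos(latitude), so the planar figure is close to the reference at the equator and about double it at 60 degrees. Ratios between distances measured at similar latitudes stay meaningful, which is the coffee shop comparison in the excerpt.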

Click through to see how much the choice of SRID can impact your results.
