Category: Data

Loading OpenStreetMap Data in Postgres

Published 2023-08-25 by Kevin Feasel

Ryan Lambert gets just the right amount of data:

Populating a PostGIS database with OpenStreetMap data is a favorite way to start a new geospatial project. Loading a region of OpenStreetMap data gives you data covering roads, buildings, water features, amenities, and so much more! The breadth and bulk of data is great, but it can turn into a hindrance, especially for projects focused on smaller regions. This post explores how to use PgOSM Flex with custom layersets, multiple schemas, and osmium. The goal is to load limited data for a larger region, while loading detailed data for a smaller, target region.

The larger region for this post will be the Colorado extract from Geofabrik. The smaller region will be the Fort Collins area, extracted from the Colorado file. The following image shows the data loaded in this post with two maps side-by-side. The minimal data loaded for all of Colorado is shown on the left and the full details of Fort Collins are on the right.

Click through for more details on these two examples.

Projecting Gas Bills with Excel

Published 2023-08-21 by Kevin Feasel

Jiri D. performs an estimate:

With rising prices of utilities like gas and power, and winter still in progress, it might be good to check your usage from time to time. With electricity and TOU meters, it is easy: the retailer does that for you, and you can check your consumption and projected bill at any time with a granularity of 30-minute intervals (sometimes even less).
With gas (and water) it is trickier: those are read manually every 2–3 months, so you may be in for a surprise when the bill arrives.

I had a co-worker who tracked this stuff in meticulous detail over a period of several years, to the point where he knew exactly how much propane to buy in August (when prices are lowest) to get through winter with minimum waste.
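
The projection described in the excerpt boils down to simple arithmetic: take two meter readings, derive an average daily usage, and extrapolate over the billing period. Here is a minimal Python sketch of that idea. It is not Jiri's Excel workbook, and the function name, the readings, the per-unit rate, and the daily supply charge are all hypothetical values chosen for illustration.

```python
from datetime import date


def project_bill(first_reading, first_date, latest_reading, latest_date,
                 period_end, rate_per_unit, daily_charge=0.0):
    """Linearly project a bill from two manual meter readings.

    The parameter names and the tariff structure (usage rate plus a flat
    daily charge) are illustrative assumptions, not a real retailer's plan.
    """
    days_observed = (latest_date - first_date).days
    if days_observed <= 0:
        raise ValueError("latest_date must be after first_date")

    # Average usage per day over the observed window.
    daily_usage = (latest_reading - first_reading) / days_observed

    # Assume the same average usage holds for the whole billing period.
    days_in_period = (period_end - first_date).days
    projected_usage = daily_usage * days_in_period

    return projected_usage * rate_per_unit + days_in_period * daily_charge


# Hypothetical numbers: 150 units used in 30 days, 90-day billing period.
estimate = project_bill(
    first_reading=1000, first_date=date(2023, 6, 1),
    latest_reading=1150, latest_date=date(2023, 7, 1),
    period_end=date(2023, 8, 30),
    rate_per_unit=0.5, daily_charge=1.0,
)
print(round(estimate, 2))
```

A straight-line projection like this will undershoot if the remaining weeks are colder than the observed ones, which is one reason to take readings more often than the retailer does.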

Managing Database Test Data

Published 2023-07-18 by Kevin Feasel

Phil Factor maintains some tests:

When learning about relational databases, we all tend to use ‘toy’ databases such as Pubs, AdventureWorks, NorthWind, or ClassicModels. This is fine, but it is too easy to assume that one can then do real-world database development in the same way. You have your database full of data and just cut code that you then test. From a distance, it all seems so easy.

In fact, rapid and effective database development usually requires a much more active approach to data. You need to work out how to test your work as you go, and to test continuously. For that, you need appropriate data with the right characteristics, in a suitable quantity. You also need to plan how to ensure that, when you make changes to the database, or even minor changes to its settings, all business processes continue to work correctly. In Agile terms you need a test-first methodology, fast feedback loop, and iterative development. You should never cut some SQL code and only then think to yourself “I wonder how I’ll be able to test this?”

This is something I’ve historically been pretty lazy about, to my detriment. Phil does an outstanding job of making the case for why generating and working with your own test data (versus live data) is important, as well as categorizing the purposes of this test data and the types of tests you’ll want to have.

Generating Random Datasets with R

Published 2023-07-14 by Kevin Feasel

Tomaz Kastrun has been busy:

datasetR helps you generate random datasets for your R project. It provides a preset random list of values with different data types (interval, ordinal, nominal values). It also includes a function for imputing NULL, NA, or missing values.

Click through to see how it works, the types of random data you can generate, and more.

Data Governance and Microsoft Fabric

Published 2023-05-31 by Kevin Feasel

Matthew Roche digs deeper into data governance in Microsoft Fabric:

One of the most underappreciated benefits of Power BI as a managed SaaS data platform has been the “managed” part. When you create a report, dataset, dataflow, or other item in Power BI, the Power BI service knows everything about it. Power BI is the authoritative system for all items it contains, which means that Power BI can answer questions related to lineage (where does the data used by this report come from?) and impact analysis (where is the data in this dataset used?) and compliance (who has permissions to access this report?) and more.

If you’ve ever tried to authoritatively answer questions like these for a system of any non-trivial scope, you know how hard it is. Power BI has made this information increasingly available to administrators, through logs and APIs, and the community has built a wide range of free and paid solutions to help admins turn this information into insights and action. Even more excitingly, Power BI keeps getting better and better even as the newer parts of Fabric seem to be getting all of the attention.

Paper Review: Moving Fast with Broken Data

Published 2023-05-02 by Kevin Feasel

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. ~~Worst~~ Best decision of my life.

Postgres GIS Calculations by SRID

Published 2023-04-17 by Kevin Feasel

Ryan Lambert talks accuracy:

A common use case with PostGIS data is to calculate things, such as distances between points, lengths of lines, and the area of polygons. The topic of accuracy, or inaccuracy, with GEOMETRY data comes up often. The most frequent offenders are generic SRIDs such as 3857 and 4326. In some projects accuracy is paramount. Non-negotiable. On the other hand, plenty of projects do not need accurate calculations. Those projects often rely on relationships between calculations, not the actual values of the calculations themselves. If Coffee shop Y is 4 times further away than Coffee shop Z, I’ll often go to Coffee shop Z just based on that.

In most cases, users should still understand how significant the errors are. This post explores one approach to determine how accurate (or not!) the calculations of a given SRID are in a particular region, based on latitude (North/South). The queries used in this post can be adjusted for your specific area.

Click through to see how much the choice of SRID can impact your results.
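
To get a feel for why latitude matters, here is a rough Python sketch (not the PostGIS queries from Ryan's post) that compares the length of a short east-west line in spherical Mercator coordinates, the projection behind SRID 3857, with the great-circle distance between the same endpoints. It assumes a spherical Earth with a radius of 6,378,137 m, and the function names are my own.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius assumed for spherical (Web) Mercator


def mercator_xy(lon_deg, lat_deg):
    """Project lon/lat in degrees to spherical Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres on the same sphere."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mercator_inflation(lat_deg, dlon_deg=0.01):
    """Ratio of projected length to great-circle length for a short east-west line."""
    x1, y1 = mercator_xy(0.0, lat_deg)
    x2, y2 = mercator_xy(dlon_deg, lat_deg)
    projected = math.hypot(x2 - x1, y2 - y1)
    actual = haversine_m(0.0, lat_deg, dlon_deg, lat_deg)
    return projected / actual


for lat in (0, 20, 40, 60):
    print(lat, round(mercator_inflation(lat), 3))
```

The ratio tracks 1 / cos(latitude): no inflation at the equator, roughly 1.3 around 40°N, and 2 at 60°N. This covers only the spherical Mercator effect; SRID 4326 has a different problem, since its units are degrees, and real PostGIS results also depend on the ellipsoid.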

Migration Tool for Cosmos DB

Published 2023-04-12 by Kevin Feasel

Hasan Savran reviews a tool:

Migrating databases to a new database server is a big job. You need to have the right tools to make this process easy for everyone. Cosmos DB had a Data Migration Tool to move data from a bunch of sources. It was a great, free tool with an easy user interface. You didn’t have to be a developer to use this tool. Here is a screenshot of what it used to look like.

Unfortunately, the tool got old and did not get updated to keep up with SDK changes. It also worked only in Windows environments.

Click through for Hasan’s thoughts. I had a behind-the-scenes look at everything getting put together, in that I was supposed to help, got busy, and slinked away as really sharp people like Carey Payette and John Bowen drove the project to completion.

Building Sets with GENERATE_SERIES

Published 2023-04-04 by Kevin Feasel

Aaron Bertrand is a fan of this new function:

I have come across a lot of use cases for manufacturing rows on the fly, aside from the common goal of populating a large data set such as a numbers or calendar table. A few favorites include building sample data, pivoting an unknown number of columns, data extrapolation, and filling gaps in date or time ranges.

If you are on SQL Server 2022 or Azure SQL Database, or have been reading up on new features, you’ve likely heard about one of the better T-SQL enhancements: a new built-in function called GENERATE_SERIES. The syntax is straightforward – it accepts arguments for start and stop, and an optional argument to indicate step (in case you want to iterate by more than 1, or backwards).

Click through to see how performance for this compares to two methods we’ve used in the past to generate similar results.
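
For readers who have not used the function yet, the start/stop/step behavior can be mimicked in a few lines of Python. This is only an analogue of the semantics as I understand them from the documentation (inclusive stop, a default step of 1 or -1 depending on direction, and an empty result when the step points away from stop); it is not T-SQL and says nothing about performance.

```python
def generate_series(start, stop, step=None):
    """Yield start, start + step, ... without going past stop (stop is inclusive).

    A Python stand-in for illustration; the defaulting rule for step is an
    assumption based on my reading of the SQL Server documentation.
    """
    if step is None:
        step = 1 if start <= stop else -1
    if step == 0:
        raise ValueError("step must not be zero")

    value = start
    # Ascending series stop once value exceeds stop; descending ones once it drops below.
    while (step > 0 and value <= stop) or (step < 0 and value >= stop):
        yield value
        value += step


print(list(generate_series(1, 5)))       # counting up by the default step
print(list(generate_series(10, 0, -5)))  # counting backwards
print(list(generate_series(1, 10, 4)))   # stop is not always reached exactly
```

The last call shows the case worth remembering: when the step does not land on stop, the series simply ends at the last value that does not pass it.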

Analyzing SQL Server Table Metadata

Published 2023-03-23 by Kevin Feasel

Barney Lawrence looks at details about a table:

For a while I’ve been building up a script that uses SQL Server’s INFORMATION_SCHEMA.COLUMNS view to create a query that applies aggregates to each column in a table and then formats the results to give an at-a-glance summary of the shape of the data inside it.

I showed some of the techniques used to build this in a session regarding metadata and data warehousing which I co-presented with Emma Dolling and Ruth Pearson at SQLBits. The one comment that I heard the most after this session was “I want that script”. I did promise to make it available more than a few times and so I’ve put together a GitHub repository to hold it and other resources we built for the session.

Barney anticipated the most likely change I’d make while reviewing the script: APPROX_COUNT_DISTINCT() if you’re on SQL Server 2022 and dealing with a large table.
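
The general technique, turning column metadata into one wide aggregate query, can be sketched in Python as follows. This is not Barney's script, which is written in T-SQL and does considerably more; the function names, the list of orderable types, and the example table are all hypothetical.

```python
# Data types for which this sketch emits MIN/MAX (an assumption, not exhaustive).
ORDERABLE_TYPES = {"int", "bigint", "decimal", "date", "datetime2", "varchar", "nvarchar"}


def quote_name(name):
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def build_profile_query(schema, table, columns):
    """Build one SELECT that aggregates every column of a table.

    columns is a list of (COLUMN_NAME, DATA_TYPE) pairs, such as the rows
    INFORMATION_SCHEMA.COLUMNS returns for the table.
    """
    parts = ["COUNT(*) AS [row_count]"]
    for name, data_type in columns:
        col = quote_name(name)
        parts.append(f"COUNT({col}) AS {quote_name(name + '_non_null')}")
        parts.append(f"COUNT(DISTINCT {col}) AS {quote_name(name + '_distinct')}")
        # Skip MIN/MAX for types outside the assumed orderable list (for example bit).
        if data_type.lower() in ORDERABLE_TYPES:
            parts.append(f"MIN({col}) AS {quote_name(name + '_min')}")
            parts.append(f"MAX({col}) AS {quote_name(name + '_max')}")
    select_list = ",\n    ".join(parts)
    return f"SELECT\n    {select_list}\nFROM {quote_name(schema)}.{quote_name(table)};"


# Hypothetical table and columns, for illustration only.
sql = build_profile_query("dbo", "Orders", [("OrderID", "int"), ("IsRush", "bit")])
print(sql)
```

Swapping `COUNT(DISTINCT ...)` for `APPROX_COUNT_DISTINCT(...)` in the template is the one-line change mentioned above for large tables on SQL Server 2022.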