Press "Enter" to skip to content

Day: August 3, 2021

Working with Trees of Data in R

Martin Stingl shows off the data.tree package:

Lately I tried to visualize a hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally, hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.

Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.

So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Read on to see how this works and how you can turn a classical data representation of a tree (ID and parent ID) into a flattened structure with a fixed number of levels. H/T R-Bloggers.
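The post works in R with data.tree, but if you would rather preprocess in the database, the same flattening idea translates to T-SQL: a recursive CTE walks an id/parent_id table and emits the depth and full path for each node. This is only a sketch of the concept, not the article's approach, and the table and column names below are hypothetical.

-- Hypothetical dbo.Nodes table with the classic id / parent_id shape described above.
WITH TreeWalk AS
(
    -- Anchor: the root rows (no parent).
    SELECT id, parent_id, name,
           1 AS lvl,
           CAST(name AS varchar(4000)) AS full_path
    FROM dbo.Nodes
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive step: attach each child to its parent's path.
    SELECT n.id, n.parent_id, n.name,
           t.lvl + 1,
           CAST(t.full_path + ' > ' + n.name AS varchar(4000))
    FROM dbo.Nodes AS n
        INNER JOIN TreeWalk AS t
            ON n.parent_id = t.id
)
SELECT id, name, lvl, full_path
FROM TreeWalk
ORDER BY full_path;

From lvl and full_path you can then pivot out a fixed number of level columns, which is essentially the flattened structure the article builds with data.tree.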


Caching Function Results in an R Package

Maelle Salmon and Christophe Dervieux show us ways to cache results of function calls using R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing pure functions, that is, functions without side effects. If your function has side effects, caching might not give you what you expect.
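The article covers R's tooling, but the underlying pattern is language-agnostic. As a loose T-SQL analogue only (all object names hypothetical): store the output of an expensive, deterministic calculation in a table keyed on its input, so repeated calls with the same input read the stored result instead of recomputing, and the cache persists across sessions because it lives on disk.

-- Hypothetical cache table and procedure; the calculation is a stand-in for real work.
CREATE TABLE dbo.ResultCache
(
    InputValue  int       NOT NULL PRIMARY KEY,
    OutputValue float     NOT NULL,
    CachedAt    datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
CREATE PROCEDURE dbo.GetExpensiveResult
    @Input int
AS
BEGIN
    SET NOCOUNT ON;

    -- Only compute on a cache miss; later calls with the same input are served from the table.
    IF NOT EXISTS (SELECT 1 FROM dbo.ResultCache WHERE InputValue = @Input)
    BEGIN
        INSERT dbo.ResultCache (InputValue, OutputValue)
        VALUES (@Input, SQRT(@Input) * 42.0);   -- stand-in for the expensive calculation
    END;

    SELECT OutputValue
    FROM dbo.ResultCache
    WHERE InputValue = @Input;
END;

As with the R packages discussed, this only behaves well when the computation is deterministic and free of side effects.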


Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al., show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.


Tools for Data Modeling

Rayis Imayev shares information on a few database modeling tools:

A picture is worth a thousand words; in the same way, a visual database schema is better than a database model communicated through a multitude of data scripting text objects.

Someone may spend a significant amount of time trying to describe all the database tables’ attributes, constraints, and relationships in words, while a visual Entity Relationship Diagram (ERD) may take only a few minutes (or seconds) to tell the same story.

There are various tools available to help you create these visual ERD artifacts. The more visually capable a tool is, the less effort is required to add new information by typing. A visual drag-and-drop experience does most of the work; your keyboard-typing skills are still necessary only for entering the explicit list of attributes, types, and other details of your data model entities.

Click through for reviews of several tools, ranging from free to expensive.


Analytic Window Functions in SQL Server

Aveek Das takes a look at a few analytic window functions available since SQL Server 2012:

Analytic functions were added to the SQL Server database engine with SQL Server 2012, so any version from 2012 onward can execute analytic queries. These functions calculate an aggregated value from the dataset, but based on a specific set of rows instead of the entire dataset. Compared to aggregate functions like SUM, COUNT, and AVG, which return a single value per group, these functions return a value for each row based on the window you define. The most common examples of using these functions are finding moving averages, running totals, and the like. SQL Server's analytic functions include the following.

1. CUME_DIST – Finds the cumulative distribution of a value within its group of rows

2. FIRST_VALUE – Finds the first value of a column within the group and returns it for each row

3. LAST_VALUE – Finds the last value of a column within the group and returns it for each row

4. LAG – Reads the value from a row a specified number of rows before the current row

5. LEAD – Reads the value from a row a specified number of rows after the current row

Click through for examples of how each works.
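As a quick, self-contained illustration of the syntax before you do (the sample table and values below are made up), here is how all five functions look over a small dataset. Note that LAST_VALUE needs an explicit frame to return the true last value of the window.

-- Hypothetical sample data.
CREATE TABLE #MonthlySales (SalesMonth date, Region varchar(10), Amount money);
INSERT INTO #MonthlySales VALUES
    ('2021-01-01', 'East', 100), ('2021-02-01', 'East', 150), ('2021-03-01', 'East', 120),
    ('2021-01-01', 'West', 200), ('2021-02-01', 'West', 180), ('2021-03-01', 'West', 220);

SELECT
    Region,
    SalesMonth,
    Amount,
    CUME_DIST()         OVER (PARTITION BY Region ORDER BY Amount)     AS CumeDist,
    FIRST_VALUE(Amount) OVER (PARTITION BY Region ORDER BY SalesMonth) AS FirstAmount,
    LAST_VALUE(Amount)  OVER (PARTITION BY Region ORDER BY SalesMonth
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING)        AS LastAmount,
    LAG(Amount, 1)      OVER (PARTITION BY Region ORDER BY SalesMonth) AS PriorMonthAmount,
    LEAD(Amount, 1)     OVER (PARTITION BY Region ORDER BY SalesMonth) AS NextMonthAmount
FROM #MonthlySales
ORDER BY Region, SalesMonth;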


Finding Duplicate Images with Filetable

Louis Davidson continues a discussion of SQL Server’s filetable feature:

In my article “Filetables, Great For Business, Great For Hobbyist”, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks.

As my inventory of prepped pictures has grown, one thing has become a major pain. Duplicates. I have tens of thousands of pictures, and I scan them occasionally to add to my directories. Perhaps not surprisingly, the same pictures often get chosen a second time. I use a naming standard that integrates the files into my database, as well as for the copies to go out for a given day. So the second (and third and fourth) time I choose the same picture, it looks different, though it has the exact same bits.

There are some nice benefits to maintaining file metadata in a queryable format, though do read to the end to see the performance impact.
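As one illustration of the kind of query a filetable enables (not necessarily the approach Louis takes in the article), you can hash the stored bytes to find byte-identical files regardless of name. The filetable name below is hypothetical; hashing values larger than 8,000 bytes with HASHBYTES requires SQL Server 2016 or later.

-- Every filetable exposes the file bytes (file_stream), the file name, and is_directory.
SELECT HASHBYTES('SHA2_256', file_stream) AS FileHash,
       COUNT(*)  AS CopyCount,
       MIN(name) AS ExampleName
FROM dbo.PictureFiles          -- hypothetical filetable
WHERE is_directory = 0
GROUP BY HASHBYTES('SHA2_256', file_stream)
HAVING COUNT(*) > 1;           -- only hashes that appear more than once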


The Costs and Benefits of Dirty Reads

Chad Callihan explains what a dirty read is and does a cost-benefit analysis on it:

When you are not careful with your transaction isolation levels or you get sneaky with the NOLOCK hint, one problem you can encounter is a dirty read. Let’s look at a short example to demonstrate a dirty read.

In a vacuum, I’m not necessarily opposed to the idea of dirty reads because you can find legitimate cases in which they are useful. In practice, I’m generally very much opposed, for two reasons: first, Read Committed Snapshot Isolation eliminates the majority of those legitimate cases; and second, the misuse almost always takes the form of overusing NOLOCK hints.
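For reference, the classic two-session demonstration looks roughly like this; the table and values are hypothetical, and this is a sketch rather than Chad's exact example.

-- Session 1: change a row but leave the transaction open.
BEGIN TRANSACTION;
UPDATE dbo.Accounts
SET Balance = Balance - 500
WHERE AccountId = 1;
-- ...do not commit yet...

-- Session 2: NOLOCK (i.e. READ UNCOMMITTED) sees the in-flight value.
SELECT Balance
FROM dbo.Accounts WITH (NOLOCK)
WHERE AccountId = 1;   -- this is the dirty read

-- Session 1: roll the change back; session 2 has already read data
-- that, officially, never existed.
ROLLBACK TRANSACTION;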


Estimating Nonclustered Index Sizes with PowerShell

Alex Stuart performs some calculations:

I recently encountered a requirement to estimate the size of (a lot of) nonclustered indexes on some very large tables due to not having a test box to create them on or the time to build one. I couldn’t find a script to do this, and as any programmer knows, laziness is the real mother of invention, so I wrote one.
This post summarises how we can calculate this and provides a PowerShell function to do so.

I used Microsoft’s documentation as the basis for how to do it, but as it’s a wall of text that’s a little tricky to follow, I’ll go over the basics here. I’m only covering the leaf levels and non-MAX columns (MAX columns would create LOB pages); I’ll explain why later.

Click through for the article and to see how Alex’s calculations play out.
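For a flavor of the arithmetic before you click through: at the leaf level, the estimate comes down to how many index rows fit on an 8 KB page. The numbers below are made up, and this very rough sketch ignores the null bitmap and row-header details that Microsoft's article (and Alex's function) handle properly.

DECLARE @NumRows     bigint = 50000000;  -- rows in the table
DECLARE @BytesPerRow int    = 23;        -- key + included column sizes plus per-row overhead
DECLARE @RowsPerPage int    = 8096 / (@BytesPerRow + 2);          -- 8,096 usable bytes per page; 2-byte slot array entry per row
DECLARE @LeafPages   bigint = CEILING(1.0 * @NumRows / @RowsPerPage);

SELECT @LeafPages              AS EstimatedLeafPages,
       @LeafPages * 8 / 1024.0 AS EstimatedLeafMB;   -- 8 KB per page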
