Press "Enter" to skip to content

Month: August 2021

Working with Trees of Data in R

Martin Stingl shows off the data.tree package:

Lately I tried to visualize a hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based: each row had an id and a parent_id. Normally, hierarchies in Tableau are defined by pulling some fields together, such as product category, product group, and product id.

Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.

So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Read on to see how this works and how you can turn a classical data representation of a tree (ID and parent ID) into a flattened structure with a fixed number of levels. H/T R-Bloggers.
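
To give a flavor of the approach, here’s a minimal sketch in R (the table, the names, and the path-building helper are all made up for illustration): build a pathString for each row by walking up the parent chain, convert the frame to a data.tree node, then flatten it to one column per level.

    library(data.tree)

    # Toy data in the classic tree shape: each row has an id and a parent_id
    df <- data.frame(
      id        = 1:5,
      parent_id = c(NA, 1, 1, 2, 2),
      name      = c("root", "A", "B", "A1", "A2"),
      stringsAsFactors = FALSE
    )

    # Build a path string ("root/A/A1") for each row by walking up the parent chain
    path_for <- function(i) {
      parts <- character(0)
      while (!is.na(i)) {
        parts <- c(df$name[df$id == i], parts)
        i <- df$parent_id[df$id == i]
      }
      paste(parts, collapse = "/")
    }
    df$pathString <- vapply(df$id, path_for, character(1))

    # Convert to a tree, then flatten to one column per level (level_1, level_2, ...)
    tree <- as.Node(df)
    flat <- ToDataFrameTypeCol(tree)

The flattened frame with its fixed level_1, level_2, ... columns is the shape a classical Tableau hierarchy expects.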

Caching Function Results in an R Package

Maëlle Salmon and Christophe Dervieux show us ways to cache the results of function calls in R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing pure functions, i.e. those without side effects. If a function has side effects, caching might not give you what you expect.
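
As a hedged illustration of the concept (not necessarily the authors’ exact setup), the memoise package wraps a function so that repeated calls with the same input come back from a cache, and a cachem disk cache makes the results persist across sessions:

    library(memoise)

    slow_square <- function(x) {
      Sys.sleep(2)  # stand-in for an expensive computation
      x^2
    }

    # In-memory cache: only the first call with a given input actually runs
    fast_square <- memoise(slow_square)
    fast_square(4)  # slow: runs the function
    fast_square(4)  # fast: result retrieved from the cache

    # Disk-backed cache: results survive across R sessions
    disk_square <- memoise(slow_square, cache = cachem::cache_disk("./square-cache"))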

Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al., show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.
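
For readers new to Airflow: MWAA runs standard Apache Airflow, so the pipeline is expressed as a DAG of tasks with declared dependencies. Here’s a minimal sketch (the task names and bodies are illustrative placeholders, not the pipeline from the post):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder task bodies; a real pipeline would call out to data
    # processing jobs, XGBoost training, model evaluation, and so on.
    def preprocess():
        print("pull raw data, clean it, and write features")

    def train():
        print("train an XGBoost model on the prepared features")

    def evaluate():
        print("score the model and publish metrics")

    with DAG(
        dag_id="xgboost_pipeline",
        start_date=datetime(2021, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
        t_train = PythonOperator(task_id="train", python_callable=train)
        t_evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate)

        # Dependencies: preprocess, then train, then evaluate
        t_preprocess >> t_train >> t_evaluate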

Tools for Data Modeling

Rayis Imayev shares information on a few database modeling tools:

A picture is worth a thousand words; in the same way, a visual database schema is better than a database model communicated through a multitude of scripted text objects.

Someone may spend a significant amount of time describing all the database tables’ attributes, constraints, and relationships between tables in words, while a visual Entity Relationship Diagram (ERD) can tell the same story in a few minutes (or seconds).

There are many tools, some similar and some quite different, to help you create these visual ERD artifacts. The more visual the tool, the less keyboard work is required: drag-and-drop covers most of the design, and typing is only needed for the explicit lists of attributes, types, and so on for your data model entities.

Click through for reviews of several tools, ranging from free to expensive.

Analytic Window Functions in SQL Server

Aveek Das takes a look at a few analytic window functions available since SQL Server 2012:

Analytic functions were added to the SQL Server database engine with SQL Server 2012, so any version from 2012 onward can execute analytic queries. These functions calculate an aggregated value from the dataset, but based on a specific set of rows rather than the entire dataset. Compared to aggregate functions like SUM, COUNT, AVG, etc., which return scalar values, these functions can return multiple records based on the conditions. The most common examples of using these functions are finding moving averages, running totals, etc. SQL Server supports the following analytic functions.

1. CUME_DIST – Finds the cumulative distribution of a numerical column

2. FIRST_VALUE – Finds the first value of a column in the group and repeats it for each row

3. LAST_VALUE – Finds the last value of a column in the group and repeats it for each row

4. LAG – Reads a column’s value from the row a specified number of rows before the current row

5. LEAD – Reads a column’s value from the row a specified number of rows after the current row

Click through for examples of how each works.
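
In the meantime, here’s a quick sketch against a made-up temp table that exercises several of these functions in one query. One subtlety worth noting: LAST_VALUE needs an explicit frame to reach the true last row.

    -- Demo data (illustrative)
    CREATE TABLE #Sales (SaleDate date, Amount money);
    INSERT INTO #Sales VALUES
        ('2021-08-01', 100), ('2021-08-02', 150),
        ('2021-08-03', 90),  ('2021-08-04', 120);

    SELECT
        SaleDate,
        Amount,
        LAG(Amount, 1)  OVER (ORDER BY SaleDate) AS PriorAmount,   -- previous row
        LEAD(Amount, 1) OVER (ORDER BY SaleDate) AS NextAmount,    -- following row
        FIRST_VALUE(Amount) OVER (ORDER BY SaleDate) AS FirstAmount,
        LAST_VALUE(Amount) OVER (
            ORDER BY SaleDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS LastAmount,
        CUME_DIST() OVER (ORDER BY Amount) AS CumulativeDist
    FROM #Sales;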

Finding Duplicate Images with Filetable

Louis Davidson continues a discussion of SQL Server’s filetable feature:

In my article “Filetables, Great For Business, Great For Hobbyist”, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks.

As my inventory of prepped pictures has grown, one thing has become a major pain: duplicates. I have tens of thousands of pictures, and I scan them occasionally to add to my directories. Perhaps not surprisingly, the same pictures often get chosen a second time. I use a naming standard that integrates the files into my database, as well as for the copies that go out on a given day. So the second (and third and fourth) time I choose the same picture, it looks different, though it has the exact same bits.

There are some nice benefits to maintaining file metadata in a queryable format, though do read to the end to see the performance impact.
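
The core trick is that identical bits hash identically, so grouping on a hash of file_stream surfaces duplicates. A hedged sketch of the idea (the filetable name dbo.Pictures is invented; HASHBYTES over varbinary(max) needs SQL Server 2016+ and STRING_AGG needs 2017+):

    -- Group byte-identical files by a hash of their contents
    WITH Hashed AS (
        SELECT name,
               HASHBYTES('SHA2_256', file_stream) AS ContentHash
        FROM dbo.Pictures          -- hypothetical filetable
        WHERE is_directory = 0
    )
    SELECT ContentHash,
           COUNT(*) AS Copies,
           STRING_AGG(name, ', ') AS FileNames
    FROM Hashed
    GROUP BY ContentHash
    HAVING COUNT(*) > 1;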

The Costs and Benefits of Dirty Reads

Chad Callihan explains what a dirty read is and does a cost-benefit analysis on it:

When you are not careful with your transaction isolation levels or you get sneaky with the NOLOCK hint, one problem you can encounter is a dirty read. Let’s look at a short example to demonstrate a dirty read.

In a vacuum, I’m not necessarily opposed to the idea of dirty reads, because you can find legitimate cases in which they are useful. In practice, I’m generally very much opposed, for two reasons: first, Read Committed Snapshot Isolation eliminates the majority of those cases; and second, the misuse almost always runs in the direction of over-using NOLOCK hints.
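
For anyone who hasn’t seen one in action, here’s a minimal two-session sketch of a dirty read (the table is invented):

    -- Session 1: change a row but leave the transaction open
    BEGIN TRANSACTION;
    UPDATE dbo.Accounts            -- hypothetical table
    SET    Balance = Balance - 500
    WHERE  AccountId = 1;

    -- Session 2 (separate connection): NOLOCK reads the uncommitted change
    SELECT Balance
    FROM   dbo.Accounts WITH (NOLOCK)
    WHERE  AccountId = 1;          -- sees the reduced balance

    -- Session 1: roll back, which retroactively makes session 2's read "dirty"
    ROLLBACK TRANSACTION;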

Estimating Nonclustered Index Sizes with PowerShell

Alex Stuart performs some calculations:

I recently encountered a requirement to estimate the size of (a lot of) nonclustered indexes on some very large tables, due to not having a test box to create them on or the time to build one. I couldn’t find a script to do this, and as any programmer knows, laziness is the real mother of invention, so I wrote one. This post summarises how we can calculate this and provides a PowerShell function to do so.

I used Microsoft’s documentation as a basis for how to do it, but as it’s a wall of text that’s a little tricky to follow, I’ll go over the basics here. I’m only covering the leaf levels, and not MAX columns that would create LOB pages; I’ll explain why later.

Click through for the article and to see how Alex’s calculations play out.
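
In the same spirit, here’s a much-simplified sketch of the leaf-level arithmetic (fixed-width key columns only, rough null-bitmap math, and not Alex’s actual function):

    # Rough leaf-level estimate for a nonclustered index, following the shape
    # of Microsoft's estimation doc; simplifying assumptions noted inline.
    function Get-EstimatedIndexSize {
        param(
            [long]$RowCount,
            [int]$KeyBytes,            # summed width of the (fixed-width) key columns
            [int]$KeyColumnCount,
            [int]$RowLocatorBytes = 8  # clustered key or heap RID; varies in practice
        )

        $nullBitmap  = 2 + [math]::Floor(($KeyColumnCount + 7) / 8)
        $rowSize     = $KeyBytes + $RowLocatorBytes + $nullBitmap + 1   # +1 row header
        $rowsPerPage = [math]::Floor(8096 / ($rowSize + 2))             # 2-byte slot entry
        $leafPages   = [math]::Ceiling($RowCount / $rowsPerPage)

        [pscustomobject]@{
            RowsPerPage = $rowsPerPage
            LeafPages   = $leafPages
            EstimatedMB = [math]::Round($leafPages * 8 / 1024, 2)       # 8 KB pages
        }
    }

    # e.g. 50 million rows, three key columns totalling 12 bytes
    Get-EstimatedIndexSize -RowCount 50000000 -KeyBytes 12 -KeyColumnCount 3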

A Warning: VPCs and Distributed Database Platforms

Wade Trimmer takes us through a reason why you might not want to use VPC endpoints to separate applications from distributed database platforms:

AWS PrivateLink (also known as a VPC endpoint) is a technology that allows the user to securely access services using a private IP address. It is not recommended to configure an AWS PrivateLink connection with Apache Kafka or Apache Cassandra, mainly due to a single-entry-point problem. PrivateLink only exposes a single IP to the user and requires a load balancer between the user and the service. Realistically, the user would need an individual VPC endpoint per node, which is expensive and may not work.

Using PrivateLink, it is impossible by design to contact specific IPs within a VPC in the same way you can with VPC peering. VPC peering allows a connection between two VPCs, while PrivateLink publishes an endpoint that others can connect to from their own VPC.

Read on to understand how this affects platforms like Cassandra and Kafka.

Fun with Excel Cube Functions

Chris Webb has a new series on Excel cube functions combined with Lambda helper functions. First up, Chris collects items from a set:

This example shows the fundamental problem that has always existed with CubeRankedMember though: in order to show all the items in a set you need to know how many items there are in advance, and populate as many cells with CubeRankedMember formulas as there are items. In this case see how the range B4:B6 contains the numbers 1, 2 and 3; these numbers are used in the formulas in the range C4:C6 to get the first, second and third items in the set.

If a fourth product was added to the table, however, it would not appear automatically – you would have to add another cell with another CubeRankedMember formula in it manually. I’ve seen some workarounds but they’re a bit hacky and require you to know what the maximum possible number of items in a set could ever be. Indeed that’s always been one of the key differences between cube functions and PivotTables: cube functions are static whereas PivotTables can grow and shrink dynamically when the data changes.

Read on for a better answer. Then, check out part 2, which covers generating a dynamic table using Excel cube functions and Lambda helpers:

In the first post in this series I showed how to use the new Excel Lambda helper functions to return an array containing all the items in a set. That isn’t very useful on its own, so in this post I’ll show you how to generate an entire dynamic table using Excel cube functions and Lambda helper functions.

In this post I’ll be using the same source data as in my previous post: a table containing sales data with just two columns.
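
To give a flavor of the technique (this is a guess at the general shape, not necessarily Chris’s exact formula): with a CUBESET formula sitting in B2, MAKEARRAY plus LAMBDA can spill every member of the set without knowing the count in advance, because CUBESETCOUNT supplies it dynamically.

    =MAKEARRAY(
        CUBESETCOUNT($B$2), 1,
        LAMBDA(r, c, CUBERANKEDMEMBER("ThisWorkbookDataModel", $B$2, r))
    )

Because the result is a dynamic array, adding a fourth product to the source simply makes the spilled output one row longer, with no manual formula-copying required.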
