Press "Enter" to skip to content

Author: Kevin Feasel

Finding Bad (Worse?) NOLOCK Statements across Instances

Aaron Bertrand powers up for about six episodes straight, but the results are amazing:

In Part 1 of this series, I showed how to use a Visitor pattern to walk through one or more T-SQL statements to identify a problematic pattern where NOLOCK hints are applied to the target of an update or delete. The method in my initial examples was very manual, though, and won’t scale if this problem might be widespread. We need to be able to automate collecting a potentially large number of statements across an entire environment, and then try to eliminate false positives without manual intervention.

Read on to see how you can take what Aaron wrote last time and make it scalable.
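
For reference, the pattern Aaron is hunting looks something like the sketch below. The table and column names are invented for illustration; the important shape is the NOLOCK hint on the table the statement modifies.

-- Hypothetical example of the anti-pattern: a NOLOCK hint applied to the
-- table an UPDATE is modifying. Table and column names are made up.
UPDATE o
SET    o.Status = 'Shipped'
FROM   dbo.Orders AS o WITH (NOLOCK)   -- hint on the DML target
WHERE  o.OrderDate < '20210101';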

An Overview of Azure Purview

Angela Henry gives us an overview of Azure Purview:

Organizations are amassing more data than ever, yet it is getting more difficult for their employees to find that data and use it with confidence. What if there was a solution out there that not only told us what data sources we have, but could tell us how those data sources should be used, and who the stewards/producers of that data are? What if it could allow us to classify our data and provide us insights into what our entire data estate looks like? It might sound like data nirvana, but it just might be possible with the newest Platform as a Service (PaaS) offering from Microsoft, Azure Purview.

In the first part of the series, Angela covers the basics and pricing, so check it out.

Installing Kubernetes on Docker Desktop

Joy George Kunjikkur takes another stab at Kubernetes:

This is the third or fourth time I am learning Kubernetes (hereafter mostly referred to by its short form, K8s) through hands-on sessions. Every time I learn the kubectl command and its options, I forget them because there was no chance to apply them in my day job. Another mistake I made each of those times was failing to post what I learned to this blog.

I hope this time I will get a chance to use it in my day job and will not miss posting my Kubernetes learning to this blog.

This post is very basic. The aim is to get started with a Kubernetes development environment using Docker Desktop. Below are the high-level steps; detailed steps with videos are available on the internet.

Read on for some installation tips, including how to install the Kubernetes dashboard for a UI experience.

Caching Function Results in an R Package

Maelle Salmon and Christophe Dervieux show us ways to cache the results of function calls in R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing pure functions, that is, functions without side effects. If your function has side effects, caching might not give you what you expect.

Working with Trees of Data in R

Martin Stingl shows off the data.tree package:

Lately I tried to visualize a hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.

Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.

So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Read on to see how this works and how you can turn a classical data representation of a tree (ID and parent ID) into a flattened structure with a fixed number of levels. H/T R-Bloggers.
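
Martin works in R with data.tree, but for comparison, here is a rough T-SQL sketch of the same starting point: walking an id/parent_id table with a recursive CTE while tracking each node's depth and path. The dbo.Nodes table and its columns are hypothetical.

-- Start at the roots (parent_id IS NULL) and walk down the tree,
-- carrying the level and the path from the root along with each node.
WITH tree AS
(
    SELECT  id,
            name,
            1 AS lvl,
            CONVERT(varchar(4000), name) AS node_path
    FROM    dbo.Nodes
    WHERE   parent_id IS NULL

    UNION ALL

    SELECT  n.id,
            n.name,
            t.lvl + 1,
            CONVERT(varchar(4000), t.node_path + ' > ' + n.name)
    FROM    dbo.Nodes AS n
            INNER JOIN tree AS t
                ON n.parent_id = t.id
)
SELECT  id, name, lvl, node_path
FROM    tree;

Splitting node_path into a fixed number of level columns from there gives the kind of flattened structure the post describes.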

Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al, show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.

Finding Duplicate Images with Filetable

Louis Davidson continues a discussion of SQL Server’s filetable feature:

In my article “Filetables, Great For Business, Great For Hobbyist”, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks.

As my inventory of prepped pictures has grown, one thing has become a major pain. Duplicates. I have tens of thousands of pictures, and I scan them occasionally to add to my directories. Perhaps not surprisingly, the same pictures often get chosen a second time. I use a naming standard that integrates the files into my database, as well as for the copies to go out for a given day. So the second (and third and fourth) time I choose the same picture, it looks different, though it has the exact same bits.

There are some nice benefits to maintaining file metadata in a queryable format, though do read to the end to see the performance impact.
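
As a taste of what "queryable" buys you, one way to look for byte-identical files is to hash each file's contents and group on the hash. This is only a sketch, not necessarily Louis's approach: the filetable name is hypothetical, and hashing a full file_stream with HASHBYTES needs SQL Server 2016 or later (earlier versions cap the input at 8,000 bytes).

-- Find groups of files in a hypothetical filetable whose contents are
-- byte-for-byte identical by hashing file_stream and grouping on the hash.
SELECT  HASHBYTES('SHA2_256', file_stream) AS content_hash,
        COUNT(*)                           AS copies,
        MIN(name)                          AS example_file
FROM    dbo.PictureFiles
GROUP BY HASHBYTES('SHA2_256', file_stream)
HAVING  COUNT(*) > 1;

Keep in mind that this reads every file's bytes through the database engine, which is where performance caveats like the one above tend to come from.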

Tools for Data Modeling

Rayis Imayev shares information on a few database modeling tools:

A picture is worth a thousand words; in the same way, a visual database schema is better than a database model communicated through a multitude of scripted text objects.

Someone may spend a significant amount of time trying to describe all the database tables’ attributes, constraints and relationships between tables with words while a visual Entity Relationship Diagram (ERD) may only take a few minutes (or seconds) to tell the same story.

There are a number of broadly similar tools available to help you create these visual ERD artifacts. The more visual capability a tool offers, the less effort is required to add new information by typing: the drag-and-drop experience does most of the work, and the keyboard is only needed for the explicit lists of attributes, types, and so on in your data model entities.

Click through for reviews of several tools, ranging from free to expensive.

Analytic Window Functions in SQL Server

Aveek Das takes a look at a few analytic window functions available since SQL Server 2012:

Analytic functions were added to the SQL Server database engine with SQL Server 2012, so any version from SQL Server 2012 onward can execute analytic queries. These functions calculate an aggregated value from the dataset, but based on a specific set of rows rather than the entire dataset. Compared to aggregate functions like SUM, COUNT, and AVG, which return a single scalar value, these functions can return multiple rows based on the conditions. The most common examples of using these functions are to find moving averages, running totals, etc. SQL Server supports the following analytic functions.

1. CUME_DIST – Finds the cumulative distribution of a value within a group of rows

2. FIRST_VALUE – Finds the first value of a column in the group and returns it for each row

3. LAST_VALUE – Finds the last value of a column in the group and returns it for each row

4. LAG – Reads column values from a row a specified number of rows before the current row

5. LEAD – Reads column values from a row a specified number of rows after the current row

Click through for examples of how each works.
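
As a rough illustration of the shared syntax, here is a sketch against a hypothetical dbo.Sales table with SaleDate and Amount columns:

-- Each function operates over a window defined by its OVER clause. Note the
-- explicit frame on LAST_VALUE: without it, the default frame ends at the
-- current row and LAST_VALUE simply returns the current row's value.
SELECT  SaleDate,
        Amount,
        FIRST_VALUE(Amount) OVER (ORDER BY SaleDate) AS first_amount,
        LAST_VALUE(Amount)  OVER (ORDER BY SaleDate
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS last_amount,
        LAG(Amount, 1)  OVER (ORDER BY SaleDate) AS previous_amount,
        LEAD(Amount, 1) OVER (ORDER BY SaleDate) AS next_amount
FROM    dbo.Sales;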

Estimating Nonclustered Index Sizes with Powershell

Alex Stuart performs some calculations:

I recently encountered a requirement to estimate the size of (a lot of) nonclustered indexes on some very large tables due to not having a test box to create them on or the time to build one. I couldn’t find a script to do this, and as any programmer knows, laziness is the real mother of invention, so I wrote one.
This post summarises how we can calculate this and provides a Powershell function to do so.

I used Microsoft’s documentation as a basis for how to do it, but as it’s a wall of text that’s a little tricky to follow, I’ll go over the basics here. I’m only covering the leaf levels and non-MAX columns (MAX columns would create LOB pages) – I’ll explain why later.

Click through for the article and to see how Alex’s calculations play out.
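
The leaf-level arithmetic from that documentation is simple enough to sketch in T-SQL. The 8,096 usable bytes per page and the 2 bytes of slot-array overhead per row come from the Microsoft article; the input values below are made up.

-- Back-of-the-envelope leaf-level estimate. @leaf_row_size is the average
-- leaf row size in bytes (key columns + included columns + row locator
-- + row overhead); both inputs here are invented for illustration.
DECLARE @num_rows      bigint = 50000000,
        @leaf_row_size int    = 54;

DECLARE @rows_per_page int =
        FLOOR(8096.0 / (@leaf_row_size + 2));   -- 2 bytes per slot-array entry

SELECT  CEILING(1.0 * @num_rows / @rows_per_page)               AS estimated_leaf_pages,
        CEILING(1.0 * @num_rows / @rows_per_page) * 8 / 1024.0  AS estimated_leaf_mb;

The fiddlier part is working out that average row size from the column definitions, which is where a script earns its keep.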
