Press "Enter" to skip to content

Author: Kevin Feasel

A Forensic Accounting Case Study

I’ve started a new series on applying forensic accounting techniques as a data platform specialist:

Before I dig into my case study, I want to make it absolutely clear that these techniques will help you do a lot more than uncover fraud in your environment. My hope is that there is no fraud going on in your environment and you never need to use these tools for that purpose.

Even with no fraud, there is an excellent reason to learn and use these tools: they help you better understand your data. A common refrain from data platform presenters is “Know your data.” I say it myself. Then we do some hand-waving stuff, give a few examples of what that entails, and go on to the main point of whatever talks we’re giving. Well, this series is dedicated to knowing your data and giving you the right tools to learn and know your data.

This first post sets the scene, with subsequent posts getting into detail on the technical aspects.

Understanding Key Lookups

Monica Rathbun explains what a key lookup is in SQL Server:

One of the easiest things to fix when performance tuning queries is key lookups or RID lookups. The key lookup operator occurs when the query optimizer performs an index seek against a specific table and that index does not have all of the columns needed to fulfill the result set. SQL Server is forced to go back to the clustered index using the primary key and retrieve the remaining columns it needs to satisfy the request. A RID lookup is the same operation but is performed on a table with no clustered index, otherwise known as a heap. It uses a row ID instead of a primary key to do the lookup.

As you can see, these can be very expensive and can result in substantial performance hits in both I/O and CPU. Imagine a query that runs thousands of times per minute and includes one or more key lookups. The tremendous overhead generated by those extra reads affects overall engine performance.

Monica’s absolutely right: key lookups can take a decent query and make it into a performance hog.
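
To see it in action, here’s a minimal sketch against a hypothetical dbo.Orders table (not from Monica’s post): the narrow nonclustered index satisfies the seek but not the extra columns, so the plan adds a key lookup, and adding INCLUDE columns makes the index covering.

-- Hypothetical table; the PRIMARY KEY becomes the clustered index.
CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT NOT NULL,
    OrderDate  DATETIME2 NOT NULL,
    OrderTotal DECIMAL(18,2) NOT NULL
);
CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);

-- With enough rows, this typically shows an index seek plus a key lookup
-- to fetch OrderDate and OrderTotal from the clustered index.
SELECT CustomerID, OrderDate, OrderTotal
FROM dbo.Orders
WHERE CustomerID = 42;

-- One common fix: include the extra columns so the index covers the query.
CREATE INDEX IX_Orders_CustomerID_Covering
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, OrderTotal);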

Finding and Removing Bad Characters Using Tally Tables

Louis Davidson shows how you can use a tally table to find and remove invalid characters from strings:

Now, the idea is that we will join the Application.People table to the Numbers table for a number of rows. We will do this for all of the numbers from 1 to the length of the name, then use that value to get the one-character substring at that position. I also include the Unicode value in the output to allow for some case-sensitive operations, since UNICODE(‘a’) <> UNICODE(‘A’).

This is an example of how powerful tally tables can be.
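
As a rough sketch of what that join looks like (assuming a dbo.Numbers tally table with an integer column named I, which may not match Louis’s exact names), you get one row per character of each person’s name:

-- Break Application.People names into one row per character,
-- along with each character's Unicode value.
SELECT p.PersonID,
       p.FullName,
       n.I AS Position,
       SUBSTRING(p.FullName, n.I, 1) AS CharacterValue,
       UNICODE(SUBSTRING(p.FullName, n.I, 1)) AS UnicodeValue
FROM Application.People AS p
    INNER JOIN dbo.Numbers AS n
        ON n.I <= LEN(p.FullName)
ORDER BY p.PersonID, n.I;

From there, filtering on UnicodeValue (or a list of allowed characters) and stitching the remaining characters back together gets you the cleanup.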

Using Windows Authentication on Non-Windows Devices

Drew Furgiuele shows us how to connect to SQL Server using Windows Authentication if you’re not coming from a Windows device:

SQL Server supports different kinds of authentication mechanisms and protocols: the older NTLM protocol, and Kerberos. A lot of people cringe when you mention Kerberos because, well, Kerberos is hard. It’s arcane, it’s complex, and it’s hard to even describe unless you use it on the regular.

Simply put, it’s a ticketing and key system: you, a user, request a ticket from a store, usually by authenticating to it via a username and password. If you succeed, you get a ticket that gets stored on your local machine. Then, when you want to access a resource (like a SQL Server), the client re-ups with the store you got your initial ticket from (to make sure it’s still valid), and you get a “key” to access the resource. That key is then forwarded on to the resource, allowing you to access the thing you were trying to connect to. It’s way, way more complex than this, with lots of complicated terms and moving parts, so I’m doing a lot of hand-waving, but that’s the core of the system. If that kind of stuff excites you, go Google it, and I promise you’ll get more than you ever bargained for.

Kerberos is a scary beast to me, mostly because I don’t spend enough time working directly with it.

Approaches to Deleting Data in Batches

Andy Mallon shares a couple of approaches to deleting data in batches:

In this scenario, we’re going to keep the data for X days after it’s created. Then we delete it. That’s it. X could be 3 days or 3 years–it doesn’t matter, we’ll follow the same design pattern.

In today’s world, we generate loads of log data, sensor data, telemetry data, etc. All that data is super duper valuable. But only for a while. Eventually, all that granular data becomes less useful, and isn’t worth keeping around. Maybe it gets aggregated, summarized, or maybe it just gets thrown out.

You’ll have a lot of data with more complex requirements, but I think you’ll also be surprised at how much data has simple date-based retention based on its creation.

Also read the comments, as they include additional techniques.
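
For reference, a minimal sketch of the simplest version of this pattern (not necessarily Andy’s exact code) might look like the following, assuming a hypothetical dbo.EventLog table with a CreatedDate column and a 30-day retention window:

DECLARE @BatchSize INT = 5000,
        @Cutoff    DATETIME2 = DATEADD(DAY, -30, SYSUTCDATETIME());

WHILE 1 = 1
BEGIN
    -- Delete in small chunks to keep transactions (and log growth) manageable.
    DELETE TOP (@BatchSize)
    FROM dbo.EventLog
    WHERE CreatedDate < @Cutoff;

    -- A short batch means we've caught up, so stop looping.
    IF @@ROWCOUNT < @BatchSize
        BREAK;

    -- Optional breather so other workloads and log backups can keep up.
    WAITFOR DELAY '00:00:01';
END;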

Standard and Non-Standard Evaluation in R

John Mount explains Standard Evaluation versus Non-Standard Evaluation in R:

In standard (or value-oriented) evaluation, the code you type in is taken to be variable names, function names, operators, and even numeric literal values. String values or literals need extra marks, such as quotes.

John walks us through several examples along the way. By the end, it’s clear that John is a major proponent of Standard Evaluation over Non-Standard Evaluation.

Confluent Platform 5.2 Released

Mau Barra announces Confluent Platform 5.2:

Confluent Platform 5.2 represents a significant milestone in our efforts across three key dimensions:
1. It allows you to use the entire Confluent Platform free forever in single-broker Kafka clusters, so you are freer than ever to start building new event streaming applications right away. We are also bringing librdkafka 1.0 in order to bring our C/C++, Python, Go and .NET clients closer to parity with the Java client.
2. It adds critical enhancements to Confluent Control Center that will help you meet your event streaming SLAs in distributed Apache Kafka environments at greater scale.
3. With our latest version of Confluent Replicator, you can now seamlessly stream events across on-prem and public cloud deployments.

The top item is quite interesting: a free developer license and not just a 30-day trial.

DW Databases in PolyBase

I look at some databases people tend to ignore:

Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.

There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.
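
If you’d like to confirm that these databases are on your instance and poke at what they contain, a quick check along these lines works (assuming the PolyBase feature is installed, so all three databases exist):

-- Do the PolyBase support databases exist?
SELECT name, create_date
FROM sys.databases
WHERE name IN (N'DWConfiguration', N'DWDiagnostics', N'DWQueue');

-- What tables live in each of them?
SELECT 'DWConfiguration' AS DatabaseName, name AS TableName FROM DWConfiguration.sys.tables
UNION ALL
SELECT 'DWDiagnostics', name FROM DWDiagnostics.sys.tables
UNION ALL
SELECT 'DWQueue', name FROM DWQueue.sys.tables
ORDER BY DatabaseName, TableName;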

Managing Kubernetes Resources

Vincent-Philippe Lauzon takes us through some thoughts on Kubernetes resource allocation:

In this article we will look at how to inform Kubernetes about pods’ resources and how we can optimize for different scenarios.

A scenario that typically comes up is when a cluster has a bunch of pods where a lot of them are dormant, i.e. they don’t consume CPU or memory. Do we have to carve out space for them that they won’t use most of the time? The answer is no. As usual, it’s safer to provision capacity for a workload than to rely on the optimistic heuristic that not all workloads will require resources at the same time. So, we can configure Kubernetes optimistically or pessimistically.

Read the whole thing.

Using the Cosmos DB Change Feed

Hasan Savran (who just became a Microsoft MVP, so congrats to him) takes us through the Cosmos DB Change Feed:

Azure Cosmos DB Change Feed exposes Cosmos DB logs to the outside of Cosmos DB. Cosmos DB notifies you immediately when there is any change in your database. It supports all inserts and updates; deletes will be available soon. You can always use soft delete to catch delete events if you need to.

By knowing what has changed in your database, you can trigger all kinds of events and make your application very smart. SQL Server has similar functionality, but like many other features, log shipping is usually blocked by DBAs or company policies. In Cosmos DB, you don’t need to do anything to enable the Change Feed feature! It’s already enabled; all you need to do is configure it. The easiest way to catch change feed events is with Azure Functions.

When I hear someone describe the change feed, I immediately imagine it as a Kafka topic.
