Press "Enter" to skip to content

Author: Kevin Feasel

Everyone’s Data Is Dirty

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best of the data has that tendency to decay in few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.


Query Store Plan Forcing: You Can’t Always Get What You Want

Kendra Little shows an example where trying to force a Query Store plan results in an oddity:

This is not considered a “failure”

When I check the Query Store DMVs, force_failure_count is 0. The last_force_failure_reason_desc is NONE.

Query Store didn’t fail to apply the narrow plan. Instead, it’s just deciding not to give it to me, now that I’ve forced that plan.

Seems kinda like an adolescent, doesn’t it?

The answer remains a bit of a mystery, but read on to see how Kendra troubleshoots this.
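For reference, a minimal sketch of forcing a plan and then checking the columns Kendra looks at; the query_id and plan_id values are placeholders for whatever your own workload shows in sys.query_store_plan.

-- Force a plan (42 and 7 are placeholder IDs), then check whether Query Store
-- reports the forcing as failed for that plan.
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;

SELECT qsp.query_id,
       qsp.plan_id,
       qsp.is_forced_plan,
       qsp.force_failure_count,
       qsp.last_force_failure_reason_desc
FROM sys.query_store_plan AS qsp
WHERE qsp.query_id = 42;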


What To Do With A Database In Source Control

Ed Elliott with Database Source Control 102:

This post is for a specific type of person if you are:

  • New to source control
  • Are getting started on your path to the continuous delivery nirvana
  • Have been able to get your database into some sort of source control system
  • Have more than one person checking in code to the database source
  • You are unsure what to do next

Then this post is for you!

This is a nice post with some next-steps for when you have a database in source control but aren’t quite sure what to do next.


Go Download SQL Operations Studio

The SQL Server Team has announced SQL Operations Studio:

We are excited to announce that SQL Operations Studio is now available in preview. SQL Operations Studio is a free, light-weight tool for modern database development and operations for SQL Server on Windows, Linux and Docker, Azure SQL Database and Azure SQL Data Warehouse on Windows, Mac or Linux machines.

Download SQL Operations Studio to get started.

It’s not SSMS, but it is cross-platform.  And I think that over time, it will end up being better than SSMS.


Defining Result Sets With ML Services

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service could read the data frame columns and use those by default.
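To make the pain point concrete, here is a rough sketch of the pattern in question; the R script and the column definition are made up, but it shows how WITH RESULT SETS has to be spelled out at design time even though the script could return a different shape at run time.

-- The output shape must be declared up front, before the external script runs.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(val = c(1, 2, 3))',
    @input_data_1 = N'SELECT 1 AS placeholder'
WITH RESULT SETS ((val INT));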


Using The GROUPING SETS Operator

Alfonso Hernandez goes into detail with what you can do with GROUPING SETS:

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUM may be used on any columns in the query. However, if you want to group the data by multiple combinations of group by expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALL operator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Mastering GROUPING SETS makes reporting queries in T-SQL so much more effective.
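As a quick sketch of the two approaches the article compares, using a hypothetical Sales table with Region, Product, and Amount columns:

-- Approach 1: one grouped query per combination, merged with UNION ALL.
SELECT Region, NULL AS Product, SUM(Amount) AS Total
FROM dbo.Sales
GROUP BY Region
UNION ALL
SELECT NULL, Product, SUM(Amount)
FROM dbo.Sales
GROUP BY Product;

-- Approach 2: the same result in a single query with GROUPING SETS.
SELECT Region, Product, SUM(Amount) AS Total
FROM dbo.Sales
GROUP BY GROUPING SETS ((Region), (Product));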


Thoughts On Dynamic Data Masking

Kellyn Pot’vin-Gorman talks about Dynamic Data Masking:

Anyone with the unmask privilege or DB_OWNER will be able to view the data.  As many development and testing environments grant higher privileges to the users and in SQL Server, it’s not rare for a developer to be the DB_OWNER, (I used to come across this all the time when recoveries were performed by the wrong OS user) this leaves this data still quite vulnerable.  I do like that if you were to take a backup and recover it with masking, the obfuscated data is what is recovered physically.  I’m more concerned about those odd environments where compliance hasn’t been put in place on owners of the database that would still view the originally masked data, but unmasked.

Performance isn’t impacted, (i.e. no referential integrity concerns or execution plans) as the optimizer  performs all steps against the real data, which leads me to wonder what happens with some of the newer monitoring tools that state they can display SQL and bind variable data without accessing the database directly.  Would they “sniff” the masked data or unmasked?  Would it matter who the OS User or roles in the database?

The important thing here is that DDM isn’t really a security product.  It’s a something-or-another product that might be useful to stop shoulder surfing but pretty much nothing else.
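For context, applying a mask and handing out the UNMASK permission looks roughly like this; the table, column, and user names are made up for illustration.

-- Mask an email column on a hypothetical table.
ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Anyone granted UNMASK (and members of db_owner) still sees the real values,
-- which is the gap Kellyn calls out above.
GRANT UNMASK TO ReportingUser;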


Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.


Microservices With Kafka Streams

Ben Stopford walks us through a microservices architecture built on top of Kafka:

So we can use the Kafka Streams API to piece together complex business systems as a collection of asynchronously executing, event-driven services. The differentiator here is the API itself, which is far richer than, say, the Kafka Producer or Consumer. It makes code more readable, provides reusable implementations of common patterns like joins, aggregates, and filters and wraps the whole ecosystem with a transparent level of correctness.

Systems built in this way, in the real world, come in a variety of guises. They can be fine grained and fast executing, completing in the context of an HTTP request, or complex and long-running, manipulating the stream of events that map a whole company’s business flow. This post focusses on the former, building up a real-world example of a simple order management system that executes within the context of a HTTP request, and is entirely built with Kafka Streams. Each service is a small function, with well-defined inputs and outputs. As we build this ecosystem up, we will encounter problems such as blending streams and tables, reading our own writes, and managing consistency in a distributed and autonomous environment.

This post stays high-level and covers a lot of ground.  I’m wishy-washy on the idea of microservices, but if you are going to do them, it’s better to do them right.


A Compendium Of R Errors

Sumendar Karupakala has a bunch of errors you might find in R, as well as their explanations and fixes:

# pull out the animals which are dogs
animaldata[animaldata$Animal.Type == "Dog"]  # throws an error: a single index selects columns, not rows

Error in `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog"): undefined columns selected
Traceback:
1. animaldata[animaldata$Animal.Type == "Dog"]
2. `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog")
3. stop("undefined columns selected")

# fixed: the comma inside the brackets makes the logical vector select rows
animaldata[animaldata$Animal.Type == "Dog", ]

Some of these are basic syntax errors; others are a bit nastier.
