Press "Enter" to skip to content

Category: Data

Parsing Text Fragments

Aaron Bertrand looks at a way of speeding up LIKE '%Something%' queries and builds a fragment table:

It’s clear that in this specific case – with an address column of nvarchar(60) and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive “leading wildcard” searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I’ve also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some “greatest n per group” logic.

Read the whole thing.  If you’re interested in the concept, I recommend reading up on n-grams, like Alan Burstein’s series and this TechNet article on implementing N-Grams in SQL Server.
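
To give a flavor of the technique, here is my own rough sketch (not Aaron's code; dbo.Addresses is a made-up table with an AddressID and an nvarchar(60) Address column): store every suffix of each address once, so that the leading-wildcard search becomes an index-friendly prefix seek against the fragments.

-- Fragment table keyed on the fragment itself, so prefix searches can seek.
CREATE TABLE dbo.AddressFragments
(
    AddressID int NOT NULL,
    Fragment  nvarchar(60) NOT NULL,
    CONSTRAINT PK_AddressFragments PRIMARY KEY (Fragment, AddressID)
);

-- Every suffix of a string has a distinct length, so no duplicates per address.
INSERT dbo.AddressFragments (AddressID, Fragment)
SELECT a.AddressID, SUBSTRING(a.Address, nums.n, 60)
FROM dbo.Addresses AS a
JOIN (SELECT TOP (60) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM sys.all_objects
      ORDER BY n) AS nums
    ON nums.n <= LEN(a.Address);

-- EXISTS avoids duplicate output rows when several fragments of one address match.
SELECT a.AddressID, a.Address
FROM dbo.Addresses AS a
WHERE EXISTS
(
    SELECT 1
    FROM dbo.AddressFragments AS f
    WHERE f.AddressID = a.AddressID
      AND f.Fragment LIKE N'Riverside%'   -- a seek, instead of scanning for %Riverside%
);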


Why Care About The Data?

Rob Collie explains some of the benefits of Power BI with respect to its ability to dig into data problems:

Many problems/opportunities can ONLY be addressed by making thousands of small behavior changes, on an ongoing basis, out in the “trenches.”  In other words, there are some which CANNOT be addressed by a small number of smart decisions made at the top of the org.  Improved pricing behavior by the sales force for instance.  Improved quality of service by field technicians, for another.

How do you make that happen?  How do you influence a large number of people to behave differently on an ongoing basis, especially when they operate at a distance (both organizationally and physically) from you?

Fancy management consulting firms charge six-seven figures to do this, but with Power BI and/or Power Pivot, it’s no longer necessary to make that outlay.  We’ve seen organizations crack this nut with in-house resources, modest budgets, and short timelines – even though the impact is absolutely enterprise-wide.  It takes some thought and iteration, for sure, but primarily, you “just” need to build a good scorecard.

This is a non-technical post, but I like it a lot because it helps motivate decisions and gives you a good reason to investigate whether your tools are doing the work you need them to do.


Benford’s Law

Tomaz Kastrun is starting a series on fraud analysis, beginning with Benford's Law:

One of the samples Microsoft provided with the release of SQL Server 2016 uses the simple logic of Benford’s law. This law works great with naturally occurring numbers and can be applied to almost any kind of problem. By naturally occurring, I mean numbers that are not generated artificially – page numbers in a book, incrementing numbers in a SQL table, sequence numbers of any kind – but numbers that occur independently of each other in nature: lengths and widths of trees, mountains and rivers, lengths of roads in cities, street addresses in your home town, city and country populations, and so on. The law takes the log distribution of leading digits from 1 to 9 and stipulates that the digit one will occur about 30% of the time, two about 17% of the time, three about 12% of the time, and so on. Randomly generated numbers, by contrast, will almost certainly give each digit from 1 to 9 a probability of roughly 1/9. The law also might not hold where there are restrictions; for example, heights expressed in inches will surely not produce a Benford distribution. My height is 188 cm, which is 74 inches or 6ft2. None of those three numbers will generate the correct distribution, even though height is a natural phenomenon.

Tomaz includes SQL Server R Services code, so check it out.
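
For reference, Benford’s law says the expected share of leading digit d is log10(1 + 1/d), which works out to about 30.1% for 1 down to 4.6% for 9. Tomaz does the analysis in R, but here is a rough T-SQL-only sketch of the same check; dbo.Payments(Amount) is a made-up table, so point it at any naturally occurring positive measure you have handy.

-- Observed leading-digit distribution vs. Benford's expected log10(1 + 1/d).
WITH LeadingDigits AS
(
    SELECT LeadingDigit = CONVERT(int,
        ABS(Amount) / POWER(CONVERT(float, 10), FLOOR(LOG10(ABS(Amount)))))
    FROM dbo.Payments
    WHERE Amount <> 0
)
SELECT
    d.Digit,
    Observed = COUNT(ld.LeadingDigit) * 1.0
               / NULLIF(SUM(COUNT(ld.LeadingDigit)) OVER (), 0),
    Benford  = LOG10(1.0 + 1.0 / d.Digit)   -- expected share of digit d
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(Digit)
LEFT JOIN LeadingDigits AS ld
    ON ld.LeadingDigit = d.Digit
GROUP BY d.Digit
ORDER BY d.Digit;

Big gaps between the Observed and Benford columns are the sort of thing that makes a fraud analyst take a second look.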


Accidental DBAs

Charity Majors on the Accidental DBA phenomenon:

(OH RIGHT, WE WROTE A BOOK ABOUT THIS!!!)

My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering”, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.

(Actually that’s a good point, I am supposed to be pitching this book! — which is really mostly Laine with a smidgen of me but it’s going to be super awesome.  Consider this your sales pitch.)

So first, as an accidental DBA, you should obviously buy this book  :).  Second: stateful services require a different mindset[*].  It’s cool that you are running your own databases!  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.

The theme of the story is a Mongo upgrade gone south, but this is a post about principles.  And rainbows.


Reporting On Unstructured Data

Jen Underwood discusses a tool which helps provide structure to text documents:

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Looks like an interesting tool.


Making Wide World Importers Bigger

Koen Verbeeck wants bigger fact tables for Wide World Importers:

Microsoft released a new sample database a couple of months back: Wide World Importers. It’s quite great: not every (unnecessary) feature is included, only features you’d actually use; lots of sample scripts are provided; and – most importantly – you can generate data up to the current date. One small drawback: it’s quite tiny. The data warehouse in particular is really small. The biggest table, Fact.Order, has about 266,000 rows and uses around 280MB on disk. Your numbers may vary, because I generated data up to the current date (12th of August 2016) and with more random samples per day. So most likely, other versions of WideWorldImportersDW might be even smaller. That’s right. Even smaller.

266 thousand rows is nothing for a fact table.  I was hoping that the data generator would allow for a bigger range of results, from “I only want a few thousand records” (which it already handles) up to “I need a reason to buy a new hard drive.”  Koen helps out by giving us a script to expand the primary fact table.
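
If you just want a quick and dirty way to blow up a fact table without waiting for a proper generator, the general pattern is to cross join the existing rows against a numbers set and re-insert them. This is not Koen's script – read his post for the real thing – just a self-contained sketch against a made-up table; to apply it to Fact.[Order], swap in that table's actual column list (everything except the identity key).

-- Made-up demo fact table so the example runs on its own.
CREATE TABLE dbo.FactDemo
(
    OrderKey    bigint IDENTITY(1,1) PRIMARY KEY,
    DateKey     int           NOT NULL,
    CustomerKey int           NOT NULL,
    Quantity    int           NOT NULL,
    Amount      decimal(18,2) NOT NULL
);

INSERT dbo.FactDemo (DateKey, CustomerKey, Quantity, Amount)
VALUES (20160812, 1, 3, 29.97), (20160812, 2, 1, 9.99);

DECLARE @copies int = 9;   -- 2 rows become 20; crank this up as needed

-- Re-insert every existing row @copies times; the IDENTITY column hands out
-- new surrogate keys for the duplicated rows.
INSERT dbo.FactDemo (DateKey, CustomerKey, Quantity, Amount)
SELECT f.DateKey, f.CustomerKey, f.Quantity, f.Amount
FROM dbo.FactDemo AS f
CROSS JOIN (SELECT TOP (@copies) 0 AS n FROM sys.all_objects) AS copies;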


SQL Saturday Attendee Distances

I have a long, long post on figuring out how far SQL Saturday attendees tend to drive:

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

Despite my protestations and complaints, I think there are some reasonable conclusions.  If you need to look like you’re working for a couple of hours (or at least want to play around a bit with SQL and R), this is the post for you.
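
The post mixes SQL and R, and the distance piece doesn't have to be done this way, but purely as an illustration: SQL Server's geography type will do the great-circle math for you. The coordinates below are just made-up examples (roughly Raleigh and Norfolk).

-- geography::Point takes (latitude, longitude, SRID); STDistance returns meters.
DECLARE @event    geography = geography::Point(35.7796, -78.6382, 4326);
DECLARE @attendee geography = geography::Point(36.8508, -76.2859, 4326);

SELECT DistanceInMiles = @event.STDistance(@attendee) / 1609.344;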


Fashion Design And Genetic Algorithms

Daragh Sibley, et al., discuss using genetic algorithms to help design clothing:

Recombination. Having selected a set of high performing blouses we can now consider how they should be recombined to form a new child. While a traditional genetic algorithm would stochastically search all combinations over many market generations, we can shortcut that process by algorithmically looking for features that have been historically preferred by our target client segment.

To achieve this, we find statistical regularities between the population of blouses’ attributes (or configurations of attributes) and client feedback. For instance, we can model the relationship between attributes of our existing blouses and client feedback via:

Genetic algorithms (and Koza-style genetic programming) have long been a favorite topic of mine.  Integrating GA with fashion was not something that had come to mind, but it makes for a very interesting application.
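
To make the "look for historically preferred features" shortcut a bit more concrete – with entirely made-up tables, and definitely not the authors' actual implementation – the simplest version is just an aggregation of client feedback by attribute value, which you can then use to bias which "genes" survive into the next generation of designs.

-- Score each attribute value (sleeve length, collar, print, ...) by average
-- historical client feedback; well-supported winners get favored at recombination.
SELECT
    a.AttributeName,
    a.AttributeValue,
    AvgFeedback = AVG(f.Rating * 1.0),
    Reviews     = COUNT(*)
FROM dbo.BlouseAttributes AS a
JOIN dbo.ClientFeedback   AS f
    ON f.BlouseID = a.BlouseID
GROUP BY a.AttributeName, a.AttributeValue
HAVING COUNT(*) >= 30          -- ignore thinly supported attribute values
ORDER BY a.AttributeName, AvgFeedback DESC;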


Comparing Table Contents

Kenneth Fisher discusses methods for comparing table contents:

Roll your own
I.e., write one T-SQL query to compare the structure of the two tables (this one will always be the same) and another to compare the data (new each time).

  • Benefits

    • Free.
    • You get practice in writing T-SQL (yes, I’m stretching there).
  • Flaws

    • Can take quite a bit of time to write. (Don’t forget to check for NULLs separately.)
    • Mistakes happen.
    • Performance may not be spectacular.

Good comparison.  Check it out.
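
If you do go the roll-your-own route, here is a minimal sketch (TableA and TableB are placeholders for two identically structured tables). Using EXCEPT sidesteps the NULL caveat above, because EXCEPT treats two NULLs as a match rather than as unknown.

-- Data comparison: rows only in A, then rows only in B.
SELECT Direction = 'In A, not in B', *
FROM (SELECT * FROM dbo.TableA
      EXCEPT
      SELECT * FROM dbo.TableB) AS OnlyInA

UNION ALL

SELECT 'In B, not in A', *
FROM (SELECT * FROM dbo.TableB
      EXCEPT
      SELECT * FROM dbo.TableA) AS OnlyInB;

-- Structure comparison works the same way against column metadata.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TableA'
EXCEPT
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TableB';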
