Press "Enter" to skip to content

Category: Data

More Isn’t Better With Data Collection

Andy Leonard argues that more data is not better data:

The Problem I am Trying To Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (fooledbyrandomness.com | @nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”

Taleb states that more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.

It’s an idea worth grappling with.  The other side of the argument is that for some problems, you won’t know what you need until you need it.
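
To make the spurious-correlation claim concrete, here is a minimal Python sketch (my own illustration, not anything from Taleb or Andy): generate columns of pure noise and count how many pairs of columns clear an arbitrary "looks interesting" correlation threshold as the number of columns grows.

```python
import numpy as np

# Illustration: with more purely random variables, the number of
# variable pairs that clear a "significant-looking" correlation
# threshold grows, even though every variable is pure noise.
rng = np.random.default_rng(42)
n_rows = 50          # few observations per variable ("too little data per variable")
threshold = 0.35     # an arbitrary "looks interesting" cutoff

for n_vars in (10, 50, 200):
    data = rng.normal(size=(n_rows, n_vars))
    corr = np.corrcoef(data, rowvar=False)
    # count off-diagonal pairs whose absolute correlation exceeds the threshold
    upper = np.triu_indices(n_vars, k=1)
    spurious = int(np.sum(np.abs(corr[upper]) > threshold))
    print(f"{n_vars:4d} variables -> {spurious} spurious 'relationships'")
```

The count of noise "relationships" grows roughly with the square of the number of variables, while the amount of real information in the table stays at zero; that is the convex-noise point in a nutshell.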

The Importance Of Action

Jesse Seymour has relaunched his blog and started with a controversial statement:

There is no value in data.

If you’re still here, then I am assuming you either a) believe I have a valid point, or b) just want to see how crazy I am for opening my new data blog with a post spouting the lack of value in data.  We’ll see which option is right by the end of the post because right now, I am not so sure which one is right and which one is wrong.  After all, if there is no value in data, why should companies hire data professionals and give them a pay check?

My full response is too long for this format, so the short version is that data requires context.  I agree that action is important, but the purpose of a data visualization professional is to provide information with the relevant context to assist decision-making.  It’s not that there’s no value in data or that action is everything; it’s a multi-faceted process, and the specific relevant data will depend upon the industry.  In professional sports, front offices certainly use accurate(-ish) metrics which show the worst-performing players on the team, because sports leagues are zero-sum games.  Finding out that Fred in Accounts Receivable spends the most time at the coffeemaker each day (17 minutes instead of 12!) matters a lot less; unless you’re doing a Taylor-style factory study (and if you are, I’ll have other words with you that also aren’t apropos here), it doesn’t rate high enough on the relative priority list.

Fuzzy Searches In SQL Server

Phil Factor wants fuzzy searches done inside the relational database:

Many times in the past, I’ve had arguments with members of the development teams who, when we are discussing fuzzy searches, draw themselves up to their full height, look dignified, and say that a relational database is no place to be doing fuzzy searches or spell-checking. It should, they say, be done within the application layer. This is nonsense, and we can prove it with a stopwatch.

We are dealing with data. Relational databases do this well, but it just has to be done right. This implies searching on well-indexed fields such as the primary key, and not being ashamed of having quite large working tables. It means dealing with the majority of cases as rapidly as possible. It implies learning from failures to find a match. It means, most of all, a re-think from a procedural strategy.

This is a very interesting article, as Phil’s tend to be.  I enjoy these types of solutions, which require almost an inversion of mindset:  instead of writing code which understands the data you intended, write simpler code which looks at intention-laden data.
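
As a toy illustration of that inversion (my own Python sketch with made-up names, not Phil’s T-SQL), the idea is to precompute a working table of simplified keys so the query-time work becomes a plain keyed lookup rather than procedural string comparisons against everything:

```python
import re
from collections import defaultdict

# Toy version of the "inversion" described above: rather than code that
# computes fuzzy distances procedurally at query time, precompute a working
# table of simplified keys so that a lookup on an indexed key does the work.
# The names and the normalization rule here are hypothetical.

def simplify(term: str) -> str:
    """Collapse a term to a crude phonetic-ish key: lowercase, letters only,
    drop vowels after the first character, squeeze repeated letters."""
    letters = re.sub(r"[^a-z]", "", term.lower())
    if not letters:
        return ""
    key = letters[0] + re.sub(r"[aeiou]", "", letters[1:])
    return re.sub(r"(.)\1+", r"\1", key)

# "Working table": many simplified keys can map back to the same canonical value.
canonical_terms = ["Manchester", "Birmingham", "Liverpool"]
lookup = defaultdict(set)
for term in canonical_terms:
    lookup[simplify(term)].add(term)

# Query time is just a keyed lookup, the kind of operation an index serves well.
for query in ("Manchestor", "Burmingham", "Liverpol"):
    print(query, "->", lookup.get(simplify(query), set()))
```

In a database you would persist that working table and index the key column, which is exactly the sort of work a relational engine does well.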

Parsing Text Fragments

Aaron Bertrand looks at a way of speeding up LIKE %Something% queries and builds a fragment table:

It’s clear that in this specific case – with an address column of nvarchar(60) and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive “leading wildcard” searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I’ve also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some “greatest n per group” logic.

Read the whole thing.  If you’re interested in the concept, I recommend reading up on n-grams, like Alan Burstein’s series and this TechNet article on implementing N-Grams in SQL Server.
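
To get the flavor of the technique before reading the article, here is a conceptual Python sketch (not Aaron’s T-SQL) of what a fragment table holds and why it turns a leading-wildcard search into an index-friendly prefix search:

```python
# Conceptual sketch of a fragment table: store every suffix of a value,
# keyed back to the source row, so that a search such as LIKE '%cedar%'
# can be rewritten as a prefix search (fragment LIKE 'cedar%'), which an
# index on the fragment column can seek.

def fragments(value: str):
    """Yield every suffix of the value; each one becomes a fragment row."""
    v = value.strip()
    for i in range(len(v)):
        yield v[i:]

addresses = {1: "236 Cedar Street", 2: "19 Cedarwood Avenue"}

# Build the fragment "table" as (fragment, source row id) pairs.
fragment_table = sorted(
    (frag.lower(), row_id)
    for row_id, addr in addresses.items()
    for frag in fragments(addr)
)

# A leading-wildcard search for '%cedar%' becomes a prefix match on fragments.
# (In SQL this list scan would be an index seek on the fragment column.)
pattern = "cedar"
matches = {row_id for frag, row_id in fragment_table if frag.startswith(pattern)}
print(matches)   # both rows contain 'cedar'
```

Collecting the matches into a set mirrors Aaron’s point about EXISTS versus JOIN: you get each matching row back once, with no “greatest n per group” cleanup.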

Why Care About The Data?

Rob Collie explains some of the benefits of Power BI with respect to its ability to dig into data problems:

Many problems/opportunities can ONLY be addressed by making thousands of small behavior changes, on an ongoing basis, out in the “trenches.”  In other words, there are some which CANNOT be addressed by a small number of smart decisions made at the top of the org.  Improved pricing behavior by the sales force for instance.  Improved quality of service by field technicians, for another.

How do you make that happen?  How do you influence a large number of people to behave differently on an ongoing basis, especially when they operate at a distance (both organizationally and physically) from you?

Fancy management consulting firms charge six-seven figures to do this, but with Power BI and/or Power Pivot, it’s no longer necessary to make that outlay.  We’ve seen organizations crack this nut with in-house resources, modest budgets, and short timelines – even though the impact is absolutely enterprise-wide.  It takes some thought and iteration, for sure, but primarily, you “just” need to build a good scorecard.

This is a non-technical post, but I like it a lot because it helps motivate decisions and gives you a good reason to investigate whether your tools are doing the work you need them to do.

Benford’s Law

Tomaz Kastrun is starting a series on fraud analysis and starts with Benford’s Law:

One of the samples Microsoft provided with release of new SQL Server 2016 was using simple logic of Benford’s law. This law works great with naturally occurring numbers and can be applied across any kind of problem. By naturally occurring, it is meant a number that is not generated generically such as a page number in a book, incremented number in your SQL Table, sequence number of any kind, but numbers that are occurring irrespective from each other, in nature (length or width of trees, mountains, rivers), length of the roads in the cities, addresses in your home town, city/country populations, etc. The law calculates the log distribution of numbers from 1 to 9 and stipulates that number one will occur 30% of times, number two will occur 17% of time, number three will occur 12% of the time and so on. Randomly generated numbers will most certainly generate distribution for each number from 1 to 9 with probability of 1/9. It might also not work with restrictions; for example height expressed in inches will surely not produce Benford function. My height is 188 which is 74 inches or 6ft2. All three numbers will not generate correct distribution, even though height is natural phenomena.

Tomaz includes SQL Server R Services code, so check it out.
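
For reference, Benford’s Law says the leading digit d should appear with probability log10(1 + 1/d), so 1 leads roughly 30% of the time and 9 under 5%. Here is a quick Python illustration of comparing observed leading digits against that expectation; it is my own sketch with made-up sample values, not the R Services code from the post:

```python
import math
from collections import Counter

# Benford's expected first-digit distribution: P(d) = log10(1 + 1/d).
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_frequencies(values):
    """Observed frequency of the leading digit 1-9 across a set of numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Example: invoice amounts (or populations, river lengths, ...) would go here.
sample = [3281, 17.5, 192, 44, 1208, 960, 23, 311, 7.2, 150, 26, 18, 1.9, 41, 135]
observed = first_digit_frequencies(sample)
for d in range(1, 10):
    print(f"digit {d}: expected {expected[d]:.3f}  observed {observed[d]:.3f}")
```

Large gaps between the expected and observed columns on naturally occurring figures are the kind of signal a fraud analysis would flag for closer inspection.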

Accidental DBAs

Charity Majors on the Accidental DBA phenomenon:

(OH RIGHT, WE WROTE A BOOK ABOUT THIS!!!)

My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering“, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.

(Actually that’s a good point, I am supposed to be pitching this book! — which is really mostly Laine with a smidgen of me but it’s going to be super awesome.  Consider this your sales pitch.)

So first, as an accidental DBA, you should obviously buy this book  :).  Second: stateful services require a different mindset[*].  It’s cool that you are running your own databases!  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.

The theme of the story is a Mongo upgrade gone south, but this is a post about principles.  And rainbows.

Reporting On Unstructured Data

Jen Underwood discusses a tool which helps provide structure to text documents:

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Looks like an interesting tool.

Making Wide World Importers Bigger

Koen Verbeeck wants bigger fact tables for Wide World Importers:

Microsoft released a new sample database a couple of months back: Wide World Importers. It’s quite great: not every (unnecessary feature) is included but only features you’d actually use, lots of sample scripts are provided and – most importantly – you can generate data until the current date. One small drawback: it’s quite tiny. Especially the data warehouse is really small. The biggest table, Fact.Order, has about 266,000 rows and uses around 280MB on disk. Your numbers may vary, because I have generated data until the current date (12th of August 2016) and I generated data with more random samples per day. So most likely, other versions of WideWorldImportersDW might be even smaller. That’s right. Even smaller.

266,000 rows is nothing for a fact table.  I was hoping the data generator would allow for a bigger range of results, from “I only want a few thousand records” (which it already handles) all the way up to “I need a reason to buy a new hard drive.”  Koen helps out by giving us a script to expand the primary fact table.
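
To give a sense of the general shape of that kind of script (a hedged pandas sketch of the idea with made-up column names, not Koen’s actual T-SQL), you can inflate a fact table by cross-joining it against a multiplier sequence and then regenerating keys and dates so the copies don’t collide:

```python
import pandas as pd

# Conceptual sketch: multiply a small fact table by cross-joining it with a
# multiplier sequence, then shift dates and regenerate surrogate keys so the
# copies remain distinct rows. Column names here are illustrative only.
fact_order = pd.DataFrame({
    "OrderKey":  [1, 2, 3],
    "OrderDate": pd.to_datetime(["2016-08-10", "2016-08-11", "2016-08-12"]),
    "Quantity":  [10, 4, 7],
})

multiplier = pd.DataFrame({"CopyNumber": range(100)})   # 100x more rows

big = fact_order.merge(multiplier, how="cross")
big["OrderKey"] = range(1, len(big) + 1)                # fresh surrogate keys
big["OrderDate"] = big["OrderDate"] - pd.to_timedelta(big["CopyNumber"], unit="D")
big = big.drop(columns="CopyNumber")

print(len(fact_order), "rows ->", len(big), "rows")
```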
