Press "Enter" to skip to content

Category: Data

The Intersection Of Multiple Averages Is The Empty Set

Adi Gaskell argues that we shouldn’t get too wrapped up in “average” behaviors:

I’ve written extensively about the tremendous potential for big data in healthcare to drive enormous changes in how we keep people healthy for longer. It goes without saying however that all data is not created equal, and just having a large sample is not always sufficient to get the best insights.

If we needed reminding, a reminder comes via a recent study from the University of California, Berkeley. It suggests that things like emotion, behavior, and physiology vary hugely between individuals, therefore having an average over a large dataset can still produce a ‘norm’ that is wide of the mark for individuals.

“If you want to know what individuals feel or how they become sick, you have to conduct research on individuals, not on groups,” the researchers say. “Diseases, mental disorders, emotions, and behaviors are expressed within individual people, over time. A snapshot of many people at one moment in time can’t capture these phenomena.”

Variance is important.

Comments closed

Finding The Real Character Set: Unicode And SQL Server Identifiers

Solomon Rutzky wraps up his series on Unicode and regular identifiers:

The question that I’m trying to answer is: what are the valid “letters” and “decimal numbers” from other national scripts?

I tried using the online research tool “UnicodeSet”, but that gave slightly different results compared (using the “alphabetic” and “numeric_type = decimal” properties) to what I discovered SQL Server actually accepts.

I then loaded the actual Unicode 3.2 data files only to find that the number of characters having either the “alphabetic” or “numeric_type = decimal” properties was different than both the online search and what SQL Server actually accepts.

And so…..

Click through to find the real Unicode killer.

Comments closed

Execution Plans And GDPR

Grant Fritchey isn’t crazy when it comes to execution plans:

Now, when you save an execution plan out to a file, you’re potentially transmitting PI data. It goes further. When you hard code values, PI is not just in the query. Those PI values can also be stored throughout the plan in various properties.

So now you see what I mean when I say that the GDPR affects how we deal with execution plans. I’m not done yet.

Unfortunately, questions like the one Grant raises here won’t be answered until we see a few test cases in the European courts.

Comments closed

Everyone’s Data Is Dirty

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best of the data has that tendency to decay in few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.

Comments closed

The GDPR And You

William Brewer has some Q&A regarding the General Data Protection Regulation:

The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.

Grant Fritchey summarizes the rules:

Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.

I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.

As a data professional, you’re going to want to know about this regulation.

Comments closed

Master Data In Azure

Matt How explains why Master Data Services isn’t a great cloud-based master data management solution and offers up an alternative:

Excel is easy to use, but not user friendly

Excel is on nearly every desktop in any Windows based organisation and with the Master Data Services Add-in, it puts the data well within the reach of the users. Whilst it is simple it is in no way user friendly when compared to other applications that your users may be using. Not to mention that for most this will be the only part of the solution they see! Wouldn’t it be great if there was a way to supply the same data but with an intuitive, mobile ready front end that people enjoy using?

Developers are tightly constrained

Developers like to develop, not choose options from drop down menus in a web based portal. With MDS, not only can Devs not make use of Visual Studio and a like but they are very tightly constrained by the business rules engine. At this point we should be able to make use of our preferred IDE so that we can benefit from source control, frameworks and customised business logic.

Not scalable according to modern expectations

Finally, MDS cannot scale to handle any kind of “big data”. It’s a bit of buzz word but as businesses collect more and more data, we need a data management option that can grow with that data. Due to the fact that MDS must be deployed from a server, there is no easy way to meet those big data requirements.

There are a few pieces to Matt’s solution, making for an interesting read.

Comments closed

Regular Expression Cheat Sheets

Mara Averick shows off a collection of regular expression guides:

There are helpful string-related R packages 📦, stringr (which is built on top of the more comprehensive stringi package) comes to mind. But, at some point in your computing life, you’re gonna need to get down with regular expressions.

And so, here’s a collection of some of the Regex-related links I’ve tweeted 🐦:

Click through for links to regular expression resources.

Comments closed

Keep That Data Raw

Archana Madhavan argues that you should retain your raw data:

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

  • Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.

  • Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.

  • Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.

The problem is that even if the cost of storage is much cheaper than before, there’s a fairly long tail before you get into potential revenue generation.  I like the idea, but selling it is hard when you generate a huge amount of data.

Comments closed

Building An API To Read An API

Jesse Seymour shows how to build a WebAPI project to retrieve JSON data from another API:

In this file, our goal is to create a class library that connects to an API, authenticates, retrieves JSON formatted data, and deserializes to output for use in a SSIS package.  In this particular solution, I created a separate DLL for the class library which will require me to register it in the global assembly cache on the ETL server.  If your environment doesn’t allow for this, you can still use some of the code snippets here to work with JSON data.

Our order of operations will be to do the following tasks:  Create a web request, attach authentication headers to it, retrieve the serialized JSON data, and deserialize it into an object.  I use model-view-controller (MVC) architecture to organize my code, minus the views because I am not presenting the data to a user interface.

Read on for a depiction and all of the project code.  Building a separate WebAPI project to retrieve this data is usually a good move, as you gain a lot of flexibility:  you can run it on cheaper hardware, schedule data refreshes, send the data out to different locations, and so on.

Comments closed

Inference Attacks

Phil Factor explains that your technique for pseudonymizing data doesn’t necessarily anonymize the data:

It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified

Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.

Read the whole thing.  One big problem with trying to anonymize data is that you don’t know how much the attacker knows.  Especially with outliers or smaller samples, you might be able to glean interesting information with a series of queries.  Even if the application only returns aggregated results for some N, you can often put together a set of queries where you slice the population different ways until you get hidden details on individual.  Phil covers these types of inference attacks.

Comments closed