The greatest stumbling block our respondents identified as hindering their attempts at better utilizing data is one that has existed for some time but seems to have worsened as data volumes have grown – data silos. Only 2 percent of our respondents considered their business to be completely effective at data sharing – for the rest, data silos are a real problem.
The causes for this are numerous, and span inconsistency of systems being used (42 percent), different data formats (38 percent), and a lack of coordinated data strategy (37 percent). On top of this, over a third highlight a lack of technology integration (36 percent) and/or legacy technology barriers (36 percent) as blocking attempts to effectively share data.
My first response is to say that this is in part due to the growth of microservices architecture, which seems to push data siloing. But at the same time, this has been the case for a long time, so I don’t think it’s either a necessary or a sufficient explanation.
Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.
One definition of “big data” is “Data that is too big to fit on one machine.” By that definition what is “big data” for one language is plain-old “data” for another. Java, with its efficient memory management, high performance, and multi-threading, can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
But you don’t have to take my word for that.
There are some interesting thoughts in this post, but there are limits to what a single machine can do.
The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table’s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let’s just go ahead and say it. “This was a sucky implementation.” But as is common, in my typical role as a data architect in the data warehousing team, I get to learn all sorts of interesting techniques for finding and dealing with “data” that has been used in “interesting” ways.
What is kind of interesting is actually figuring out what that duplicated data was. The case that I was dealing with wasn’t a kind of useful packed surrogate value, where you may use a base 62 number, with a-z, A-Z and 0-9 as characters. So 1, 2, …, 9, 0, a, b, c, …, x, y, z, A, B, etc. 1A1 is a different value in that sequence than 1a1, and is greater. Neat technique, and one that I have been threatening to develop using a SEQUENCE object, where you can pack in a lot of sequential data in a small number of bytes. No, this wasn’t a useful case such as this. In this case, one value was lower case, another had leading capitals. So perhaps “active customer” and “Active Customer”. Yeah, seriously, they meant different things.
Louis shows some of the nuance required in making this work.
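To make the problem concrete, here is a minimal sketch in Python (not Louis's T-SQL, which lives in the linked post) of how you might spot values that a case-insensitive comparison treats as equal but that are spelled differently. The function name and the sample list are made up for illustration, and `casefold()` is only a rough stand-in for a case-insensitive collation.

```python
from collections import defaultdict


def find_case_variants(values):
    """Group values that match when case is ignored but differ in exact spelling."""
    groups = defaultdict(set)
    for value in values:
        # casefold() is Python's caseless-matching form; a database
        # collation may apply different rules, so treat this as approximate.
        groups[value.casefold()].add(value)
    # Keep only the keys that have more than one distinct spelling.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical status column, echoing the example in the excerpt.
statuses = ["active customer", "Active Customer", "inactive", "inactive", "Prospect"]

if __name__ == "__main__":
    print(find_case_variants(statuses))
```

Anything this returns is a candidate for "same value to the database, different value to the customer," which is exactly the situation described above.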
I’ve written extensively about the tremendous potential for big data in healthcare to drive enormous changes in how we keep people healthy for longer. It goes without saying however that all data is not created equal, and just having a large sample is not always sufficient to get the best insights.
If we needed reminding, a reminder comes via a recent study from the University of California, Berkeley. It suggests that things like emotion, behavior, and physiology vary hugely between individuals, therefore having an average over a large dataset can still produce a ‘norm’ that is wide of the mark for individuals.
“If you want to know what individuals feel or how they become sick, you have to conduct research on individuals, not on groups,” the researchers say. “Diseases, mental disorders, emotions, and behaviors are expressed within individual people, over time. A snapshot of many people at one moment in time can’t capture these phenomena.”
Variance is important.
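A toy illustration of the point, using invented numbers rather than anything from the Berkeley study: three hypothetical people with repeated measurements, where the group average lands close to only one of them.

```python
from statistics import mean

# Made-up repeated measurements for three hypothetical people.
readings = {
    "person_a": [52, 54, 53, 55],
    "person_b": [70, 71, 69, 70],
    "person_c": [88, 90, 89, 91],
}


def group_mean(data):
    """Average over every reading, ignoring who it came from."""
    return mean(value for series in data.values() for value in series)


def individual_gaps(data):
    """Difference between each person's own average and the group average."""
    overall = group_mean(data)
    return {name: mean(series) - overall for name, series in data.items()}


if __name__ == "__main__":
    print("group mean:", group_mean(readings))
    for name, gap in individual_gaps(readings).items():
        print(name, "differs from the group mean by", gap)
```

With these numbers the group "norm" describes person_b reasonably well and misses the other two by a wide margin, even though every individual's readings are tightly clustered.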
The question that I’m trying to answer is: what are the valid “letters” and “decimal numbers” from other national scripts?
I tried using the online research tool “UnicodeSet”, but that gave slightly different results (using the “alphabetic” and “numeric_type = decimal” properties) compared to what I discovered SQL Server actually accepts.
I then loaded the actual Unicode 3.2 data files only to find that the number of characters having either the “alphabetic” or “numeric_type = decimal” properties was different than both the online search and what SQL Server actually accepts.
Click through to find the real Unicode killer.
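If you want a feel for how much the answer depends on the Unicode version, Python's standard library happens to ship a frozen Unicode 3.2 database alongside the current one. The sketch below is my own, not the method from the linked post. It counts code points whose general category is Nd (decimal number), used here as a stand-in for “numeric_type = decimal”. The standard library does not expose the “alphabetic” property, so that half of the question is left out, and none of this tells you what SQL Server accepts.

```python
import sys
import unicodedata


def decimal_digits(db):
    """Return every code point whose general category is Nd in the given database."""
    return [cp for cp in range(sys.maxunicode + 1) if db.category(chr(cp)) == "Nd"]


# unicodedata itself uses the Unicode version bundled with this Python build;
# unicodedata.ucd_3_2_0 exposes the same methods backed by Unicode 3.2 data.
current = decimal_digits(unicodedata)
legacy = decimal_digits(unicodedata.ucd_3_2_0)

if __name__ == "__main__":
    print("Unicode", unicodedata.unidata_version, "decimal digits:", len(current))
    print("Unicode", unicodedata.ucd_3_2_0.unidata_version, "decimal digits:", len(legacy))
```

The two counts differ, which is a small reminder that any answer to this question has to name the Unicode version it was computed against.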
Now, when you save an execution plan out to a file, you’re potentially transmitting PI data. It goes further. When you hard code values, PI is not just in the query. Those PI values can also be stored throughout the plan in various properties.
So now you see what I mean when I say that the GDPR affects how we deal with execution plans. I’m not done yet.
Unfortunately, questions like the one Grant raises here won’t be answered until we see a few test cases in the European courts.
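As a rough sketch of what checking a saved plan might look like: the Python below walks a plan's XML and lists attributes that could carry literal values. The attribute names (StatementText, ParameterCompiledValue, ParameterRuntimeValue) are my assumption about where hard-coded values tend to appear and are certainly not a complete list. The sample plan is a hand-written, heavily simplified fragment, not real SQL Server output.

```python
import xml.etree.ElementTree as ET

# Assumed attribute names that may hold query text or literal values.
# This list is illustrative, not exhaustive.
SENSITIVE_ATTRIBUTES = ("StatementText", "ParameterCompiledValue", "ParameterRuntimeValue")


def find_sensitive_attributes(plan_xml):
    """Return (element, attribute, value) triples for attributes that may hold literals."""
    root = ET.fromstring(plan_xml)
    hits = []
    for element in root.iter():
        tag = element.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        for name in SENSITIVE_ATTRIBUTES:
            if name in element.attrib:
                hits.append((tag, name, element.attrib[name]))
    return hits


# Hypothetical, simplified plan fragment for illustration only.
sample_plan = """
<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <StmtSimple StatementText="SELECT * FROM dbo.Patient WHERE LastName = 'Smith'">
    <ColumnReference Column="@name" ParameterCompiledValue="'Smith'" />
  </StmtSimple>
</ShowPlanXML>
"""

if __name__ == "__main__":
    for tag, name, value in find_sensitive_attributes(sample_plan):
        print(f"{tag}.{name}: {value}")
```

Even in this tiny fragment the same literal shows up twice, once in the statement text and once as a compiled parameter value, which is the "stored throughout the plan" problem in miniature.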
It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.
Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best of the data has a tendency to decay in a few weeks.
The saying goes that any analytics project is about 80% data cleansing and feature extraction. I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.
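For a sense of what the first pass of that cleansing work looks like, here is a minimal sketch in Python that counts duplicate keys and missing values in a handful of rows. The function name, the field names, and the sample records are all hypothetical, and real profiling would also need to catch incorrect values, which this does not attempt.

```python
from collections import Counter


def profile_rows(rows, key_fields):
    """Count repeated keys and empty values in a list of dict rows."""
    keys = Counter(tuple(row.get(field) for field in key_fields) for row in rows)
    # Every occurrence of a key beyond the first counts as a duplicate row.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)

    missing = Counter()
    for row in rows:
        for field, value in row.items():
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1

    return {"rows": len(rows), "duplicate_rows": duplicates, "missing": dict(missing)}


# Made-up customer records for illustration.
customers = [
    {"id": 1, "email": "a@example.com", "city": "Leeds"},
    {"id": 2, "email": "", "city": "York"},
    {"id": 1, "email": "a@example.com", "city": "Leeds"},
    {"id": 3, "email": "c@example.com", "city": None},
]

if __name__ == "__main__":
    print(profile_rows(customers, key_fields=["id", "email"]))
```

Counting is the easy part. Deciding which duplicate to keep and what to do with the blanks is where the 80 to 95 percent goes.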
The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.
Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.
I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.
As a data professional, you’re going to want to know about this regulation.
Excel is easy to use, but not user friendly
Excel is on nearly every desktop in any Windows based organisation and with the Master Data Services Add-in, it puts the data well within the reach of the users. Whilst it is simple it is in no way user friendly when compared to other applications that your users may be using. Not to mention that for most this will be the only part of the solution they see! Wouldn’t it be great if there was a way to supply the same data but with an intuitive, mobile ready front end that people enjoy using?
Developers are tightly constrained
Developers like to develop, not choose options from drop down menus in a web based portal. With MDS, not only can Devs not make use of Visual Studio and the like, but they are very tightly constrained by the business rules engine. At this point we should be able to make use of our preferred IDE so that we can benefit from source control, frameworks and customised business logic.
Not scalable according to modern expectations
Finally, MDS cannot scale to handle any kind of “big data”. It’s a bit of a buzzword, but as businesses collect more and more data, we need a data management option that can grow with that data. Due to the fact that MDS must be deployed from a server, there is no easy way to meet those big data requirements.
There are a few pieces to Matt’s solution, making for an interesting read.
There are helpful string-related R packages 📦, stringr (which is built on top of the more comprehensive stringi package) comes to mind. But, at some point in your computing life, you’re gonna need to get down with regular expressions.
And so, here’s a collection of some of the Regex-related links I’ve tweeted 🐦:
Click through for links to regular expression resources.