It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises depend heavily on data these days, and that will not change in the coming years. They need to collect data in order to analyze it, and that data will never be 100% clean, pristine, or perfect.
Nearly all companies face the challenge of dirty data in the form of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best data tends to decay within a few weeks.
The saying goes that any analytics project is about 80% data cleansing and feature extraction. I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.
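To put a rough number on that dirt, here is a minimal Python sketch that counts missing values and duplicate rows in a batch of records. The sample records, the field names, and the `profile` helper are all hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "b@example.com", "country": None},
    {"id": 3, "email": "a@example.com", "country": "US"},
    {"id": 4, "email": "", "country": "DE"},
]

def profile(rows, key_fields):
    """Count missing values per field and duplicate rows by key_fields."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1
    # Rows sharing the same key tuple are treated as duplicates of each other.
    keys = Counter(tuple(row[f] for f in key_fields) for row in rows)
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    return {"missing": dict(missing), "duplicates": duplicates}

report = profile(records, key_fields=("email", "country"))
```

Which fields count as the duplicate key is a judgment call; here it is assumed that two rows with the same email and country describe the same entity.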
The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.
Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.
I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.
As a data professional, you’re going to want to know about this regulation.
Excel is easy to use, but not user friendly
Excel is on nearly every desktop in any Windows-based organisation and, with the Master Data Services Add-in, it puts the data well within the reach of the users. Whilst it is simple, it is in no way user friendly when compared to other applications that your users may be using. Not to mention that for most this will be the only part of the solution they see! Wouldn’t it be great if there was a way to supply the same data but with an intuitive, mobile-ready front end that people enjoy using?
Developers are tightly constrained
Developers like to develop, not choose options from drop-down menus in a web-based portal. With MDS, not only can Devs not make use of Visual Studio and the like, but they are very tightly constrained by the business rules engine. At this point we should be able to make use of our preferred IDE so that we can benefit from source control, frameworks and customised business logic.
Not scalable according to modern expectations
Finally, MDS cannot scale to handle any kind of “big data”. It’s a bit of a buzzword, but as businesses collect more and more data, we need a data management option that can grow with that data. Because MDS must be deployed from a server, there is no easy way to meet those big data requirements.
There are a few pieces to Matt’s solution, making for an interesting read.
There are helpful string-related R packages 📦, stringr (which is built on top of the more comprehensive stringi package) comes to mind. But, at some point in your computing life, you’re gonna need to get down with regular expressions.
And so, here’s a collection of some of the Regex-related links I’ve tweeted 🐦:
Click through for links to regular expression resources.
When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:
Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.
Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.
Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.
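The three points above boil down to "store raw, aggregate later." Here is a minimal Python sketch of that split, assuming JSON-formatted log lines. The event fields (`host`, `latency_ms`), the in-memory `raw_store`, and the function names are hypothetical stand-ins.

```python
import json
from collections import defaultdict

def ingest(line, raw_store):
    """Pipeline step: append the raw line untouched, no calculations."""
    raw_store.append(line)

def mean_latency_by_host(raw_store):
    """Downstream job: parse stored lines and aggregate on demand."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in raw_store:
        event = json.loads(line)
        bucket = totals[event["host"]]
        bucket[0] += event["latency_ms"]
        bucket[1] += 1
    return {host: total / count for host, (total, count) in totals.items()}

def max_latency_by_host(raw_store):
    """A statistic added later; it still covers all historical data."""
    result = {}
    for line in raw_store:
        event = json.loads(line)
        host = event["host"]
        result[host] = max(result.get(host, float("-inf")), event["latency_ms"])
    return result

raw_store = []  # stand-in for durable storage such as files or a queue
for line in [
    '{"host": "a", "latency_ms": 10}',
    '{"host": "a", "latency_ms": 30}',
    '{"host": "b", "latency_ms": 5}',
]:
    ingest(line, raw_store)
```

Because `ingest` never touches the payload, a bug in either aggregate cannot corrupt or drop the stored data, and a new statistic can be run over everything collected so far.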
The problem is that even if the cost of storage is much cheaper than before, there’s a fairly long tail before you get into potential revenue generation. I like the idea, but selling it is hard when you generate a huge amount of data.
In this file, our goal is to create a class library that connects to an API, authenticates, retrieves JSON formatted data, and deserializes to output for use in an SSIS package. In this particular solution, I created a separate DLL for the class library which will require me to register it in the global assembly cache on the ETL server. If your environment doesn’t allow for this, you can still use some of the code snippets here to work with JSON data.
Our order of operations will be to do the following tasks: Create a web request, attach authentication headers to it, retrieve the serialized JSON data, and deserialize it into an object. I use model-view-controller (MVC) architecture to organize my code, minus the views because I am not presenting the data to a user interface.
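The same four steps can be sketched outside of .NET as well. The following is a rough Python analogue, not the article's class library: the `Customer` model, the bearer-token scheme, and the function names are assumptions made for illustration.

```python
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class Customer:  # hypothetical model; the real fields depend on the API
    id: int
    name: str

def build_request(url, token):
    """Create the web request and attach authentication headers."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")  # assumed scheme
    request.add_header("Accept", "application/json")
    return request

def deserialize(payload):
    """Turn serialized JSON (a list of objects) into model instances."""
    return [Customer(id=item["id"], name=item["name"]) for item in json.loads(payload)]

def fetch_customers(url, token, opener=urllib.request.urlopen):
    """Run the four steps in order; opener is injectable for testing."""
    with opener(build_request(url, token)) as response:
        payload = response.read().decode("utf-8")
    return deserialize(payload)
```

Keeping request construction and deserialization in separate functions mirrors the model/controller split described above and lets each piece be checked without a network call.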
Read on for a depiction and all of the project code. Building a separate WebAPI project to retrieve this data is usually a good move, as you gain a lot of flexibility: you can run it on cheaper hardware, schedule data refreshes, send the data out to different locations, and so on.
It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified.
Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.
Read the whole thing. One big problem with trying to anonymize data is that you don’t know how much the attacker knows. Especially with outliers or smaller samples, you might be able to glean interesting information with a series of queries. Even if the application only returns aggregated results for groups of at least some size N, you can often put together a set of queries where you slice the population in different ways until you get hidden details on an individual. Phil covers these types of inference attacks.
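To make the slicing idea concrete, here is a minimal Python sketch of a differencing attack against an aggregate-only query interface. The table, the names, the salary figures, and the minimum group size are all made up for illustration.

```python
# Hypothetical salary table; names and figures are made up for illustration.
people = [
    {"name": "Ann", "dept": "IT", "age": 34, "salary": 52000},
    {"name": "Bob", "dept": "IT", "age": 41, "salary": 61000},
    {"name": "Cat", "dept": "IT", "age": 58, "salary": 97000},
    {"name": "Dan", "dept": "IT", "age": 29, "salary": 48000},
]

MIN_GROUP_SIZE = 3  # the "N" below which the application refuses to answer

def total_salary(predicate):
    """Aggregate-only query: returns a sum, or None for small groups."""
    group = [p for p in people if predicate(p)]
    if len(group) < MIN_GROUP_SIZE:
        return None  # suppressed
    return sum(p["salary"] for p in group)

# Asking about the one person over 50 directly is suppressed...
direct = total_salary(lambda p: p["dept"] == "IT" and p["age"] > 50)

# ...but two permitted queries that differ by that person reveal the value.
everyone = total_salary(lambda p: p["dept"] == "IT")
under_50 = total_salary(lambda p: p["dept"] == "IT" and p["age"] <= 50)
inferred = everyone - under_50
```

Both permitted queries pass the group-size check, yet their difference is exactly the suppressed salary. The attacker only needs to know one distinguishing fact (here, that a single person in the department is over 50).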
I have written articles before about how you can extract measures from a data model using DAX Studio and also using Power Pivot Utilities. These are both excellent tools in their own right and I encourage you to read up on those previous articles to learn more about these tools. Today, however, I am going to share another way you can extract a list of measures from an Excel Power Pivot Workbook without needing to install either of these two (excellent) software products. I often get called in to help people with their workbooks and sometimes they don’t have the right software installed for me to extract the list of measures (i.e. DAX Studio or PPU). This article explains how to extract the measures quickly without installing anything.
Matt uses a simple SQL statement to pull measure data into an Excel table, making it easy to retain the set of measures. There are some built-in documentation possibilities with this.
A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂
Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes, where machines track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore, right?
Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data, including data they are not sure they will ever need. This data could later prove to be valuable for AI, machine learning or for something else.
Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.
The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.
The raw data contains invalid values such as nulls, invalid dates, invalid drill types, duplicate wells, and invalid well information with modified dates.
This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.
This is an example of applying several constraints and rules to a single data set. Each individual rule would probably be easier to do in T-SQL, but the whole bunch becomes easier to understand with a procedural language.
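As a rough illustration of why a procedural language reads well here, this Python sketch applies the kinds of rules mentioned above one after another. It is not the article's code: the field names, the ISO date format, and the set of valid drill types are assumptions.

```python
from datetime import datetime

VALID_DRILL_TYPES = {"Vertical", "Horizontal", "Directional"}  # assumed set

def parse_date(text):
    """Return a date for ISO-formatted text, or None if it is invalid."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None

def clean_wells(rows):
    """Apply each rule in turn; keep the latest row per well id."""
    latest = {}
    rejected = []
    for row in rows:
        modified = parse_date(row.get("modified"))
        if not row.get("well_id"):
            rejected.append((row, "missing well id"))
        elif parse_date(row.get("production_date")) is None:
            rejected.append((row, "invalid production date"))
        elif row.get("drill_type") not in VALID_DRILL_TYPES:
            rejected.append((row, "invalid drill type"))
        elif modified is None:
            rejected.append((row, "invalid modified date"))
        else:
            # Duplicate wells: the most recently modified row wins.
            current = latest.get(row["well_id"])
            if current is None or modified > parse_date(current["modified"]):
                latest[row["well_id"]] = row
    return list(latest.values()), rejected

rows = [
    {"well_id": "W1", "drill_type": "Vertical", "production_date": "2015-03-01", "modified": "2016-01-01"},
    {"well_id": "W1", "drill_type": "Vertical", "production_date": "2015-03-01", "modified": "2016-06-01"},
    {"well_id": "W2", "drill_type": "Sideways", "production_date": "2014-07-15", "modified": "2016-01-01"},
    {"well_id": "W3", "drill_type": "Horizontal", "production_date": "2015-02-30", "modified": "2016-01-01"},
    {"well_id": None, "drill_type": "Vertical", "production_date": "2015-03-01", "modified": "2016-01-01"},
]
clean, rejected = clean_wells(rows)
```

Each rule is a single branch with its own rejection reason, so adding or reordering rules does not require restructuring a larger query.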