It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified
Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.
Read the whole thing. One big problem with trying to anonymize data is that you don’t know how much the attacker knows. Especially with outliers or smaller samples, you might be able to glean interesting information with a series of queries. Even if the application only returns aggregated results for some N, you can often put together a set of queries where you slice the population different ways until you get hidden details on individual. Phil covers these types of inference attacks.
I have written articles before about how you can extract measures from a data model using DAX Studio and also using Power Pivot Utilities. These are both excellent tools in their own right and I encourage you to read up on those previous articles to learn more about these tools. Today however I am going to share another way you can extract a list of measures from an Excel Power Pivot Workbook without needing to install either of these 2 (excellent) software products. I often get called in to help people with their workbooks and sometimes they don’t have the right software installed for me to extract the list of measures (ie DAX Studio or PPU). This article explains how to extract the measures quickly without installing anything.
Matt uses a simple SQL statement to pull measure data into an Excel table, making it easy to retain the set of measures. There are some built-in documentation possibilities with this.
A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂
Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes. Machines that track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore right?
Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data. Also data they are not sure they might need someday. This data could later prove to be valuable for AI, machine learning or for something else.
Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.
The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.
The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well and invalid well information with modified dates.
This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.
This is an example of applying several constraints and rules to a single data set. Each individual rule would probably be easier to do in T-SQL, but the whole bunch becomes easier to understand with a procedural language.
Seeing as the data had to be retrievable for any date, I could not simply delete the very old data. These tables also had constant inserts and updates into them, so making sure the tables remained available became important, i.e. needed to have acceptable time that the table was being locked, with time for waiting transactions to finish.
The solution I came up with does this with variable size batches. Now, with modern versions of SQL, there are other ways to do this, but the good thing about this method it works regardless of version of SQL, as well as edition. Azure SQL DB would need some modification to make it work to archive to a separate database.
Click through for the script.
Awesome stuff! We’ve got a database that was created in another container successfully attached into another one.
So at this point you may be wondering what the advantage is of doing this over mounting folders from the host? Well, to be honest, I really can’t see what the advantages are.
The volume is completely contained within the docker ecosystem so if anything happens to the docker install, we’ve lost the data. OK, OK, I know it’s in C:\ProgramData\docker\volumes\ on the host but still I’d prefer to have more control over its location.
It’s worth reading the whole thing, even though this isn’t the best way to keep data long-term. It’s important to know about this strategy even if only to keep it from accidentally ruining your day later.
What does your pipeline look like, and what steps are involved?
Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.
At first, I’d suggest just using delimited files, as it’s easiest that way. Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Parquet and Avro) or columnstore formats (like ORC) make sense for a particular data set.
The key thing to remember with SQL Server is to convert to a non-integer value by using a “decimal” as shown in the above example with “10.”. This is the same as saying “10.0”. Without the “.”, it will result in uneven splits from rounding errors of integers. It is not the result that you intend to have it you want accurate results.
To show you the difference, I have included the SQL and results of a query that uses “.” and the other that does not, with “.” being the only difference:
It’s a good article, and definitely an important thing to think about when you have large tables.
The Problem I am Trying To Solve
“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416
According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:
“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”
Taleb states more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.
It’s an idea worth grappling with. The other side of the argument is that for some problems, you won’t know what you need until you need it.
There is no value in data.
If you’re still here, then I am assuming you either a) believe I have a valid point, or b) just want to see how crazy I am for opening my new data blog with a post spouting the lack of value in data. We’ll see which option is right by the end of the post because right now, I am not so sure which one is right and which one is wrong. After all, if there is no value in data, why should companies hire data professionals and give them a pay check?
My long-form response is too long for this format, so the short response is that data requires context. I agree that action is important, but the purpose of a data visualization professional is to provide information with the relevant context to assist decision-making. It’s not that there’s no value in data or that action is everything; it’s a multi-faceted process, and the specific relevant data will depend upon the industry. In professional sports, front offices certainly use accurate(-ish) metrics which show the worst performing players on the team because sports leagues are zero-sum games. Finding out Fred in Accounts Receivable spends the most time at the coffeemaker each day (17 minutes instead of 12 minutes!) matters a lot less, so unless you’re doing a Taylor-style factory study—and if you are, I’ll have other words with you that also aren’t apropos here—it doesn’t rate high enough in the relative priority list.