Press "Enter" to skip to content


Keep That Data Raw

Archana Madhavan argues that you should retain your raw data:

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

  • Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.

  • Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.

  • Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.

The problem is that even though storage is much cheaper than it used to be, there’s a fairly long tail before retained raw data turns into potential revenue.  I like the idea, but it’s a hard sell when you generate a huge amount of data.
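
The second bullet is the one that tends to bite: aggregates you didn’t plan for can only be computed over data you still have.  As a minimal sketch of the “keep the raw events, aggregate on read” idea (the file name and event fields below are invented for illustration, not taken from the article):

```python
import json
from collections import Counter
from pathlib import Path

RAW_LOG = Path("events.jsonl")  # hypothetical raw-event store (JSON Lines)

def ingest(event: dict) -> None:
    """Append the raw event untouched; no aggregation happens at ingest time."""
    with RAW_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def events():
    """Replay every raw event ever captured."""
    with RAW_LOG.open() as f:
        for line in f:
            yield json.loads(line)

def signups_by_country() -> Counter:
    """A metric nobody asked for at ingest time; it can still be computed
    over the full history because the raw events are still around."""
    return Counter(e["country"] for e in events() if e.get("type") == "signup")

if __name__ == "__main__":
    ingest({"type": "signup", "country": "DE", "user": 42})
    ingest({"type": "login", "country": "DE", "user": 42})
    print(signups_by_country())  # Counter({'DE': 1})
```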


Building An API To Read An API

Jesse Seymour shows how to build a WebAPI project to retrieve JSON data from another API:

In this file, our goal is to create a class library that connects to an API, authenticates, retrieves JSON formatted data, and deserializes to output for use in a SSIS package.  In this particular solution, I created a separate DLL for the class library which will require me to register it in the global assembly cache on the ETL server.  If your environment doesn’t allow for this, you can still use some of the code snippets here to work with JSON data.

Our order of operations will be to do the following tasks:  Create a web request, attach authentication headers to it, retrieve the serialized JSON data, and deserialize it into an object.  I use model-view-controller (MVC) architecture to organize my code, minus the views because I am not presenting the data to a user interface.

Read on for a depiction and all of the project code.  Building a separate WebAPI project to retrieve this data is usually a good move, as you gain a lot of flexibility:  you can run it on cheaper hardware, schedule data refreshes, send the data out to different locations, and so on.
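
Jesse’s solution is a C# class library; purely as a hedged illustration of the same four steps in another language (the endpoint, token, and field names below are placeholders, not anything from his article), a Python version might look like this:

```python
from dataclasses import dataclass
from typing import List

import requests

API_URL = "https://example.com/api/widgets"  # placeholder endpoint
API_TOKEN = "not-a-real-token"               # placeholder credential

@dataclass
class Widget:
    """The 'model' in MVC terms; the fields are purely illustrative."""
    id: int
    name: str

def fetch_widgets() -> List[Widget]:
    # 1. Create the web request and 2. attach the authentication header.
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    # 3. Retrieve the serialized JSON and 4. deserialize it into objects
    #    that downstream code (an SSIS script component, say) can consume.
    return [Widget(id=item["id"], name=item["name"]) for item in response.json()]
```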


Inference Attacks

Phil Factor explains that your technique for pseudonymizing data doesn’t necessarily anonymize the data:

It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified.

Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.

Read the whole thing.  One big problem with trying to anonymize data is that you don’t know how much the attacker already knows.  Especially with outliers or small samples, you might be able to glean interesting information from a series of queries.  Even if the application only returns aggregates over groups of some minimum size N, you can often put together a set of queries that slice the population in different ways until you tease out hidden details about an individual.  Phil covers these types of inference attacks.
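
As a toy example of the simplest such technique, a differencing attack, two queries that each respect an aggregation-only rule can still expose one person (the data below is made up):

```python
# The "application" below only ever returns sums over groups, yet two
# overlapping aggregate queries are enough to reveal one person's value.
salaries = {"alice": 70_000, "bob": 82_000, "carol": 65_000, "dave": 91_000}

def total_salary(predicate) -> int:
    """Stand-in for an endpoint that only returns aggregated results."""
    return sum(value for name, value in salaries.items() if predicate(name))

everyone = total_salary(lambda name: True)
everyone_but_dave = total_salary(lambda name: name != "dave")

# Neither query returned an individual value, but their difference does.
print(everyone - everyone_but_dave)  # 91000, i.e. Dave's salary
```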


Finding DAX Measures In Use

Matt Allington shows an easy way to enumerate the DAX measures in a Power Pivot workbook in Excel:

I have written articles before about how you can extract measures from a data model using DAX Studio and also using Power Pivot Utilities.  These are both excellent tools in their own right and I encourage you to read up on those previous articles to learn more about these tools. Today however I am going to share another way you can extract a list of measures from an Excel Power Pivot Workbook without needing to install either of these 2 (excellent) software products.  I often get called in to help people with their workbooks and sometimes they don’t have the right software installed for me to extract the list of measures (ie DAX Studio or PPU).  This article explains how to extract the measures quickly without installing anything.

Matt uses a simple SQL statement to pull the measure definitions into an Excel table, making it easy to retain the full set of measures.  That also opens up some handy self-documentation possibilities.


Diving Into The Data Lake

Jesse Gorter explains the data lake metaphor:

A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂

Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes. Machines that track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore right?

Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data. Also data they are not sure they might need someday. This data could later prove to be valuable for AI, machine learning or for something else.

Read the whole thing.  The data lake concept is powerful, but it requires at least as much data governance as prior models.  Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.


Data Cleanup Using Drools

Rathnadevi Manivannan gives an example of using Drools to create rule-based data cleansing processes:

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.

The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well and invalid well information with modified dates.

This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.

This is an example of applying several constraints and rules to a single data set.  Any individual rule would probably be easier to write in T-SQL, but the full set of rules becomes easier to understand and maintain in a dedicated rules engine.
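
The article itself uses Drools and its DRL rule files; as a rough sketch of the same “rules as data” idea in Python (the field names and rule set below are invented, not taken from the oil well datasets):

```python
from datetime import datetime

def valid_date(value) -> bool:
    """True if the value parses as an ISO-style date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Each rule is a name plus a predicate; adding or retiring a rule never
# touches the code that applies them, which is the appeal of a rules engine.
RULES = [
    ("well_id_present",    lambda r: bool(r.get("well_id"))),
    ("drill_type_known",   lambda r: r.get("drill_type") in {"vertical", "horizontal", "directional"}),
    ("production_date_ok", lambda r: valid_date(r.get("production_date"))),
]

def violations(record: dict) -> list:
    """Return the names of every rule the record breaks."""
    return [name for name, check in RULES if not check(record)]

record = {"well_id": "AR-0001", "drill_type": "sideways", "production_date": "2016-13-01"}
print(violations(record))  # ['drill_type_known', 'production_date_ok']
```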


Batched Data Archival

Anders Pedersen has a script to archive data in smaller batches:

Seeing as the data had to be retrievable for any date, I could not simply delete the very old data.  These tables also had constant inserts and updates into them, so making sure the tables remained available became important, i.e. the table could only be locked for an acceptable amount of time, with room for waiting transactions to finish.

The solution I came up with does this with variable-size batches.  Now, with modern versions of SQL Server, there are other ways to do this, but the good thing about this method is that it works regardless of the version of SQL Server, as well as the edition.  Azure SQL DB would need some modification to make it work to archive to a separate database.

Click through for the script.
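
This isn’t Anders’s script (click through for that); it’s just a minimal sketch of the batching pattern driven from Python, with the connection string, table names, batch size, and cutoff date all made up:

```python
import time

import pyodbc

# Connection string, table names, cutoff, and batch size are all placeholders.
conn = pyodbc.connect("DSN=ArchiveServer;Trusted_Connection=yes", autocommit=False)
cursor = conn.cursor()

BATCH_SIZE = 5000
CUTOFF_DATE = "2015-01-01"

while True:
    # Move one small batch; OUTPUT ... INTO copies the deleted rows into the
    # archive table within the same statement, so each batch is atomic.
    cursor.execute(
        f"""
        DELETE TOP ({BATCH_SIZE}) FROM dbo.Orders
        OUTPUT DELETED.* INTO dbo.OrdersArchive
        WHERE OrderDate < ?;
        """,
        CUTOFF_DATE,
    )
    moved = cursor.rowcount
    conn.commit()              # short transactions keep lock time acceptable
    if moved < BATCH_SIZE:     # nothing (or almost nothing) left to move
        break
    time.sleep(1)              # give waiting transactions a chance to run

conn.close()
```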


Creating Docker Volumes

Andrew Pruski continues his series on long-term data storage and Docker:

Awesome stuff! We’ve got a database that was created in another container successfully attached into another one.

So at this point you may be wondering what the advantage is of doing this over mounting folders from the host? Well, to be honest, I really can’t see what the advantages are.

The volume is completely contained within the docker ecosystem so if anything happens to the docker install, we’ve lost the data. OK, OK, I know it’s in C:\ProgramData\docker\volumes\ on the host but still I’d prefer to have more control over its location.

It’s worth reading the whole thing, even though this isn’t the best way to keep data long-term.  It’s important to know about this strategy even if only to keep it from accidentally ruining your day later.
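
For reference, the named-volume approach under discussion looks roughly like this through the Docker SDK for Python (Andrew’s series uses the docker CLI on Windows; the image tag, password, and paths here are placeholders built around the Linux image):

```python
import docker

client = docker.from_env()

# Create a named volume; Docker decides where it lives on the host,
# which is exactly the loss of control the post is wary of.
client.volumes.create(name="sqldata")

# Attach the volume at SQL Server's data directory so the databases
# outlive any one container.
client.containers.run(
    "mcr.microsoft.com/mssql/server:2017-latest",  # placeholder image tag
    name="sql1",
    detach=True,
    environment={"ACCEPT_EULA": "Y", "SA_PASSWORD": "P@ssw0rd123!"},  # demo-only password
    ports={"1433/tcp": 1433},
    volumes={"sqldata": {"bind": "/var/opt/mssql", "mode": "rw"}},
)
```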


Choosing A Hadoop Data Format

Silvia Oliveros has a set of considerations to help you choose a file format for your data in Hadoop:

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

At first, I’d suggest just using delimited files, as that’s the easiest way to get started.  Once you have developed a bit of Hadoop maturity, it makes sense to think about whether row-oriented formats (like Avro and Sequence files) or columnar formats (like Parquet and ORC) are a better fit for a particular data set.
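
As a small, single-machine illustration of the row-versus-columnar trade-off (this uses pandas rather than anything Hadoop-specific, the file names are arbitrary, and to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": range(100_000),
    "user_id": [i % 1_000 for i in range(100_000)],
    "payload": ["x" * 20] * 100_000,
})

# Delimited text: row-oriented, human-readable, no embedded schema.
df.to_csv("events.csv", index=False)

# Parquet: columnar, compressed, and the schema travels with the file.
df.to_parquet("events.parquet")

# An analytical query that touches one column only has to read that
# column from the Parquet file; the CSV must be parsed end to end.
user_ids = pd.read_parquet("events.parquet", columns=["user_id"])
```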


Updating Large Tables In SQL Server And Oracle

Jana Sattainathan has a post on how he was able to move and update billions of rows, using both Oracle and SQL Server as examples:

The key thing to remember with SQL Server is to convert to a non-integer value by using a “decimal” as shown in the above example with “10.”. This is the same as saying “10.0”. Without the “.”, it will result in uneven splits from rounding errors of integers. It is not the result that you intend to have if you want accurate results.

To show you the difference, I have included the SQL and results of a query that uses “.” and the other that does not, with “.” being the only difference:

It’s a good article, and definitely an important thing to think about when you have large tables.
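
I don’t have Jana’s exact query in front of me, but the arithmetic trap he describes is easy to reproduce; here’s a hypothetical illustration of why the trailing “.” matters when computing batch boundaries:

```python
# Pretend IDs 1..TOTAL_IDS need to be split into ten batches for the update.
TOTAL_IDS = 1_000_003
BATCHES = 10

# Integer division (the T-SQL "/ 10" behaviour): the batch width truncates,
# so the highest IDs spill over into an unplanned eleventh batch.
width_int = TOTAL_IDS // BATCHES         # 100000
print(TOTAL_IDS // width_int)            # 10 -> batch number 10, an extra batch

# Decimal division (the "/ 10." behaviour): the width keeps its fraction
# and every ID lands in one of the ten intended batches (0..9).
width_dec = TOTAL_IDS / 10.0             # 100000.3
print(int(TOTAL_IDS // width_dec))       # 9 -> the last intended batch
```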
