There are helpful string-related R packages 📦;
stringr (which is built on top of the more comprehensive
stringi package) comes to mind. But, at some point in your computing life, you’re gonna need to get down with regular expressions.
And so, here’s a collection of some of the Regex-related links I’ve tweeted 🐦:
Click through for links to regular expression resources.
When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:
Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.
Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.
Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.
The problem is that even if the cost of storage is much cheaper than before, there’s a fairly long tail before you get into potential revenue generation. I like the idea, but selling it is hard when you generate a huge amount of data.
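As a rough illustration of the "keep the pipeline dumb" advice, here is a minimal Python sketch. The event fields (user, page) and the function names are made up for the example; the point is only that the pipeline step stores each line untouched, while any aggregate is computed later from the stored raw data.

```python
import json
from collections import defaultdict


def append_raw(log, line):
    """Pipeline step: store the line untouched, with no parsing or aggregation."""
    log.append(line)


def count_by(log, field):
    """Offline step: compute an aggregate from the stored raw events."""
    counts = defaultdict(int)
    for line in log:
        event = json.loads(line)
        counts[event[field]] += 1
    return dict(counts)


# Hypothetical events; a real pipeline would write to durable storage, not a list.
raw_log = []
for line in [
    '{"user": "a", "page": "/home"}',
    '{"user": "b", "page": "/home"}',
    '{"user": "a", "page": "/cart"}',
]:
    append_raw(raw_log, line)

# The aggregate we planned for up front...
by_page = count_by(raw_log, "page")
# ...and one we only thought of later, still available over the old data.
by_user = count_by(raw_log, "user")
```

Because the raw lines are kept, the second aggregate covers historical data too, which is exactly what an in-pipeline calculation could not give you.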
In this file, our goal is to create a class library that connects to an API, authenticates, retrieves JSON formatted data, and deserializes to output for use in an SSIS package. In this particular solution, I created a separate DLL for the class library which will require me to register it in the global assembly cache on the ETL server. If your environment doesn’t allow for this, you can still use some of the code snippets here to work with JSON data.
Our order of operations will be to do the following tasks: Create a web request, attach authentication headers to it, retrieve the serialized JSON data, and deserialize it into an object. I use model-view-controller (MVC) architecture to organize my code, minus the views because I am not presenting the data to a user interface.
Read on for a depiction and all of the project code. Building a separate WebAPI project to retrieve this data is usually a good move, as you gain a lot of flexibility: you can run it on cheaper hardware, schedule data refreshes, send the data out to different locations, and so on.
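The linked project is a .NET class library; the sketch below is not that code, just the same order of operations (create a request, attach authentication headers, retrieve the JSON, deserialize it into an object) in Python's standard library. The Order model, the bearer-token scheme, and the URL are all assumptions for illustration.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical model class; the real API's fields will differ."""
    id: int
    customer: str


def build_request(url, token):
    """Create the web request and attach authentication headers.

    A bearer token is an assumption; the real API may use another scheme.
    """
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Accept", "application/json")
    return req


def fetch_json(req):
    """Send the request and return the serialized JSON body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def deserialize(payload):
    """Turn a JSON array of objects into a list of Order instances."""
    return [Order(**item) for item in json.loads(payload)]


# Usage with a canned payload, so nothing here touches the network.
req = build_request("https://example.invalid/api/orders", "my-token")
orders = deserialize('[{"id": 1, "customer": "Ann"}, {"id": 2, "customer": "Bo"}]')
```

Keeping the request, the fetch, and the deserialization in separate functions mirrors the model/controller split described above and makes the deserializer testable without a live API.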
It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified.
Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.
Read the whole thing. One big problem with trying to anonymize data is that you don’t know how much the attacker knows. Especially with outliers or smaller samples, you might be able to glean interesting information with a series of queries. Even if the application only returns aggregated results for some N, you can often put together a set of queries where you slice the population different ways until you get hidden details on an individual. Phil covers these types of inference attacks.
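To make the "slice the population different ways" point concrete, here is a toy differencing attack in Python. The names, salaries, and the minimum group size of 3 are invented for the example; it is a sketch of the general idea, not of any technique specific to Phil's article.

```python
# Hypothetical sensitive data that the application never returns row by row.
salaries = {"ann": 50, "bob": 60, "cy": 70, "dee": 200}

MIN_GROUP = 3  # the application refuses aggregates over fewer people than this


def total_salary(names):
    """Aggregate-only query with small-group suppression."""
    group = [n for n in names if n in salaries]
    if len(group) < MIN_GROUP:
        raise ValueError("group too small, result suppressed")
    return sum(salaries[n] for n in group)


# Both queries are individually allowed: each covers at least MIN_GROUP people.
everyone = total_salary(["ann", "bob", "cy", "dee"])
all_but_dee = total_salary(["ann", "bob", "cy"])

# Subtracting the two aggregates recovers one person's suppressed value.
inferred_dee = everyone - all_but_dee
```

Asking for Dee's salary directly would be refused, yet two permitted aggregates that differ by one person give it away.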
I have written articles before about how you can extract measures from a data model using DAX Studio and also using Power Pivot Utilities. These are both excellent tools in their own right and I encourage you to read up on those previous articles to learn more about these tools. Today, however, I am going to share another way you can extract a list of measures from an Excel Power Pivot Workbook without needing to install either of these 2 (excellent) software products. I often get called in to help people with their workbooks and sometimes they don’t have the right software installed for me to extract the list of measures (i.e. DAX Studio or PPU). This article explains how to extract the measures quickly without installing anything.
Matt uses a simple SQL statement to pull measure data into an Excel table, making it easy to retain the set of measures. There are some built-in documentation possibilities with this.
A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂
Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes. Machines that track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore right?
Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data, including data they are not sure they will ever need. This data could later prove to be valuable for AI, machine learning or for something else.
Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.
The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.
The raw data contains invalid values such as nulls, invalid dates, invalid drill types, duplicate wells, and invalid well information with modified dates.
This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.
This is an example of applying several constraints and rules to a single data set. Each individual rule would probably be easier to do in T-SQL, but the whole bunch becomes easier to understand with a procedural language.
Seeing as the data had to be retrievable for any date, I could not simply delete the very old data. These tables also had constant inserts and updates, so keeping them available became important: each lock on a table had to be held for an acceptably short time, leaving time for waiting transactions to finish.
The solution I came up with does this with variable size batches. Now, with modern versions of SQL, there are other ways to do this, but the good thing about this method is that it works regardless of version of SQL, as well as edition. Azure SQL DB would need some modification to make it work to archive to a separate database.
Click through for the script.
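For a feel of the general shape, here is a minimal Python sketch using the standard library's sqlite3 module rather than SQL Server, so it is not the linked script. The table names (live, archive), the created column, and the rule of doubling or halving the batch size against a target duration are all assumptions; it shows one plausible reading of "variable size batches", with one short transaction per batch so locks are released between batches.

```python
import sqlite3
import time


def archive_in_batches(conn, cutoff, batch=100, target_secs=0.5,
                       min_batch=10, max_batch=10000):
    """Move rows older than cutoff from live to archive, one short
    transaction per batch, resizing the batch based on how long it took."""
    moved = 0
    while True:
        start = time.perf_counter()
        with conn:  # commits (or rolls back) this batch on exit
            top = conn.execute(
                "SELECT MAX(id) FROM (SELECT id FROM live "
                "WHERE created < ? ORDER BY id LIMIT ?)",
                (cutoff, batch),
            ).fetchone()[0]
            if top is None:  # nothing left to archive
                return moved
            conn.execute(
                "INSERT INTO archive SELECT * FROM live "
                "WHERE created < ? AND id <= ?", (cutoff, top))
            cur = conn.execute(
                "DELETE FROM live WHERE created < ? AND id <= ?", (cutoff, top))
            moved += cur.rowcount
        elapsed = time.perf_counter() - start
        # Slow batch: shrink to shorten lock time. Fast batch: grow.
        if elapsed > target_secs:
            batch = max(min_batch, batch // 2)
        else:
            batch = min(max_batch, batch * 2)


# Demo on an in-memory database with 250 old rows and 50 recent ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE live (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE TABLE archive (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany("INSERT INTO live (created) VALUES (?)",
                 [("2015-01-01",)] * 250 + [("2024-01-01",)] * 50)
conn.commit()

moved = archive_in_batches(conn, "2020-01-01")
live_left = conn.execute("SELECT COUNT(*) FROM live").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
```

On a busy server you would likely also pause between batches and tune the target duration to whatever lock time your workload tolerates.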
Awesome stuff! We’ve got a database that was created in one container successfully attached to another one.
So at this point you may be wondering what the advantage is of doing this over mounting folders from the host? Well, to be honest, I really can’t see what the advantages are.
The volume is completely contained within the docker ecosystem so if anything happens to the docker install, we’ve lost the data. OK, OK, I know it’s in C:\ProgramData\docker\volumes\ on the host but still I’d prefer to have more control over its location.
It’s worth reading the whole thing, even though this isn’t the best way to keep data long-term. It’s important to know about this strategy even if only to keep it from accidentally ruining your day later.
What does your pipeline look like, and what steps are involved?
Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.
At first, I’d suggest just using delimited files, as it’s easiest that way. Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Avro) or columnstore formats (like Parquet and ORC) make sense for a particular data set.