Press "Enter" to skip to content

Category: ETL / ELT

The Continued Importance Of ETL

Andy Leonard explains that good old ETL remains vital to an organization:

A Problem

As Jen points out earlier in her Analytics Market Commoditization and Consolidation post (you should read it all – it’s awesome – like all of Jen’s posts!) many analytics solution providers share the “Same look, same marketing story, same saves time and allows users [to] avoid evil IT.”

I can hear some of you thinking, “Are you telling us analytics doesn’t work, Andy?” Goodness no. I’m telling you hype and sales strategy work in the analytics market as well as anywhere. When asked why a solution may not perform to expectations, the #1 response is “your data is not clean.”

Data engineering (think ETL specifically designed for analytics and “big data”) is the backbone behind data science.  To Andy’s point, the data engineer’s job is to get clean, context-heavy data in front of a data scientist, the same way a “classical” Business Intelligence specialist works with analysts.

Comments closed

Using Talend To Build Shape Files

Paul Hernandez has a demo where he uses Talend’s product to convert latitude and longitude pairs to a shape file:

Input data

Customers coordinates: a flat file containing x,y coordinates for every customer.

Municipalities in Austria: a shape file with multi-polygons defining the municipalities areas in Austria: source

Goal

The goal was to “look-up” the coordinates in the shape file in order to get the municipality code GKZ which in german stand for “Gemeindekennzahl”.

Check out the demo.

Comments closed

FlowFile Continuation In NiFi

Tim Spann describes one of the more powerful features of Apache NiFi:

Sometimes, you need to backup your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in a permanent storage and want to reconstitute it later like orange juice and add it back into the flow or restart it.

This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).

Read on for an explanation of how FlowFile streams can do this.

Comments closed

Moving Files In Azure Data Factory

Meagan Longoria has a workaround for how you cannot move a file using Azure Data Factory:

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.

Read on for the full details on this workaround.  Also, vote on this feedback item if you want the ability to move files instead of just copying them.

Comments closed

Shredding Excel With R

John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet ( over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).

The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.

Comments closed

Importing CSV Files In Power BI

Gil Raviv explains the new “combine binaries” feature of Power BI Desktop:

The Power BI team has recently released an enhanced “combine binaries” experience as part of November 2016 update to Power BI Desktop. (Jargon Alert:  “Combine Binaries” is a scary term.  Instead it should be named “Magically combine multiple files together into one table and make me SUPER happy.”)  The improved experience can drastically help you to import multiple Excel or other files from a folder and avoid writing advanced query functions. But today we will focus on a specific scenario, which is so common that it deserves this special post – Handling CSV files.

In fact, today’s blog post is actually the first post in “The CSV Series”. I hope you will enjoy it. To celebrate the November update of Power BI Desktop, we will review the improved experience, and will walk you through one of the most common scenarios that is now so easy to implement – Importing multiple CSV files from a folder, including parts of their filenames.

This looks very useful.

Comments closed

Quickly Reloading Tables

Kenneth Fisher uses table partitioning to perform fast loads of data:

Now if this table is paritioned you’d use SWITCH and bring in a new partition.

For those that don’t know, when a table is partitioned, you can create a new empty partition, and a new empty table, load the table, make the table exactly match the partition (structure, check constraints, & indexes for example) and you can SWITCH it in. The SWITCH part is a metadata operation and is fast!

But what do you do if the table isn’t partitioned? Well, I was having a conversation with Andy Mallon (b/t) and he reminded me of something.

Read on for the details.  The upshot is that you can take your time loading the second table and once you’re ready to swap out, it’s a quick metadata change.  That’s really useful for ETL scenarios.

Comments closed

Exporting To Flat Files

Ben Weissman shows how to dump tables to flat files:

In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flatfiles don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns as well as a list of primary keys (we’ll need the latter for a later step :)):

Like most Biml-related things, it’s not that many lines of code, so check it out.

Comments closed

Bulk Load Tools

Erland Sommarskog has a brand new essay:

The bulk-load tools have been in the product for a long time and they are showing their age. When they work for you, they are powerful. But you need to understand that these tools are binary to their heart, and they have no built-in rule that says that each line a file is a record – they don’t even think in lines. You also need to understand that there are file formats they are not able to handle.

I have tried to arrange the material in this article so that if you have a simple problem, you only need to read the first two chapters after the introduction. I first introduce you them to their mindset, which is likely to be different from yours. Next I cover the basic options to use for every-day work. If you have a more complex file, you will need to use a format file and the next three chapters are for you. I first describe how format files work as such, and the next two chapters show how to use format files for common cases for import and export respectively. This is followed by a chapter about Unicode files, including files encoded in UTF‑8. Then comes a chapter about “advanced” options, including how to load explicit values into an IDENTITY column. A short chapter covers permissions. The last chapter discusses XML format files, and I am not sorry at all if you give this chapter a blind eye – I find XML format files to be of dubious value.

I haven’t had a chance to read this yet, but because I have never had good luck with bcp and BULK INSERT, it’s on my to-read list.

Comments closed