Top-Down ETL With Powershell

Max Trinidad continues his series on top-down ETL using Powershell:

After all the previous functions has been loaded, just type the following one-liner:

Process-PSObjectToSQL -SQLDataObj $SQLData;

This sample script code can serve as a startup Template to load data into SQL Server.

This sample SQL data load will fail. Here’s when the Try/Catch/Finally will work for you in trapping what went wrong. Adding the necessary code to provide that additional information to troubleshoot and fix the problem.

Parts one and two available as well.

Incorporating NiFi Into Brownfield Code

Kevin Feasel

2016-06-07

ETL, Hadoop

Paul Boal discusses how he incorporated Apache NiFi in an existing process:

Typically, data warehousing and ETL tool vendors recommended that we write your own custom components. After all, the target market for ETL tools is a space where the tools are specifically marketed as reducing the need for “error prone and time consuming” manual coding. When I ran across this tutorial on writing your own NiFi processor it occurred to me that NiFi is the exact opposite. It’s both Open Source and designed for extensibility from the ground up. I found it quite reasonable to write a custom NiFi processor that leverages our existing code base.

The existing code is a Java program with separate classes for each device vendor, all with the same interface to abstract the nuances of each vendor from the main data export program. This interface follows a traditional paradigm: login, query, query, query, logout. Given that my input to NiFi above takes in simple username, password, and query criteria arguments, it seems trivial to create a NiFi processor class that adapts the existing code into the NiFi API. Here’s a slightly abbreviated version of the actual code. (In reality, it’s all of 70 lines of code.)

In almost any realistic scenario, you’re not going to have the opportunity to start from scratch.  You will always have legacy components, external dependencies, and existing user bases to satisfy.  I like this article because it moves forward from that starting point.

Powershell ETL, Part 2

Max Trinidad has part 2 of his Powershell ETL series:

If you notice, in the above cmdlet the where-clause I’m selecting to use the Column1 property instead of a reasonable label. In my scenario the data in the CSV file contain variable columns fopr its different data types such as: Info, Error, and System. So, it was easy to identify the total number of columns to be 15 columns.

Now, using the cmdlet “Import-Csv” using the parameter “-Header”, you can define a list columns when you build the $Logdata object. We create the $header variable with the column-names separated by comma.

Keep an eye out for part 3.  In the meantime, check out part 1 if you haven’t already.

Creating An ETL Process In Powershell

Max Trinidad is building a Powershell-based solution for ETL from scratch:

So after the drive gets mapped to T: drive, we need to look and collect the type of logs we want to pull. In my scenario, I’m looking for all log labeled “*.Events.*.log.*”. One caveat discovered previously, these text logs file doesn’t contains servername information. But, No Problem! This is another opportunity to be creative with PowerShell.

Here we use the Cmdlet “Get-ChildItem” with the “Sort-Object” to sort the results by its object property “LastWriteTime“. You will file this property very useful later as you progress in our data collection process. This results set wil need to be stored in PowerShell Object

I’m interested in seeing where this goes, especially because my first choice for ETL would be SSIS with Biml.

Clustered Columnstore Index Load With SSIS

Koen Verbeeck looks at loading a clustered columnstore index using SSIS:

I stumbled upon this MSDN blog post: SQL Server 2016 SSIS Data Flow Buffer Auto Sizing capability benefits data loading on Clustered Columnstore tables (catchy title). It explains how you can set the buffer properties of the data flow to try to insert data directly into compressed row groups instead of in the delta store. They fail to achieve this using SSIS 2014 and then they explain how using the new AutoAdjustBufferSize property of SSIS 2016 works miracles and everything is loaded directly into compressed row groups. Hint: you want to avoid loading data into the delta store, as it is row storage and you need to wait for the  tuple mover to load the data to the CCI in the background.

However, it’s still possible to achieve the same using SSIS 2014 (or earlier). Niko Neugebauer (blog |twitter) shows this in his post Clustered Columnstore Indexes – part 51 (“SSIS, DataFlow & Max Buffer Memory”). It still depends on the estimated row size, but using these settings you should get better results:

This advice is a bit different from loading standard rowstore-based tables, but serves to pack as many rows into each columnstore row group as possible.

File Processing Pattern

Bill Fellows describes a pattern for processing files in an ETL scenario:

All ETL processing will use a common root node/directory. I call it SSISData to make it fairly obvious what is in there but call it as you will. On my dev machine, this is usually sitting right off C:. On servers though, oh hell no! The C drive is reserved for the OS. Instead, I work with the DBAs to determine where this will actually be located. Heck, they could make it a UNC path and my processing won’t care because I ensure that root location is an externally configurable “thing.” Whatever you’re using for ETL, I’m certain it will support the concept of configuration. Make sure the base node is configurable.

A nice thing about anchoring all your file processing to a head location is that if you are at an organization that is judicious with handing out file permissions, you don’t have to create a permission request for each source of data. Get your security team to sign off on the ETL process having full control to a directory structure starting here. The number of SSIS related permissions issues I answer on StackOverflow is silly.

It comes down to consistency and cleanliness.  Plan ahead and you’ll have a much nicer go of things.  Bill also provides a Biml POC, so check that out as well.

Ragged Right Files

Sifiso Ndlovu walks us through ragged right formatted files in Integration Services:

The configuration of columns is perhaps a critical part of the entire ETL process as it helps us build mapping metadata for your ETL. In fact, regardless of where or not SSIS/SSMS can detect delimiters, if you skip Column Mapping section – your ETL will fail validation. In order to clarify how Ragged right formatted files work, I have gone a step back and used Figure 4 to actually displayed a preview of our fictitious Fruits transaction dataset from Notepad++. It can already be seen from Notepad++ that the file only has row delimiter in a form of CRLF.

Read the whole thing.

Using The Import-Export Wizard

Kevin Feasel

2016-02-17

ETL

James Anderson uses the Import-Export wizard to import data:

Whilst working on an upcoming post I realised that I needed a decent data set on which to test some of the new visualisations in SSRS 2016. I remembered this post on open data sets by Kendra Little that mentioned Data.gov. There are a lot of different data sets here but I settled on a set describing social media usage by organisations in New York.

The Import-Export wizard is nice for these types of one-off data loads—it’s the gateway into SSIS.

Options To Capture Changed Data

Koen Verbeeck looks at various ways of capturing changed data:

  • In some very rare cases, you can actually use change data capture or change tracking on the source system. If you get one of those features implemented, you’re golden. But most of the time you’re not, as a lot of administrators don’t like them because of potential performance impact.

Koen lists several options.  One additional option is to use triggers to capture changes in a queue table.  If you are dealing with SCD-1 changes (in which you do not need a full reckoning of history) or periodic SCD-2 (in which you keep history but are okay with smashing some changes together if they’re within a time period between ETL loads), loading IDs of changed records into a queue table is reasonably efficient and gets around trying to make sure everybody updates the modified date.  It has its own drawbacks, though, starting with it using triggers…

Incremental Loading Using Datetime Columns

Reza Rad shows a pattern for implementing incremental loads using modified dates:

The idea behind this method is to store the latest ETL run time in a config or log table, and then in the next ETL run just load records from the source table that have modified (with their modified date greater than or equal to) after the latest ETL run datetime. This will create the change set for the data table. The change set might contains inserted, updated, or deleted records. to identify which change happened on the record you need to compare the change set with existing records and separate inserted, updated, and deleted records. This change set with the action on each record can be inserted into staging tables, and then be used to apply on the fact table based on appropriate action.

In my experience, the hardest part about this is making sure people update ModifiedTime when they update rows in the table.

Categories

January 2019
MTWTFSS
« Dec  
 123456
78910111213
14151617181920
21222324252627
28293031