Press "Enter" to skip to content

Category: ETL / ELT

Ragged Right Files

Sifiso Ndlovu walks us through ragged right formatted files in Integration Services:

The configuration of columns is perhaps a critical part of the entire ETL process as it helps us build mapping metadata for your ETL. In fact, regardless of where or not SSIS/SSMS can detect delimiters, if you skip Column Mapping section – your ETL will fail validation. In order to clarify how Ragged right formatted files work, I have gone a step back and used Figure 4 to actually displayed a preview of our fictitious Fruits transaction dataset from Notepad++. It can already be seen from Notepad++ that the file only has row delimiter in a form of CRLF.

Read the whole thing.

Comments closed

Using The Import-Export Wizard

James Anderson uses the Import-Export wizard to import data:

Whilst working on an upcoming post I realised that I needed a decent data set on which to test some of the new visualisations in SSRS 2016. I remembered this post on open data sets by Kendra Little that mentioned Data.gov. There are a lot of different data sets here but I settled on a set describing social media usage by organisations in New York.

The Import-Export wizard is nice for these types of one-off data loads—it’s the gateway into SSIS.

Comments closed

Options To Capture Changed Data

Koen Verbeeck looks at various ways of capturing changed data:

  • In some very rare cases, you can actually use change data capture or change tracking on the source system. If you get one of those features implemented, you’re golden. But most of the time you’re not, as a lot of administrators don’t like them because of potential performance impact.

Koen lists several options.  One additional option is to use triggers to capture changes in a queue table.  If you are dealing with SCD-1 changes (in which you do not need a full reckoning of history) or periodic SCD-2 (in which you keep history but are okay with smashing some changes together if they’re within a time period between ETL loads), loading IDs of changed records into a queue table is reasonably efficient and gets around trying to make sure everybody updates the modified date.  It has its own drawbacks, though, starting with it using triggers…

Comments closed

Incremental Loading Using Datetime Columns

Reza Rad shows a pattern for implementing incremental loads using modified dates:

The idea behind this method is to store the latest ETL run time in a config or log table, and then in the next ETL run just load records from the source table that have modified (with their modified date greater than or equal to) after the latest ETL run datetime. This will create the change set for the data table. The change set might contains inserted, updated, or deleted records. to identify which change happened on the record you need to compare the change set with existing records and separate inserted, updated, and deleted records. This change set with the action on each record can be inserted into staging tables, and then be used to apply on the fact table based on appropriate action.

In my experience, the hardest part about this is making sure people update ModifiedTime when they update rows in the table.

Comments closed

Migrating Always Encrypted Data

Nitish Upreti shows us how to migrate data encrypted using Always Encrypted:

Our customers use the SQL platform to store volumes of high-valued data assets for their organization. With Always Encrypted, we want to deliver additional security while ensuring complete integrity of stored user data. To achieve this, in a regular workflow SQL Server/Azure SQL Database disallows an application to insert data directly into an encrypted column. If the application has not enabled Always Encrypted in the connection string, an insert statement targeting an encrypted column will fail. To insert a value into a column protected with Always Encrypted, the application must connect to the database with Always Encrypted enabled in the connection string and pass a plaintext value of the datatype configured for the target column. Subsequently, the SQL client driver encrypts the value and sends the ciphertext to the database. This ensures plaintext data is encrypted and stored appropriately.

Read the whole thing.

Comments closed

Populating Lookup Query Using A Variable

Meagan Longoria shows us how to use a variable to populate a lookup query in SSIS:

I already had my data flow populated with the lookup for MSA. I set it to full cache and entered a query in the connection to initially populate the fields that would be returned (simply my lookup query without the where clause).

Next, I opened the data flow properties, located Expressions and clicked on the ellipses to open the Property Expression  Editor. I found the SQLCommand property for my MSA lookup and set it to my package variable that contained my query.

I had issues in the past with full cached lookups and variables.  Fortunately, you can get around a lot of problems with expressions.

Comments closed

Type 6 Dimensions With BIML

Meagan Longoria shows us type 6 dimensions with BIML:

In my previous post, I provided the design pattern and BIML for a pure Type 2 Slowly Changing Dimension (SCD). When I say “pure Type 2 SCD”, I mean an ETL process that adds a new row for a change in any field in the dimension and never updates a dimension attribute without creating a new row.  In practice, I tend to create more hybrid Type 2 SCDs where updates to some attributes require a new row and others update the value on the existing rows. A similar pattern that I find I implement more often than a pure Type 2 is a Type 6 SCD. A Type 6 SCD builds on the Type 2 technique by adding current attributes alongside the historical attributes so related measures can be grouped by the historical or current dimension attribute values. The only difference between what I call a hybrid Type 2 and a Type 6 is that in the Type 6, there are no Type 1 attributes in the dimension that do not also have a Type 2 version in the dimension to capture the historical values.

Dear Mr. President:  there are too many types these days.  Please eliminate three.  I am NOT a crackpot.

Comments closed

Collecting ETL Metrics

Andy Leonard has a long and useful post on collecting ETL metrics in SQL Server 2016:

“In an age of the SSIS Catalog, why would one ever employ this kind of metadata collection, Andy?” That’s a fair question. The SSIS Catalog is an awesome data integration execution, logging, and externalization engine. There are a handful of use cases, though, where enterprises may opt to continue to execute SSIS packages from the file system or the MSDB database. Perhaps the biggest reason to do so is that’s the way the enterprise is currently executing SSIS. When SSDT-BI converts pre-Catalog-era (2005, 2008, 2008 R2) SSIS packages to current, it imports these packages in a “Package Deployment Model” SSIS Project. This allows developers to upgrade the version of their SSIS project to SSIS 2016 (and enjoy many benefits for so doing) while continuing to execute SSIS packages in the file system. Kudos to the Microsoft SSIS Development Team for this backwards compatibility!

Andy asks the question I wanted to ask and gives a good answer.

Comments closed

MergeUi Update

Ed Elliott has a new update to his MergeUi tool:

The way MergeUi used to work was that it enumerated the schemas and tables in a project and let you create a merge statement in the post-deploy script. The problem with that is that you may want to have different versions of the table for different environments or you may want to put the merge in a different script and either reference it using :r imports or not reference it at all.

The new way it works is that instead of enumerating tables it now enumerates script files (pre, post, included, not included etc) and lets you choose which table to add whether or not the table has been added before.

I’ve not used this tool before, but it’s good to know that it’s available via Github.

Comments closed