I prefer to passively gather performance metrics – even if it’s a little convoluted and more work (for me). I don’t often need these metrics immediately available, so I execute queries similar to the query below and store the performance metrics in a table. I can (and do) build dashboards to track SSIS performance (perhaps I should blog about them…) using this passively-collected performance metrics.
As with Lookup Transformation messages, OLE DB Destinations in Data Flow Tasks record specific messages using the same format every time. Because of this (hard-coded) consistency, you and I can passively collect the number of rows written while executing packages in the SSIS Catalog using the (default) Basic logging level. We can use the following Transact-SQL query to collect this execution metadata post-execution:
Click through for the script.
As Jen points out earlier in her Analytics Market Commoditization and Consolidation post (you should read it all – it’s awesome – like all of Jen’s posts!) many analytics solution providers share the “Same look, same marketing story, same saves time and allows users [to] avoid evil IT.”
I can hear some of you thinking, “Are you telling us analytics doesn’t work, Andy?” Goodness no. I’m telling you hype and sales strategy work in the analytics market as well as anywhere. When asked why a solution may not perform to expectations, the #1 response is “your data is not clean.”
Data engineering (think ETL specifically designed for analytics and “big data”) is the backbone behind data science. To Andy’s point, the data engineer’s job is to get clean, context-heavy data in front of a data scientist, the same way a “classical” Business Intelligence specialist works with analysts.
Customers coordinates: a flat file containing x,y coordinates for every customer.
Municipalities in Austria: a shape file with multi-polygons defining the municipalities areas in Austria: source
The goal was to “look-up” the coordinates in the shape file in order to get the municipality code GKZ which in german stand for “Gemeindekennzahl”.
Check out the demo.
Sometimes, you need to backup your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in a permanent storage and want to reconstitute it later like orange juice and add it back into the flow or restart it.
This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).
Read on for an explanation of how FlowFile streams can do this.
But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).
Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.
Read on for the full details on this workaround. Also, vote on this feedback item if you want the ability to move files instead of just copying them.
Azure SQL Database enables you to directly load files stored on Azure Blob Storage using the BULK INSERT T-SQL command and OPENROWSET function.
Loading content of files form Azure Blob Storage account into a table in SQL Database is now single command
Click through for the details.
I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.
But this relies on the data being in a tabular format (like a dataframe or database table).
A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet ( over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).
The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.
This is a good look at how R can be about more than “just” statistical analysis.
The Power BI team has recently released an enhanced “combine binaries” experience as part of November 2016 update to Power BI Desktop. (Jargon Alert: “Combine Binaries” is a scary term. Instead it should be named “Magically combine multiple files together into one table and make me SUPER happy.”) The improved experience can drastically help you to import multiple Excel or other files from a folder and avoid writing advanced query functions. But today we will focus on a specific scenario, which is so common that it deserves this special post – Handling CSV files.
In fact, today’s blog post is actually the first post in “The CSV Series”. I hope you will enjoy it. To celebrate the November update of Power BI Desktop, we will review the improved experience, and will walk you through one of the most common scenarios that is now so easy to implement – Importing multiple CSV files from a folder, including parts of their filenames.
This looks very useful.
Now if this table is paritioned you’d use SWITCH and bring in a new partition.
For those that don’t know, when a table is partitioned, you can create a new empty partition, and a new empty table, load the table, make the table exactly match the partition (structure, check constraints, & indexes for example) and you can SWITCH it in. The SWITCH part is a metadata operation and is fast!
Read on for the details. The upshot is that you can take your time loading the second table and once you’re ready to swap out, it’s a quick metadata change. That’s really useful for ETL scenarios.
In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flatfiles don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns as well as a list of primary keys (we’ll need the latter for a later step :)):
Like most Biml-related things, it’s not that many lines of code, so check it out.