I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.
But this relies on the data being in a tabular format (like a dataframe or database table).
A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet ( over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).
The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.
This is a good look at how R can be about more than “just” statistical analysis.
The Power BI team has recently released an enhanced “combine binaries” experience as part of November 2016 update to Power BI Desktop. (Jargon Alert: “Combine Binaries” is a scary term. Instead it should be named “Magically combine multiple files together into one table and make me SUPER happy.”) The improved experience can drastically help you to import multiple Excel or other files from a folder and avoid writing advanced query functions. But today we will focus on a specific scenario, which is so common that it deserves this special post – Handling CSV files.
In fact, today’s blog post is actually the first post in “The CSV Series”. I hope you will enjoy it. To celebrate the November update of Power BI Desktop, we will review the improved experience, and will walk you through one of the most common scenarios that is now so easy to implement – Importing multiple CSV files from a folder, including parts of their filenames.
This looks very useful.
Now if this table is paritioned you’d use SWITCH and bring in a new partition.
For those that don’t know, when a table is partitioned, you can create a new empty partition, and a new empty table, load the table, make the table exactly match the partition (structure, check constraints, & indexes for example) and you can SWITCH it in. The SWITCH part is a metadata operation and is fast!
Read on for the details. The upshot is that you can take your time loading the second table and once you’re ready to swap out, it’s a quick metadata change. That’s really useful for ETL scenarios.
In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flatfiles don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns as well as a list of primary keys (we’ll need the latter for a later step :)):
Like most Biml-related things, it’s not that many lines of code, so check it out.
The bulk-load tools have been in the product for a long time and they are showing their age. When they work for you, they are powerful. But you need to understand that these tools are binary to their heart, and they have no built-in rule that says that each line a file is a record – they don’t even think in lines. You also need to understand that there are file formats they are not able to handle.
I have tried to arrange the material in this article so that if you have a simple problem, you only need to read the first two chapters after the introduction. I first introduce you them to their mindset, which is likely to be different from yours. Next I cover the basic options to use for every-day work. If you have a more complex file, you will need to use a format file and the next three chapters are for you. I first describe how format files work as such, and the next two chapters show how to use format files for common cases for import and export respectively. This is followed by a chapter about Unicode files, including files encoded in UTF‑8. Then comes a chapter about “advanced” options, including how to load explicit values into an IDENTITY column. A short chapter covers permissions. The last chapter discusses XML format files, and I am not sorry at all if you give this chapter a blind eye – I find XML format files to be of dubious value.
I haven’t had a chance to read this yet, but because I have never had good luck with bcp and BULK INSERT, it’s on my to-read list.
Ben Weissman has a two-part series on loading a set of tables based on foreign key constraints. Part 1 is linear loads:
All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real life scenarios, your datawarehouse may actually have tables refering to each other using such, meaning that it is crucial to create and populate them in the right order.
In this blog post, we’ll actually solve 2 issues at once: We’ll provide a list of tables, will then identify any tables that our listed tables rely on (recursively) and will then create and load them in the right order.
In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales-table as well as all tables it is connected to through foreign key constraints. Eventually, we will create all these tables including the constraints in a new database AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations) and eventually populate them with data.
After the first excitment about how easy it actually was to take care of that topology, you might ask yourself: Why does it have to run linear? That takes way too long. And you’re right – and it doesn’t have to.
All we need to do is:
– Create a list of all the tables that we’ve already loaded (which will be empty at that point)
– Identify all tables that do not reference any other tables
– Load these tables, each followed by all tables that only reference this single table – recursively and add them to list of loaded tables
– Once that is done, load all tables that are referencing multiple tables where all required tables have been loaded before – and again, add them to the list
– Repeat this until no table is left to load (or for a maximum of 10 times in this example)
– If, for whichever reason, any tables are left, load them sequentially using the TopoSort function:
This is a very interesting way of using Biml to traverse the foreign key tree. I’ve normally used recursive CTEs in T-SQL to do the same, but I’ll have to play around with this method.
The main things to configure is DBCPConnection Pool and Maximum-value Columns
Please choose this to be the date-time stamp column that could be a cumulative change-management column
This is the only limitation with this processor as it is not a true CDC and relies on one column. If the data is reloaded into the column with older data the data will not be replicated into HDFS or any other destination.
This processor does not rely on Transactional logs or redo logs like Attunity or Oracle Goldengate. For a complete solution for CDC please use Attunity or Oracle Goldengate solutions.
That last paragraph in the snippet is key: it’s not a true replacement for CDC-friendly products. It is, however, a good example for showing how to use NiFi to connect to a relational database and pump data out of it.
Unfortunately, while developing Data Factory I became very familiar with errors. All of the errors show up at the end and provide very little insight as to what in the process failed. Here’s an example.
Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.
This sounds like classic Microsoft error messages: “An error occurred. Here is a code you can put into Google and hope desperately that someone has already figured out the answer. Good luck!”
Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer
I’d call this a non-trivial but still straightforward exercise. Step 1 from the SQL Server side could be reading from transaction logs (which would be the least-intrusive), but you could also set up something like change tracking and fire off messages when important tables’ records change.
There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob storage, and for analysis purposes, it may make a lot more sense to move the data to SQL database. Moving data around in Data Factory, means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used, as long as a gateway was added for the connection, the other steps would be the same. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL Data, and a linked service to connect to the SQL database.
There’s a lot of JSON ahead.