In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flat files don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns, and a list of primary keys (we’ll need the latter for a later step :)):
Like most Biml-related things, it’s not that many lines of code, so check it out.
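If you just want the gist of the logic, here’s my own rough approximation in Python rather than the post’s BimlScript – the connection string, the excluded-type list, and the output structure are all placeholders, so treat it as a sketch of the idea rather than the actual code:

```python
# Sketch only: one flat-file definition per table, every column typed as a max-length
# string, with primary keys captured for a later step. Connection string is a placeholder.
import pyodbc

EXCLUDED_TYPES = {"binary", "varbinary", "image", "sql_variant"}  # rough stand-in for Binary/Object

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=SourceDb;Trusted_Connection=yes")
cur = conn.cursor()

# Columns per table, skipping types we can't sensibly dump to a flat file
cur.execute("""
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION
""")
formats = {}
for schema, table, column, data_type in cur.fetchall():
    if data_type.lower() in EXCLUDED_TYPES:
        continue
    fmt = formats.setdefault((schema, table), {"columns": [], "primary_keys": []})
    # Flat files don't care about types, so everything becomes a maximum-length string
    fmt["columns"].append({"name": column, "type": "String", "length": -1})

# Primary key columns go into the per-table annotation for later use
cur.execute("""
    SELECT tc.TABLE_SCHEMA, tc.TABLE_NAME, kcu.COLUMN_NAME
    FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
    JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
    WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
""")
for schema, table, column in cur.fetchall():
    if (schema, table) in formats:
        formats[(schema, table)]["primary_keys"].append(column)

for (schema, table), fmt in formats.items():
    print(f"FlatFileFormat for {schema}.{table}: {len(fmt['columns'])} columns, PK = {fmt['primary_keys']}")
```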
The bulk-load tools have been in the product for a long time and they are showing their age. When they work for you, they are powerful. But you need to understand that these tools are binary at heart, and they have no built-in rule that says that each line in a file is a record – they don’t even think in lines. You also need to understand that there are file formats they are not able to handle.
I have tried to arrange the material in this article so that if you have a simple problem, you only need to read the first two chapters after the introduction. I first introduce you to their mindset, which is likely to be different from yours. Next I cover the basic options to use for everyday work. If you have a more complex file, you will need to use a format file, and the next three chapters are for you. I first describe how format files work as such, and the next two chapters show how to use format files for common cases for import and export respectively. This is followed by a chapter about Unicode files, including files encoded in UTF‑8. Then comes a chapter about “advanced” options, including how to load explicit values into an IDENTITY column. A short chapter covers permissions. The last chapter discusses XML format files, and I am not sorry at all if you give this chapter a blind eye – I find XML format files to be of dubious value.
I haven’t had a chance to read this yet, but because I have never had good luck with bcp and BULK INSERT, it’s on my to-read list.
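The “they don’t think in lines” point is the one that trips people up: you tell bcp what the field and row terminators are, and it takes you literally. As a quick illustration (server, table, and file path are placeholders, and this is just one way to drive it), shelling out to bcp with explicit terminators looks like this:

```python
# Minimal sketch: invoking bcp with explicit field and row terminators.
# Server, database, table, and file path below are placeholders.
import subprocess

cmd = [
    "bcp", "SalesDb.dbo.Orders", "in", r"C:\loads\orders.dat",
    "-S", "localhost",     # server
    "-T",                  # trusted (Windows) authentication
    "-c",                  # character mode
    "-t", "|",             # field terminator: stated, not assumed
    "-r", r"\r\n",         # row terminator: bcp has no built-in notion of "a line is a record"
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
```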
Ben Weissman has a two-part series on loading a set of tables based on foreign key constraints. Part 1 is linear loads:
All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real-life scenarios, your data warehouse may actually have tables referring to each other through such constraints, meaning that it is crucial to create and populate them in the right order.
In this blog post, we’ll actually solve two issues at once: we’ll provide a list of tables, then identify any tables that our listed tables rely on (recursively), and then create and load them in the right order.
In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales table as well as all tables it is connected to through foreign key constraints. We will create all these tables, including the constraints, in a new database, AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations), and eventually populate them with data.
After the first excitement about how easy it actually was to take care of that topology, you might ask yourself: why does it have to run linearly? That takes way too long. And you’re right – it doesn’t have to.
All we need to do is:
– Create a list of all the tables that we’ve already loaded (which will be empty at that point)
– Identify all tables that do not reference any other tables
– Load these tables, each followed by all tables that only reference this single table – recursively – and add them to the list of loaded tables
– Once that is done, load all tables that are referencing multiple tables where all required tables have been loaded before – and again, add them to the list
– Repeat this until no table is left to load (or for a maximum of 10 times in this example)
– If, for whatever reason, any tables are left, load them sequentially using the TopoSort function:
This is a very interesting way of using Biml to traverse the foreign key tree. I’ve normally used recursive CTEs in T-SQL to do the same, but I’ll have to play around with this method.
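To make the approach concrete, the dependency-ordered load is essentially a topological sort over the foreign key graph, with a pass limit and a sequential fallback. Here’s my own rough Python sketch of those steps (the dependency map and load_table function are made-up stand-ins, not the Biml from the post):

```python
# Rough sketch of loading tables in foreign-key dependency order.
# deps maps each table to the set of tables it references; load_table is a stand-in.

def load_table(table):
    print(f"loading {table}")  # placeholder for the real create-and-populate step

deps = {
    "DimDate": set(),
    "DimGeography": set(),
    "DimCustomer": {"DimGeography"},
    "FactInternetSales": {"DimDate", "DimCustomer"},
}

def topo_load(deps):
    loaded = set()                  # tables already loaded (empty at the start)
    remaining = dict(deps)
    for _ in range(10):             # cap the number of passes, as in the post
        ready = [t for t, d in remaining.items() if d <= loaded]
        if not ready:
            break
        for table in ready:         # tables within a pass could be loaded in parallel
            load_table(table)
            loaded.add(table)
            del remaining[table]
    for table in remaining:         # whatever is left falls back to a sequential load
        load_table(table)
        loaded.add(table)
    return loaded

topo_load(deps)
```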
The main things to configure are the DBCPConnection Pool and Maximum-value Columns.
Please choose this to be the date-time stamp column that can serve as a cumulative change-management column.
This is the only limitation with this processor: it is not true CDC and relies on one column. If data is reloaded into the table with older values in that column, it will not be replicated into HDFS or any other destination.
This processor does not rely on transaction logs or redo logs the way Attunity or Oracle GoldenGate do. For a complete CDC solution, please use Attunity or Oracle GoldenGate.
That last paragraph in the snippet is key: it’s not a true replacement for CDC-friendly products. It is, however, a good example for showing how to use NiFi to connect to a relational database and pump data out of it.
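The limitation makes more sense once you see that the processor’s incremental pull is just a high-water mark on one column. A hypothetical Python equivalent (table, column, and connection details are invented) shows why rows that come back with older timestamps never get picked up:

```python
# Hypothetical illustration of max-value-column incremental extraction,
# similar in spirit to NiFi's QueryDatabaseTable. Connection details are placeholders.
import pyodbc

last_max_seen = None  # NiFi keeps this high-water mark in processor state

def pull_new_rows(conn):
    global last_max_seen
    cur = conn.cursor()
    if last_max_seen is None:
        cur.execute("SELECT * FROM dbo.Orders ORDER BY ModifiedDate")
    else:
        # Only rows beyond the stored maximum are fetched; rows reloaded with
        # older ModifiedDate values are silently skipped.
        cur.execute(
            "SELECT * FROM dbo.Orders WHERE ModifiedDate > ? ORDER BY ModifiedDate",
            last_max_seen,
        )
    rows = cur.fetchall()
    if rows:
        last_max_seen = max(row.ModifiedDate for row in rows)
    return rows

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=SalesDb;Trusted_Connection=yes")
for row in pull_new_rows(conn):
    pass  # ship the row to HDFS, Kafka, or wherever it needs to go
```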
Unfortunately, while developing in Data Factory I became very familiar with errors. All of the errors show up at the end and provide very little insight into what part of the process failed. Here’s an example.
Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.
This sounds like classic Microsoft error messaging: “An error occurred. Here is a code you can put into Google and hope desperately that someone has already figured out the answer. Good luck!”
Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from the RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer:
I’d call this a non-trivial but still straightforward exercise. Step 1 from the SQL Server side could be reading from transaction logs (which would be the least-intrusive), but you could also set up something like change tracking and fire off messages when important tables’ records change.
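For a sense of what the producer side looks like, here’s a minimal sketch using the kafka-python package – broker address, topic name, and the change payload are all placeholders:

```python
# Minimal Kafka producer sketch using kafka-python.
# Broker address, topic, and payload are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Imagine this record came from change tracking (or the transaction log) on the sales database
change = {"table": "Sales.Orders", "op": "INSERT", "order_id": 12345, "amount": 99.90}

producer.send("sales-transactions", value=change)
producer.flush()  # make sure the message actually leaves the client buffer
```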
There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob Storage, and for analysis purposes, it may make a lot more sense to move the data to a SQL database. Moving data around in Data Factory means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used; as long as a gateway is added for the connection, the other steps would be the same. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL data, and a linked service to connect to the SQL database.
There’s a lot of JSON ahead.
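To give a feel for the JSON involved, here’s a heavily trimmed sketch of the SQL linked service and the copy pipeline, written as Python dicts so they can be dumped to JSON – all names and connection strings are placeholders, and the real definitions in the post carry more properties:

```python
# Rough shape of the Data Factory (v1) JSON for a blob-to-SQL copy, built as Python dicts.
# Names and connection strings are placeholders, not the post's actual values.
import json

sql_linked_service = {
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<password>"
        },
    },
}

copy_pipeline = {
    "name": "BlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [{"name": "BlobInputDataset"}],    # data set describing the blob
                "outputs": [{"name": "SqlOutputDataset"}],   # data set describing the target table
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(copy_pipeline, indent=2))
```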
The Linked Service for ML is going to need some information from the web service: the URL and the API key. Chances are neither of these has been committed to memory; instead, open up Azure ML, go to Web Services, and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”; leave off the rest of the URL, “?api-version=2.0”, as copying the entire URL will cause an error. Going back to the Web Services page, the API key appears on the dashboard section of Azure ML, and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService.
Read the whole thing.
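For reference, the AzureML linked service itself ends up fairly small – roughly the shape below, shown as a Python dict. The endpoint and key are placeholders, and the property names are my recollection of the v1 schema, so check them against the documentation:

```python
# Approximate shape of the AzureML linked service described above.
# Endpoint URL and API key are placeholders copied from the ML web service pages.
import json

azure_ml_linked_service = {
    "name": "AzureMLLinkedService",
    "properties": {
        "type": "AzureML",
        "typeProperties": {
            # Batch Execution request URI, truncated after ".../jobs" as the post describes
            "mlEndpoint": "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/jobs",
            "apiKey": "<api-key-from-the-dashboard>",
        },
    },
}

print(json.dumps(azure_ml_linked_service, indent=2))
```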
In this case, the error message is quite clear: there is more than one row in the source (staging) that matches a single row in the target (data warehouse). When we are warehousing data, we set up key fields that allow us to match a record in staging to a record in the data warehouse. In most systems, you can use the source system’s primary key to accomplish this. After all, most systems use an RDBMS of some sort to store data. However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.
So why are we not using that? There is a very simple answer: end users delete old data from the list, which can lead to SharePoint recycling ID values. If an ID gets recycled, then the data warehouse will improperly overwrite data in the fact table or discard the new row as a duplicate, depending on how we configure the extract routine.
Figuring out the cause of the problem is a multi-step process, as Jesse shows.
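If you want a quick first check against your own staging data, grouping by the match key and looking for counts above one will surface recycled IDs – the table and column names here are invented for illustration:

```python
# Hypothetical check for recycled or duplicated source keys in a staging table.
# Database, table, and column names are made up for illustration.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=DW;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT ListItemID, COUNT(*) AS MatchCount
    FROM staging.SharePointList
    GROUP BY ListItemID
    HAVING COUNT(*) > 1
""")
for list_item_id, match_count in cur.fetchall():
    print(f"ListItemID {list_item_id} appears {match_count} times in staging")
```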
Because we need a connection to a database and an Azure blob, two Linked Services are required, one for each type. Prior to completing this step, create an Azure Blob Storage account by clicking on Add on All Resources. Create the second Linked Service like the first: click on New data store, then select Azure Storage. Using the template for an Azure Blob Storage linked service, I have modified it below, adding the “hubName” element, as it is required.
There’s a lot of JSON to write here, if you’re into that sort of thing.
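The storage linked service itself is short; once the “hubName” is added it is roughly this shape (shown as a Python dict, with every value a placeholder):

```python
# Approximate Azure Blob Storage linked service with the required "hubName" added.
# Account name, key, and hub name are placeholders.
blob_linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "hubName": "<data-factory-hub-name>",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```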