How often do you need to play audio while you’re compiling your Biml packages? Never? Really? Huh, just me then. Very well, chalk this blog post up as one to show you that you really can do *anything* in Biml that you can do in C#.
When I first learned how to play audio in .NET, I hooked the Windows Media Player DLL and used that. The first thing I then did was create an SSIS package with a script task that played the A-Team theme song while it ran. That was useless but a fun demo. Fast forward to using Biml and I could not for the life of me get the Windows Media Player to correctly embed in a Biml Script Task. I suspect it’s something to do with the COM bindings that Biml doesn’t yet support. Does this mean you shouldn’t use Biml? Hell no. It just means I’ve wandered far into a corner case that doesn’t yet have support.
Read on because it will make you a better person.
This post uses objects and annotations from our previous post “Export to Flatfiles with Biml”. Please use the code from that post as a prerequisite.
In the previous post, we exported the whole database to flatfiles with one file per table. But what if we want to split large tables into multiple files? One easy way to do that would be to retrieve the data in pages using the OFFSET and FETCH NEXT clauses in SQL Server.
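To make the paging idea concrete, here is a minimal sketch in Python (not the Biml from the original post) that builds one query per output file. The function name, its parameters, and the sample table and key columns are assumptions chosen for illustration. It also assumes the row count is known up front and that the ordering columns form a unique key, since OFFSET and FETCH only work with an ORDER BY and only give stable pages when the sort is deterministic.

```python
def paged_queries(table, order_by, total_rows, rows_per_file):
    """Yield (file_index, sql) pairs, one per output file.

    Hypothetical helper: names and parameters are illustrative only.
    """
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    # OFFSET ... FETCH requires ORDER BY; a unique key keeps pages stable.
    order_clause = ", ".join(order_by)
    for index, offset in enumerate(range(0, total_rows, rows_per_file), start=1):
        sql = (
            f"SELECT * FROM {table} "
            f"ORDER BY {order_clause} "
            f"OFFSET {offset} ROWS FETCH NEXT {rows_per_file} ROWS ONLY"
        )
        yield index, sql


if __name__ == "__main__":
    # Illustrative values: 25 rows split into files of 10 rows each.
    keys = ["SalesOrderNumber", "SalesOrderLineNumber"]
    for file_index, sql in paged_queries("dbo.FactInternetSales", keys, 25, 10):
        print(file_index, sql)
```

Each generated statement could then feed its own data flow and flat file destination. Ordering by the primary key is one reasonable choice here, not necessarily the one the original post makes.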
Read on for more.
In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flatfiles don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns as well as a list of primary keys (we’ll need the latter for a later step :)):
Like most Biml-related things, it’s not that many lines of code, so check it out.
For each member of that collection, we follow some simple rules:
– Our table’s original name is the name of the table in the staging area without our connection name prefix
– If our table name still includes an underscore, we will split the name and assign the table name and schema name respectively. Otherwise, our schema will be DBO.
– Create a DELETE statement towards our metadata store
– Create an INSERT statement towards our metadata store
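The rules above can be sketched roughly as follows. This is a Python illustration, not the original Biml. It assumes staging tables are named `<connection>_<schema>_<table>`, that the schema is the part before the first remaining underscore, and that the metadata store is a table called `meta.Tables`. That table name and its column names are hypothetical.

```python
def split_staging_name(staging_name, connection_name):
    """Return (schema, table) for a staging table name."""
    prefix = connection_name + "_"
    # Rule 1: strip the connection name prefix to get the original name.
    if staging_name.startswith(prefix):
        original = staging_name[len(prefix):]
    else:
        original = staging_name
    # Rule 2: a remaining underscore separates schema and table
    # (assumed order: schema first); otherwise default to dbo.
    if "_" in original:
        schema, table = original.split("_", 1)
    else:
        schema, table = "dbo", original
    return schema, table


def quote(value):
    """Render a T-SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def metadata_statements(staging_name, connection_name):
    """Build the DELETE and INSERT for the (hypothetical) meta.Tables store."""
    schema, table = split_staging_name(staging_name, connection_name)
    values = ", ".join(quote(v) for v in (connection_name, schema, table))
    delete_sql = (
        "DELETE FROM meta.Tables "
        f"WHERE ConnectionName = {quote(connection_name)} "
        f"AND SchemaName = {quote(schema)} AND TableName = {quote(table)}"
    )
    insert_sql = (
        "INSERT INTO meta.Tables (ConnectionName, SchemaName, TableName) "
        f"VALUES ({values})"
    )
    return delete_sql, insert_sql
```

Running the DELETE before the INSERT keeps the process repeatable, which matters if the "one-time" load ends up being run again.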
Admittedly, I would have seen this as a one-time process and would have just written some scripts against sys.tables and sys.columns to generate this metadata, but “one-time processes” tend to happen over and over.
This little piece of Biml will check all your tables for indices sharing the same columns.
It does not generate any SSIS tasks, but it might be a good starting point to build your own Index-Monitoring or Index-Clinic, because Biml is NOT just for SSIS.
Depending upon your definition of a duplicate index, this might generate false positives. Regardless, it’s a nice way of showing that Biml is about more than SSIS.
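As a rough illustration of how the definition of "duplicate" changes the result, here is a small Python sketch (not the original Biml). It assumes index metadata is already available as `(table, index_name, columns)` tuples. The function name and the `ignore_order` switch are inventions for this example.

```python
from collections import defaultdict


def find_duplicate_indexes(indexes, ignore_order=False):
    """Group indexes on the same table that cover the same columns.

    indexes: iterable of (table, index_name, columns) tuples.
    ignore_order: when True, (a, b) and (b, a) count as the same column set,
    which is a looser definition and will flag more candidates.
    """
    groups = defaultdict(list)
    for table, index_name, columns in indexes:
        key_columns = tuple(sorted(columns)) if ignore_order else tuple(columns)
        groups[(table, key_columns)].append(index_name)
    # Only groups with more than one index are potential duplicates.
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    sample = [
        ("dbo.Orders", "IX_A", ("CustomerId", "OrderDate")),
        ("dbo.Orders", "IX_B", ("CustomerId", "OrderDate")),
        ("dbo.Orders", "IX_C", ("OrderDate", "CustomerId")),
    ]
    print(find_duplicate_indexes(sample))
    print(find_duplicate_indexes(sample, ignore_order=True))
```

With strict ordering only `IX_A` and `IX_B` are grouped. With `ignore_order=True`, `IX_C` joins them, even though an index with a different key order can serve different queries. That is one way a false positive can arise.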
Using tooling is always a trade-off between time/frustration and monetary cost. BIDS Helper/BimlExpress are free, so you’re prioritizing cost over everything else. And that’s ok, there’s no judgement here. I know what it’s like to be in places where you can’t buy the tools you really need. One of the hard parts about debugging the expanded Biml from BimlScript is that you can’t see the intermediate or flat Biml. You’ve got your metadata, Biml and BimlScript, and a lot of imagination to think through how the code is being generated and where it might be going wrong. That’s tough. Even at this point, where I’ve been working with it for four years, I can still spend hours trying to track down just where the heck things went wrong. Spoiler alert: it’s the metadata, it’s always the metadata (except when it’s not). I end up with NULLs where I don’t expect them, or some goofball put the wrong values in a field. But how can you get to a place where you can see the result? That’s what this post is about.
It’s a trivial bit of code but it’s important. You need to add a single Biml file to your project and whenever you want to see the expanded Biml, prior to it being translated into SSIS packages, right click on the file and you’ll get all that Biml dumped to a file. This recipe calls for N steps.
This is a good tip and has helped me a few times in the past.
As I recently got asked for it in a talk, this piece of code gives you all the Views in a database that are currently broken.
This could be useful for “what if”-scenarios when playing with your metadata.
Click through for the code. This is another in Ben’s enjoyable ongoing series of non-ETL things you can do with Biml.
There is no attribute in the Connections collection to assign a GUID. It’s simply not there. If you want to associate an Id with an instance of a Connection, your choices are the Project node and the Package node. Since we’re dealing with project level connection managers, we’d best cover both bases to ensure Ids synchronize across our project. If you wish, you could have embedded this Projects node in with the Connections, but then you’d have to statically set these Ids. I feel like showing off so we’ll go dynamic.
To start, I define a list of static GUID values at the beginning of my file. Realistically, we have these values in a table and we didn’t go with “known” values. The important thing is that we will always map a GUID to a named connection manager. If you change a connection manager’s definition from being project level to non, or vice versa, this will result in the IDs shifting and you’ll see the same symptoms as above.
There’s plenty of code over on Bill’s site to help you as well.
Ben Weissman has a two-part series on loading a set of tables based on foreign key constraints. Part 1 is linear loads:
All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real-life scenarios, your data warehouse may actually have tables referring to each other through such constraints, meaning that it is crucial to create and populate them in the right order.
In this blog post, we’ll actually solve 2 issues at once: we’ll provide a list of tables, then identify any tables that our listed tables rely on (recursively), and then create and load them all in the right order.
In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales table as well as all tables it is connected to through foreign key constraints. We will create all these tables, including the constraints, in a new database AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations) and finally populate them with data.
After the first excitement about how easy it actually was to take care of that topology, you might ask yourself: Why does it have to run linearly? That takes way too long. And you’re right – and it doesn’t have to.
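The two steps described here, collecting every table the requested tables depend on and then ordering them so referenced tables come first, can be sketched in Python with the standard library's `graphlib` (Python 3.9+). This is not the Biml from the original post. The function names and the shape of the `references` dictionary (table name mapped to the set of tables it references) are assumptions, and the table names in the usage example are purely illustrative.

```python
from graphlib import TopologicalSorter


def required_tables(requested, references):
    """Return the requested tables plus everything they reference, recursively."""
    needed = set()
    stack = list(requested)
    while stack:
        table = stack.pop()
        if table in needed:
            continue
        needed.add(table)
        stack.extend(references.get(table, ()))
    return needed


def load_order(requested, references):
    """Order the required tables so each one comes after the tables it references."""
    needed = required_tables(requested, references)
    # Drop self-references: a table pointing at itself does not affect ordering.
    graph = {t: set(references.get(t, ())) - {t} for t in needed}
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields referenced tables first. Cycles raise CycleError.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Illustrative foreign key map, not read from a real database.
    fks = {
        "FactSales": {"DimCustomer", "DimDate"},
        "DimCustomer": {"DimGeography"},
        "DimGeography": set(),
        "DimDate": set(),
        "DimUnrelated": set(),
    }
    print(load_order(["FactSales"], fks))
```

Tables that nothing in the request depends on, such as `DimUnrelated` above, never enter the result, which matches the idea of transferring only one fact table and its dependencies.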
All we need to do is:
– Create a list of all the tables that we’ve already loaded (which will be empty at that point)
– Identify all tables that do not reference any other tables
– Load these tables, each followed by all tables that only reference this single table (recursively), and add them to the list of loaded tables
– Once that is done, load all tables that are referencing multiple tables where all required tables have been loaded before – and again, add them to the list
– Repeat this until no table is left to load (or for a maximum of 10 times in this example)
– If, for whatever reason, any tables are left, load them sequentially using the TopoSort function:
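A simplified Python sketch of the idea behind these steps follows. It is not the original Biml, and it deliberately swaps in a coarser technique: instead of chaining single-parent tables behind their parent, it groups tables into "waves", where every table in a wave only references tables from earlier waves and could therefore be loaded in parallel. The function name, the `references` dictionary shape, and the returned structure are assumptions. Every referenced table is expected to appear as a key.

```python
def plan_waves(references, max_rounds=10):
    """Group tables into parallel load waves.

    references: dict mapping each table to the set of tables it references.
    Returns (waves, leftovers). Leftovers are tables that could not be
    scheduled within max_rounds, for example because of a reference cycle;
    the original post loads those sequentially instead.
    """
    loaded = set()  # tables scheduled so far (empty at the start)
    waves = []
    # Ignore self-references, they do not constrain the order.
    remaining = {t: set(refs) - {t} for t, refs in references.items()}
    for _ in range(max_rounds):
        # A table is ready once everything it references has been loaded.
        ready = sorted(t for t, refs in remaining.items() if refs <= loaded)
        if not ready:
            break
        waves.append(ready)
        loaded.update(ready)
        for table in ready:
            del remaining[table]
    return waves, sorted(remaining)


if __name__ == "__main__":
    fks = {"A": set(), "B": set(), "C": {"A"}, "D": {"A", "B"}, "E": {"C", "D"}}
    waves, leftovers = plan_waves(fks)
    for number, wave in enumerate(waves, start=1):
        print(f"wave {number}: {wave}")
    print("leftovers:", leftovers)
```

The first wave holds the tables that reference nothing, and the round limit plays the same role as the maximum of 10 repetitions in the list above.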
This is a very interesting way of using Biml to traverse the foreign key tree. I’ve normally used recursive CTEs in T-SQL to do the same, but I’ll have to play around with this method.
Anyone with a software development background who has ever dealt with visual ETL tools may have marvelled at the lack of proper version control and diff tools that go with it. Some tools come with their own built-in VCS, while others allow you to use any or no VCS at all. The difficulty lies in the fact that the visual representation is often stored as an XML (or JSON) file. So, if a box is moved by 1 pixel, the file is different. You could argue that it’s indeed different because the layout is different, but you could equally make the case that the logic has not changed. This argument is moot though: it is technically possible to ensure that the tool auto-aligns blocks and routes/colours arrows, very much like yEd does (via menu items). Some users may not be happy with the reduced control over the way the flow looks, but others may rejoice that version control has become usable.
ETL (and ORM) tools often auto-generate code that is not particularly tuned for the data source in question. I have encountered many odd nested loops where simple hash joins would have been more appropriate if only the predicates had been pushed down properly (and if only the tool had evaluated blocks lazily). Aggregations and timestamp-based filters are also often a cause for performance issues. Again, performance is technically solvable, so this may be a valid argument against visual tools in data engineering now but perhaps not tomorrow.
This is a good argument against VPLs, although there are a couple of good arguments for VPLs, including how it’s easier to see if the overall architecture of a flow looks correct. In the end, I like the compromise that Biml offers Integration Services developers: write code but visualize results.