Press "Enter" to skip to content

September 26, 2016

Against Visual Programming Languages

Ian Hellstrom has a critique of visual programming languages for data engineers:

Anyone with a software development background who has ever dealt with visual ETL tools may have marvelled at the lack of proper version control and diff tools that go with it. Some tools come with their own built-in VCS, while others allow you to use any or no VCS at all. The difficulty lies in the fact that the visual representation is often stored as an XML (or JSON) file. So, if a box is moved by 1 pixel, the file is different. You could argue that it’s indeed different because the layout is different, but you could equally make the case that the logic has not changed. This argument is moot though: it is technically possible to ensure that the tool auto-aligns blocks and routes/colours arrows, very much like yEd does (via menu items). Some users may not be happy with the reduced control over the way the flow looks, but others may rejoice that version control has become usable.

ETL (and ORM) tools often auto-generate code that is not particularly tuned for the data source in question. I have encountered many odd nested loops where simple hash joins would have been more appropriate if only the predicates had been pushed down properly (and if only the tool had evaluated blocks lazily). Aggregations and timestamp-based filters are also often a cause for performance issues. Again, performance is technically solvable, so this may be a valid argument against visual tools in data engineering now but perhaps not tomorrow.

These are good arguments against VPLs, although there are also points in their favor, including how much easier it is to see whether the overall architecture of a flow looks correct.  In the end, I like the compromise that Biml offers Integration Services developers:  write code but visualize results.
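On the predicate-pushdown point above, here is a contrived T-SQL sketch (all table and column names invented) of the query shape a hand-tuned version takes: the filter is applied to the dimension before the join, so a hash join with a small build side becomes the natural plan.

```sql
-- Contrived illustration: filter the dimension first, then join.
-- With the predicate pushed below the join, the optimizer can build
-- a small hash table from DimDate rather than looping per fact row.
SELECT f.OrderID, f.SalesAmount
FROM dbo.FactOrders AS f
INNER JOIN
(
    SELECT DateKey
    FROM dbo.DimDate
    WHERE CalendarYear = 2016  -- predicate pushed down
) AS d
    ON d.DateKey = f.OrderDateKey;
```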

Comparing Impala To Redshift

Mostafa Mokhtar, et al., have a comparison of Apache Impala to Amazon Redshift:

For this analysis, we used TPC-DS on a 3TB dataset and selected 70 out of the 99 queries that run without any modifications or use variants on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)

This is coming from one of the two vendors, so take it with however many grains of salt you’d like.

SSISDB Management

Andy Leonard describes steps you can take to maintain the SSIS Catalog:

Back It Up

As with all SQL Server databases, please back up SSISDB. What follows is a (very) basic guide describing one simple method to back up your SSISDB database. Please, please, please learn more about SQL Server backup and restore options and their implications before backing up an SSISDB database in your enterprise. Feel free to use the steps I describe on your laptop or a virtual machine. And please remember…

Backups are useless. Restores are priceless. Conduct practice Disaster Recovery exercises in which you restore databases and then test functionality. You’ll be glad you did. Here is a link containing Microsoft’s advice on restoring the SSISDB database in SQL Server 2016.

The advice is pretty similar to what you’d expect for any other database, but there are a couple of twists around SSISDB functionality, so do read on.
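For flavor, here is a minimal T-SQL sketch of the basic idea (the paths and password are placeholders, and this is no substitute for Andy’s fuller advice):

```sql
-- Back up SSISDB like any other user database.
BACKUP DATABASE SSISDB
    TO DISK = N'X:\Backups\SSISDB.bak'  -- placeholder path
    WITH CHECKSUM, INIT;

-- SSISDB's contents are encrypted, so back up its database master key
-- too; you will need it (or its password) to restore on another server.
USE SSISDB;
BACKUP MASTER KEY TO FILE = N'X:\Backups\SSISDB_master_key'  -- placeholder path
    ENCRYPTION BY PASSWORD = N'PlaceholderPassword1!';
```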

Running PowerShell In VS Code

Max Trinidad finishes his series on PowerShell in Visual Studio Code.  Here’s part 2:

VS Code – Code Runner Extensions

We need to proceed to install the “Code Runner” extension. Take a look at this extension’s information; it can be used with many other scripting languages.

And here’s part 3:

VS Code – Terminal session

In Windows, we are configuring the VS Code “Integrated Terminal” to use the PowerShell console instead of the Windows Cmd shell or Linux Bash. Then again, this is a quick change in the user “settings.json” in your script working folder.

It’s still amazing how far Microsoft has come from the Ballmer days.

Automating Data Warehouse Testing

Koos van Strien discusses warehouse testing:

Case: we’ve integrated two sources of customers. We want to add a third source.

Q: How do we know that our current integration and solutions will continue to work while at the same time integrating the new source?

A: Test it.

Q: How do we get faster deployments and more stability?

A: Automate the tests, so they can run continuously.

This is an interesting concept; do read the whole thing.
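To make that concrete, here is a hypothetical T-SQL assertion (invented table and column names) of the kind you could schedule to run after every load once the third source is in:

```sql
-- Hypothetical regression test: each customer business key should still
-- resolve to exactly one integrated row after adding the new source.
IF EXISTS
(
    SELECT CustomerBusinessKey
    FROM dbo.DimCustomer
    GROUP BY CustomerBusinessKey
    HAVING COUNT(*) > 1
)
    THROW 50001, N'Duplicate business keys found in DimCustomer.', 1;
```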

Storing R Graphs In FileTable

Tomaz Kastrun shows how to save R plots in a SQL Server FileTable:

FileTable has been around now for quite some time and it is useful for storing files, documents, pictures, and binary files in a designated SQL Server table – FileTable. The best part of FileTable is the fact that one can access it from Windows or other applications as if the files were stored on the file system (because they are), without making any other changes on the client.

And this feature is absolutely handy for using and storing outputs from Microsoft R Server. In this blog post I will focus mainly on persistently storing charts from statistical analysis.

I can see this being quite useful for things like automatically sampling data for quality control.
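As a rough sketch of the idea (not Tomaz’s exact code; the UNC path, table, and column names are placeholders), with SQL Server R Services you can have R write the chart straight into the FileTable’s Windows share:

```sql
-- Generate a chart with in-database R and write the .png into a
-- FileTable directory, where it is immediately visible as a normal
-- Windows file. All names and the UNC path are placeholders.
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        png(filename = "\\\\MySqlServer\\mssqlserver\\ChartDB\\Charts\\sales_hist.png");
        hist(InputDataSet$SalesAmount, main = "Sales Amount");
        dev.off();',
    @input_data_1 = N'SELECT SalesAmount FROM dbo.FactSales;';
```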

Indexing Foreign Keys

Kendra Little answers a question about indexing foreign key constraints:

We recently had a SQL Server performance assessment. It remarked on two things that made me think:

1) tables with foreign keys but missing supporting indexes
2) tables with indexes that are not used

But in our case, the remark in Case 2 was on an index that supports a foreign key!

Curtain down!

Now to my question, in which cases should you as a rule create indexes to support a foreign key?

I think Kendra lays out the pros and cons well.  I’d lean against automatically creating these indexes because I’ve worked with scenarios in which you don’t drive from the Parent to the Child; instead, you drive from Child to Parent (or from Something to Child to Parent), and those indexes go unused.
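For reference, the “supporting index” in question is simply a nonclustered index on the child table’s referencing column(s); a hypothetical example:

```sql
-- Hypothetical supporting index for a foreign key from OrderLines to
-- Orders. It speeds up Parent -> Child joins and makes deletes/updates
-- on the parent cheap to validate; if your queries only ever navigate
-- Child -> Parent, it may sit unused.
CREATE NONCLUSTERED INDEX IX_OrderLines_OrderID
    ON dbo.OrderLines (OrderID);
```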

Creating Snippets With PowerShell

Allison Tharp has a script to create SSMS snippets:

One issue with SSMS’s Snippet Manager is that the code has to be stored in a .Snippet file in a specific location on your local machine.  The .Snippet files are in XML format, which is a little inconvenient since we don’t write SQL in XML format.  This means that in order to add a new snippet, you need to create a new .Snippet file in the correct format, add the SQL to it, and save it in the right location.  After that, you must import the file through SSMS.

I have had ideas for snippets, but it is enough of an inconvenience to sometimes prevent me from going through the motions of making an actual snippet file.  I wanted to write some kind of code that would allow me to save a SQL file into a specific folder and let an automated program convert it into the Snippet format.  I’ve also been wanting to learn PowerShell, so I thought this would be a good opportunity!

The next step would be turning that into a cmdlet.

Power BI Gateway Troubleshooting

Adam Saxton has four tips to assist with troubleshooting On-Premises Data Gateway issues within Power BI:

We know that the problem is remote applications being unable to connect to SQL Server. I want to simplify this and get the gateway out of the picture.

A lot of times we can become fixated on the app we are looking at and ignore other possibilities. You may know that local works and remote doesn’t, and start looking at why the gateway specifically isn’t working. Maybe you start to think something on the Power BI side is broken, or maybe we can’t talk to the Azure Service Bus. This is why I like to remove the gateway from the picture. Or, if you had a custom application, let’s remove that from the picture. Simplify as best you can. This helps to exclude a lot of potential complexity.

How do we do that? You can use the same test that you used locally! Use something like a UDL file or another tool to see if it can connect. Because it is SQL Server, I’d really like to try Management Studio. Management Studio uses .NET as does the gateway. To start, it doesn’t necessarily have to be the same machine as the gateway. If you have a different machine that has Management Studio installed, use that! If that works, then we may want to do the test specifically from the gateway machine.

This is a good troubleshooting guide, dealing with some of the more likely causes before moving on to the esoteric issues.

Automating Deadlock Execution Plan Collection

Michael J. Swart comes up with a system to collect execution plans at the time of a deadlock and log them to a table for further research:

So How Do I Get To The Execution Plans?
So when I look at a deadlock graph, I can see there are sql_handles. Given that, I can grab the plan_handle and then the query plan from the cache, but I’m going to need to collect it automatically at the time the deadlock is generated. So I’m going to need

  • XML shredding skills

  • Ability to navigate DMVs to get at the cached query plans

  • A way to programmatically respond to deadlock graph events (like an XE handler or a trigger)

If you don’t have the funding to get a third-party tool in place which collects this information, this could be a good fit.
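The DMV part of that list is the easy half. A minimal sketch, assuming you have already shredded a sql_handle out of the deadlock graph XML:

```sql
-- Given a sql_handle from a deadlock graph, pull the cached plan(s).
DECLARE @sql_handle varbinary(64) = 0x0;  -- replace with the real handle

SELECT qs.plan_handle, qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qs.sql_handle = @sql_handle;
```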
