Press "Enter" to skip to content

Curated SQL Posts

When MS_SSISServerCleanupJobLogin Is Orphaned

Sreekanth Bandarla noticed a problem in cleaning up SSIS metadata:

Couple of weeks ago I was analyzing a server for space and noticed SSISDB database was abnormally huge (this Instance was running just a handful of packages). I noticed couple of internal schema tables in SSISDB were huge (with some hundreds of millions of rows), well that’s not right. There should be SSIS Server maintenance job which SQL server creates to purge older entries based on the retention settings right? My immediate action was to check the retention period set and what’s the status of the job.  As I suspected, the job was failing (looks like this has been failing since ages) with below error.

The job failed.  The Job was invoked by Schedule 9 (SSISDB Scheduler).  The last step to run was step 1 (SSIS Server Operation Records Maintenance).
Execute as Login failed for the requested login ‘##MS_SSISServerCleanupJobLogin##’

Read on for the root cause and solution.

Comments closed

Rant: NoSQL Isn’t A Thing

Grant Fritchey is in rare form today:

Go and search through it for a NoSQL data management system. I’ll wait.

You found one that was named NoSQL didn’t you. Oracle NoSQL, because Oracle. Of course. However, under the Database Model what did it say? Document Store. Why?

BECAUSE NOSQL IS NOT A DATA STORAGE ENGINE!

There is not a NoSQL thing to use. You can’t compare NoSQL to anything because NoSQL is just a different screed, a rant, a tantrum. A bunch of influential developers threw themselves on the floor, screaming and kicking, spittle flying everywhere, because they didn’t want to eat their broccoli.

Wait, that was my kids when they were three.

As the voice of reason (and you know you’re in trouble when that’s the case!), non-relational engines take up 6 of the top 15 slots.  There are particular advantages to non-relational data stores—as Grant willingly notes—and as companies grow, specialized data storage can become quite useful.  But relational databases are a great starting point and for good reason:  they solve a subset of problems extremely well and solve most problems well enough.  This is probably also a good place to drop in a reference to Feasel’s Law.

Comments closed

Default Parameter Values In Powershell

Chrissy LeMaire explains what $PSDefaultParameterValues is and places where she finds it useful:

$PSDefaultParameterValues is a hashtable available in PowerShell that can set defaults for any command that you run. In it’s simplest form, setting a default parameter value can look like this:

After running the above code, Get-DbaDatabase will show verbose output every time it’s executed, without me having to specify -Verbose. If I need to override that verbose flag for some reason, I can simply add -Verbose:$false to my Get-DbaDatabase command.

Read on for plenty of good use cases and additional resources from sharp people.

Comments closed

Working With Kafka At Scale

Tony Mancill has some tips for working with large-scale Kafka clusters:

Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you’re operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this:

  • First, consumers of the “hot” (higher throughput) partitions will have to process more messages than other consumers in the consumer group, potentially leading to processing and networking bottlenecks.

  • Second, topic retention must be sized for the partition with the highest data rate, which can result in increased disk usage across other partitions in the topic.

  • Third, attaining an optimum balance in terms of partition leadership is more complex than simply spreading the leadership across all brokers. A “hot” partition might carry 10 times the weight of another partition in the same topic.

There’s some interesting advice in here.

Comments closed

Kafka Blindness

George Vetticaden and Houshang Livian look at a common problem with Apache Kafka installations:

Over the last 12 months, the product team has been talking to our largest Kafka customers who are using this technology to implement a diverse set of use cases. We posed to them the following question:

What are your key challenges with using Kafka in production? What do you need to be successful with this powerful technology?

The most common response was the need for better tools to monitor and manage Kafka in production. Specifically, users wanted better visibility in understanding what is going on in the cluster across the four key entities with Kafka: producers, topics, brokers, and consumers.  In fact, because we heard this same response over and over from the users we interviewed, we gave it a name: The Kafka Blindness.

Kafka’s Omnipresence has led to Kafka blindness – the enterprise’s struggle to monitor, troubleshoot and see whats happening in their Kafka clusters.

It looks like the folks at Hortonworks are building tooling around visualizing Kafka topic status.  There are a bunch of these tools out there (each one typically with its own focus and blind spots), so we’ll see how theirs stacks up.

Comments closed

Scaling Kafka With Kafka-Kit

Jamie Alquiza announces Kafka-Kit:

Kafka-Kit is a collection of tools that handle partition to broker mappings, failed broker replacements, storage based partition rebalancing, and replication auto-throttling. The two primary tools are topicmappr and autothrottle.

These tools cover two categories of our Kafka operations: data placement and replication auto-throttling.

It looks like an interesting project, and is available on GitHub.

Comments closed

Getting Started With Azure Databricks

David Peter Hansen has a quick walkthrough of Azure Databricks:

RUN MACHINE LEARNING JOBS ON A SINGLE NODE

A Databricks cluster has one driver node and one or more worker nodes. The Databricks runtime includes common used Python libraries, such as scikit-learn. However, they do not distribute their algorithms.

Running a ML job only on the driver might not be what we are looking for. It is not distributed and we could as well run it on our computer or in a Data Science Virtual Machine. However, some machine learning tasks can still take advantage of distributed computation and it a good way to take an existing single-node workflow and transition it to a distributed workflow.

This great example notebooks that uses scikit-learn shows how this is done.

Read the whole thing.

Comments closed

Considerations When Using Sort By Column In DAX

Marco Russo shows us some things to keep in mind when using Sort By Column in DAX:

A query in MDX automatically inherits the correct column sort order from the data model; the result of an MDX query is always sorted according to the Sort By Column setting. However, DAX does not have any implicit sort order for the columns other than the natural sort order of the underlying data type. For this reason, a DAX query must always specify the sort order in an ORDER BY condition – similarly to a query in SQL. Because DAX requires for a column used in ORDER BY to be part of the query result, a Power BI visual that sorts a column always generates a query that includes at least two columns: the column requested in the report and the underlying column used in the Sort By Column setting. In other words, a Power BI visual showing data by Month must generate a query that contains both Month Name and Month Number, whereas MDX only requires the Month Name column.

There’s also an interesting example where Power BI behaves differently from Excel.

Comments closed

Tips For Troubleshooting Code Problems

Bert Wagner shares some techniques he uses to troubleshoot code:

1. Rubber Duck Debugging

The first thing I usually do when I hit a wall like this is talk myself through the problem again.

This technique usually works well for me and is equivalent to those times when you ask  someone for help but realize the solution while explaining the problem to them.

To save yourself embarrassment (and to let your coworkers keep working uninterrupted), people often substitute an inanimate object, like a rubber duck, instead of a coworker to try and work out the problem on their own.

Alas, in this case explaining the problem to myself didn’t help, so I moved on to the next technique.

This one works more often than you might expect, and is a big part of the value behind pair programming.

Comments closed

Visualizing Deadlocks In SQL Sentry & Plan Explorer

Aaron Bertrand shows off new functionality in SQL Sentry and SentryOne Plan Explorer around deadlock visualization:

There’s a lot going on there, but much of it is noise. There is a whole bunch of contention on the table SqlPerf.Session — session 342 is trying to perform an update, but it is stuck waiting on shared locks taken by two services. Now, let’s check the Optimize Layout box above, and look at the circular graph again. Simplified, right?

This checkbox is easily the most powerful option to discard noise and help you focus on the crux of the deadlock issue. In the original graph, you can see that many of the elements presented are simply innocent bystanders — waiters that are captured as part of the deadlock activity, but in no way contributing to it. We can detect this in a lot of cases and so, when you check the box, we hide them from view, allowing you to focus much more directly on the key players involved in the deadlock. There is no question that eliminating the noise can really speed up troubleshooting; with those extra nodes removed, I can clearly see that I have some kind of order-of-operations issue on the SqlPerf.Session table, between the transfer service and the processor service.

Very cool.

Comments closed