From the day we started the merge until the day I left, there was a WTF moment almost every day. One of the biggest issues I had with my new manager (the whole list is too long for a single blog post) was that he treated his team like unskilled laborers. He did not support them or back them up. They were expected to do anything the dev teams asked them to do any time of day or night. Holidays, weekends, whenever. If they asked to have some code deployed, the DBAs were expected to do it within half an hour during the day or within an hour any other time. They would often get 2 or 3 deployment requests overnight. 1 AM, 2 AM, 4 AM, whatever. And the person on-call not only did it within an hour every time, but they were also expected to be in the office for their normal 40 hours that week.
There’s a lot of crazy in these stories.
The year was 2010. A consultant was working on designing and deploying System Center Operations Manager (OpsMgr) and System Center Configuration Manager (ConfigMgr) for a customer. The goal was to operationalize the enterprise software deployment process and systems monitoring. The servers were prepared, SQL Server was installed, OpsMgr and ConfigMgr configured. The project was in the stabilization phase waiting for the go-live schedule.
People fail; how much they learn from failures makes the difference.
Let me tell you the story of a small BI project I did a couple of years back. The project building a small data mart for the local branch of a very large multinational in consumer goods. The data was gathered by the retailers in the country, who send the details back to the company for analyses of the sales performance. The data was to be hosted internally at the company, not at the client. However, the server was commissioned but not yet delivered. In the meantime, I would simply start developing on my own machine – which was quite a powerful laptop with 8 cores and 32GB of RAM – and we would migrate the solution once the server arrived.
Read on for what happened. I was half-expecting “And that’s when my laptop became the new server” to be the punch line.
When you spend a reasonable amount of time churning out simple form-based UI workflows, you inevitably get bored. And you try and abstract away the drudgery of building forms into something the Common Folks can manage, by building a framework, because Lord knows you can’t buy this sort of thing off the shelf. Inevitably, the arcane knowledge of how to churn out expensive work without obviously doing expensive work can safely stay within an inner circle of senior developers who will be the only people ever allowed to try and configure a no-programming-needed workflow.
Never mind. Their hearts were in the right place.
Could be worse? Maybe they could have run a WAITFOR for the number of seconds corresponding with the string’s length and then collected that information from a Windows event log entry.
It could be databases on an instance, indexes on a table, columns in a table, etc. etc. And in case you were wondering, you can get the answers here.
You see this come up every once in a while. A forum question, a question on #sqlhelp, even in *shudder* your own systems. And the answer always comes back the same: Limits are not goals! Usually accompanied by a few jokes.
For more on this, Michael Swart coined Swart’s Ten Percent Rule.
I’m not sure what it’d scare me more, if there are many [sysadmin] or just [sa] because the first one is scary, but the second involves to find out who knows the [sa] password and knowing who did what, can be a real pain in the neck.
One way or another, as I said, I want to know the different people and level of access to my server[s], so back in the day I created this stored procedure which now I want to share with you guys.
We can find all that info using DMV’s and in my case I use sys.server_principals, sys.server_role_members and sys.server_permissions and some stored procedure which I bet it’s not that well known, sys.xp_logininfo which help to get more granular picture from Windows AD Groups.
Click through for the script.
I dug into the script – it was simple – it pulled an alphabetical list of databases from system metadata and proceeded to back them up. It didn’t do this one simple thing – leave TEMPDB off the list. So when the backups got down to TEMPDB, they promptly failed. Now as a smart person – I should have just communicated this to her and had it fixed quietly. But, I was young and rather hot headed at that time. It amazed me that a DBA with several years of experience did not know that TEMPDB cannot be backed up. So, I waited until the team meeting the next day. And when the said job failure came up again – I said that I knew the reason and stated why. I also added that this was a ‘very basic thing’ that junior DBAs are supposed to know. I was stopped right there. It did not go down well. Her face was flaming red because a consultant showed her up in front of her boss in a bad light. She said she would talk to her boss and collegues the next day (several of whom were nodding their heads disapprovingly at me) and meeting was abruptly adjourned.
In this case, I don’t think there were any good actors.
What I found was that the list of IDs was being stored in a table without a clustered index. A heap. Now – I’m not opposed to heaps at all. Heaps are often very good, and shouldn’t be derided. But you need to understand something about heaps – which is that they’re not suited to tables that have a large amount of deletes. Every time you insert a row into a heap, it goes into the first available slot on the last page of the heap. If there aren’t any slots available, it creates a new page, and the story continues. It doesn’t keep track of what’s happened earlier. They can be excellent for getting data in – and Lookups are very quick because every row is addressed by the actual Row ID, rather than some key values which then require a Seek operation to find them (that said, it’s often cheap to avoid Lookups, by adding extra columns to the Include list of a non-clustered index). But because they don’t think about what kind of state the earlier pages might be in, you can end up with heaps that are completely empty, a bunch of pointers from page to page, with header information, but no actual rows therein. If you’re deleting rows from a heap, this is what you’ll get.
This guy’s heap had only a few rows in it. 8 in fact, when I looked – although I think a few moments later those 8 had disappeared, and were replaced by 13 others.
But the table was more than 400MB in size. For 8 small rows.
Read the whole thing, including Rob’s reluctance to post on this topic.
Monday morning came along and as scheduled, our Production DBAs executed my script before start of business. 30 minutes later, the frantic calls started to reach us. Seems some of our clients could no longer “make widgets!” Accounts that they needed to route data were gone! My manager and I looked at one another in horror – we were only deleting internal accounts!!! We didn’t hesitate and immediately had our Prod DBAs back out the change with my backout script, before the rest of the United Stated started business. The backout was executed immediately and all was back to normal, but business folks were pissed and wanted to know what happened.
It was good foresight to have a backout script.
Looks like I had open transactions while my transaction log got lost during an outage. I tried switching the database online but that failed.
Msg 5181, Level 16, State 5, Line 4 Could not restart database “FAT”. Reverting to the previous status. Msg 5069, Level 16, State 1, Line 4 ALTER DATABASE statement failed.
Accessing the database is the real challenge now.
Moral of the story: have backups and have good luck.