One day, a manager asked me if I could help on an urgent matter: the application suddenly could no longer execute transactions on the production database and the database connection was intermittently failing. The system admin was busy with other duties, so I was the closest thing they had to a DBA. All they could tell me was the production database had crashed and they got an error message about insufficient disk space.
Click through for the rest of the story.
But what if the caller wanted the date to be “empty” (i.e. 1900-01-01)? And what if a NULL is passed?
In our environment, we’ve disallowed NULLs from our table fields. We understand that NULL is actually information– it says that the data is unknown– but we believe that for most data fields, there are non-NULL values which just as effectively represent unknown. Typically, 0’s and empty strings (and the “blank” date 1900-01-01) serve that purpose. And those values are more forgiving during coding (they equal things; they don’t make everything else “unknown”), and we accept the risk of paying little attention to which parts of our logic touched “unknown” values.
It’s an interesting look at dealing with optional and default parameters within procedures.
The programmers came to me and said we need to add a large number of columns to this table for one piece of functionality. It would more than double the total number of columns on the table. Oh, and all of the new columns would be NULL since we would only need to populate them if they were using that functionality and even then, not all of them would require data. The final result would be that 65-75% of the table would end up having nullable fields with the majority of those having NULL for the value.
I said what I think any sane DBA would say to this request: No.
Click through for the rest of the tale.
Back in my basement hideout, I spent the next couple of hours exploring the network and figuring out which server to connect to. The CTO was right; I did have enough access. I was sysadmin on the production SQL Server and had full admin access to the app server. I logged in to the app and with the help of a Profiler trace managed to figure out one of the main slow stored procedure calls that occurred any time someone saved a change via the user interface.
Pasting the procedure call into SSMS, I turned on Actual Execution Plan, hit F5, and got ready to see indications of a few missing indexes. I was ready to walk back upstairs, gloat to the CTO, and ask for a better workspace so I could continue to help. What I didn’t expect was what actually came back: Not one execution plan, or two, or three, but hundreds and hundreds. The scroll bar become progressively smaller as time clicked by and the elapsed counter did not stop running. All I’d done in the application was change the name of a single field. What was going on?
This was an amazing story full of cringe-worthy moments.
I’d seen systems that implemented both trace flags as startup parameters simultaneously. I’d helped organizations implement first T8048, then T4199 (based on the timing of my research and testing of the trace flags). This was the first time that there was a desire to implement the trace flags one-at-a-time and we had the choice of which to implement first.
I hadn’t chosen to put T8048 in first previously – that was just the way everything worked out. If I chose to follow that order – I’d be doing what I’d seen and done before. But… there was also a reason to choose the reverse order, with T4199 first. Spinlock issues – especially at that time – were considered more exotic performance issues than many of the “plan-shaping” issues that trace flag 4199 addressed. Many administrators were much more familiar with that type of performance issue – eliminating significant waits, altering plan shapes, making the logical work of queries more efficient – than with the busy wait/management overhead of spinlocks. Sometimes demonstrating an improvement that someone is already familiar with evaluating is a plus, helping to gain trust. I didn’t know of a specific reason NOT to put trace flag T4199 in place, followed by T8048 later. And in this case it seemed like building up some interpersonal capital might be a good idea.
Thinking through the full ramifications of trace flag changes is hard, even for sharp people like Lonny. Read on for the details of what happened next.
I can already hear managers saying:
If you don’t trust your employees, why employ them in the first place?
Well there is the whole accidental damage thing. I guess you could cover that by having a good backup\restore process (if your RTO and RPO permitted the downtime) but don’t expect to pass any security audits coming your way. Hint: your clients wont like this.
Plus, supposing everybody knows the sa account, there’s no way to know who accidentally(?) dropped the customer database.
From the day we started the merge until the day I left, there was a WTF moment almost every day. One of the biggest issues I had with my new manager (the whole list is too long for a single blog post) was that he treated his team like unskilled laborers. He did not support them or back them up. They were expected to do anything the dev teams asked them to do any time of day or night. Holidays, weekends, whenever. If they asked to have some code deployed, the DBAs were expected to do it within half an hour during the day or within an hour any other time. They would often get 2 or 3 deployment requests overnight. 1 AM, 2 AM, 4 AM, whatever. And the person on-call not only did it within an hour every time, but they were also expected to be in the office for their normal 40 hours that week.
There’s a lot of crazy in these stories.
The year was 2010. A consultant was working on designing and deploying System Center Operations Manager (OpsMgr) and System Center Configuration Manager (ConfigMgr) for a customer. The goal was to operationalize the enterprise software deployment process and systems monitoring. The servers were prepared, SQL Server was installed, OpsMgr and ConfigMgr configured. The project was in the stabilization phase waiting for the go-live schedule.
People fail; how much they learn from failures makes the difference.
Let me tell you the story of a small BI project I did a couple of years back. The project building a small data mart for the local branch of a very large multinational in consumer goods. The data was gathered by the retailers in the country, who send the details back to the company for analyses of the sales performance. The data was to be hosted internally at the company, not at the client. However, the server was commissioned but not yet delivered. In the meantime, I would simply start developing on my own machine – which was quite a powerful laptop with 8 cores and 32GB of RAM – and we would migrate the solution once the server arrived.
Read on for what happened. I was half-expecting “And that’s when my laptop became the new server” to be the punch line.
When you spend a reasonable amount of time churning out simple form-based UI workflows, you inevitably get bored. And you try and abstract away the drudgery of building forms into something the Common Folks can manage, by building a framework, because Lord knows you can’t buy this sort of thing off the shelf. Inevitably, the arcane knowledge of how to churn out expensive work without obviously doing expensive work can safely stay within an inner circle of senior developers who will be the only people ever allowed to try and configure a no-programming-needed workflow.
Never mind. Their hearts were in the right place.
Could be worse? Maybe they could have run a WAITFOR for the number of seconds corresponding with the string’s length and then collected that information from a Windows event log entry.