Whenever I’m teaching and recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:
- Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)
- Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
- Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
- Reevaluate your HA/DR strategy so you don’t have to use repair again in future
One question I’m often asked is about why replication can be broken by running repair.
Read on for the answer.
Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.
So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability.) I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB. I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.
It’s a good post, so check it out.
Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:
Msg 823, Level 24, State 2, Line 1
The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?
Definitely worth a read.
So, can you run DBCC CHECKDB on a read only database? Should you run DBCC CHECKDB on a read only database?
tl;dr: YES AND YES!
Many forms of corruption that I’ve seen have come from storage. Sure, there have been bugs that were to blame, but yeah. Most of the time, it’s the storage going all yucky.
Erik also explains some gotchas, so read the whole thing.
So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing the findings that together mean you aren’t prepared for corruption is a huge red flag. Partially it is because as a consultant, I end up seeing corruption a lot – and it is always “after the fact” and usually from clients who either chose or, more likely didn’t realize they were choosing, the option with less preparation.
So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow on post will talk about some initial steps to do if you get a report of corruption.
If you already know how you’ll solve the problem (and ideally, have a step-by-step runbook so you don’t miss anything), corruption is more of an annoyance than a catastrophe.
The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.
Therefore, block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in awhile, at a relatively lower frequency, without significant I/O usage.
This is a nice article for operations folks who own Hadoop clusters.
PHYSICAL_ONLY looks at page structures and makes sure something isn’t obviously wrong. If you have the database’s page verification option set to CHECKSUM (the default since SQL Server 2005, and a very good thing), it runs a checksum on the page and compares it to a checksum recorded in the header when the page was last modified.
When you use the PHYSICAL_ONLY option, you’re telling SQL Server to skip logical checks. It won’t do things like make sure the data in a clustered and nonclustered index on a table is consistent with one another.
I like the analogy, as well as the how-to guide on messing up your database. But try not to do that to a production database…
THE EASY FIX: RUN FULL NATIVE BACKUPS EVERY DAY, AND FREAK OUT WHEN THEY FAIL.
Backup failures aren’t normally a big deal, but if you use in-memory OLTP on a standalone server or a failover clustered instance, backup failures are all-out emergencies. You need to immediately find out if the backup just ran out of drive space or lost its network connection, or if you have game-over Hekaton corruption.
Note that you can’t use SAN snapshot backups here. SQL Server won’t read the In-Memory OLTP pages during a snapshot backup, which means they can still be totally corrupt.
This works fine for shops with relatively small databases, say under 500GB.
Brent also has a Connect item to fix this. Upvote if you have memory-optimized, durable tables.
A fix has been published for SQL Server 2012 (SP2 and SP3), and one is coming for SQL Server 2014 (RTM and SP1) – though in this most recent update, SQL Server 2014 was removed from the list of affected versions (not sure it was accurate to remove those). I will update this space once the next 2014 CUs are released, but for now you should plan to rebuild tables after dropping columns as a matter of course and, more importantly, as part of the same maintenance operation.
This is a good reason to stay up to date on CUs.
I suspected that there was some difference between the queries that failed and the ones that were successful in SSMS. It ran the query they gave me, and I got the same error. I got disconnected and no further error info was returned. I also verified that the same query was successful on the otehr two tables mentioned. No errors on the other tables.
I wanted to know what error was causing the connection to be terminated, so I checked the SQL log and discovered that every time it failed, it was generating a stack dump. Before I was done investigating, it had generated 21 stack dumps. The key user-usable error info in the log was:
* Exception Address = 00007FF93BCF7E08 Module(sqlmin+00000000001E7E08)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address 0000000000000000
It turns out that CHECKDB & CHECKTABLE do not look at statistics. If you find yourself in this situation, it’s not a bad idea to see if this is the cause.