So, can you run DBCC CHECKDB on a read only database? Should you run DBCC CHECKDB on a read only database?
tl;dr: YES AND YES!
Many forms of corruption that I’ve seen have come from storage. Sure, there have been bugs that were to blame, but yeah. Most of the time, it’s the storage going all yucky.
Erik also explains some gotchas, so read the whole thing.
So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing the findings that together mean you aren’t prepared for corruption is a huge red flag. Partially it is because as a consultant, I end up seeing corruption a lot – and it is always “after the fact” and usually from clients who either chose or, more likely didn’t realize they were choosing, the option with less preparation.
So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow on post will talk about some initial steps to do if you get a report of corruption.
If you already know how you’ll solve the problem (and ideally, have a step-by-step runbook so you don’t miss anything), corruption is more of an annoyance than a catastrophe.
The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.
Therefore, block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in awhile, at a relatively lower frequency, without significant I/O usage.
This is a nice article for operations folks who own Hadoop clusters.
PHYSICAL_ONLY looks at page structures and makes sure something isn’t obviously wrong. If you have the database’s page verification option set to CHECKSUM (the default since SQL Server 2005, and a very good thing), it runs a checksum on the page and compares it to a checksum recorded in the header when the page was last modified.
When you use the PHYSICAL_ONLY option, you’re telling SQL Server to skip logical checks. It won’t do things like make sure the data in a clustered and nonclustered index on a table is consistent with one another.
I like the analogy, as well as the how-to guide on messing up your database. But try not to do that to a production database…
THE EASY FIX: RUN FULL NATIVE BACKUPS EVERY DAY, AND FREAK OUT WHEN THEY FAIL.
Backup failures aren’t normally a big deal, but if you use in-memory OLTP on a standalone server or a failover clustered instance, backup failures are all-out emergencies. You need to immediately find out if the backup just ran out of drive space or lost its network connection, or if you have game-over Hekaton corruption.
Note that you can’t use SAN snapshot backups here. SQL Server won’t read the In-Memory OLTP pages during a snapshot backup, which means they can still be totally corrupt.
This works fine for shops with relatively small databases, say under 500GB.
Brent also has a Connect item to fix this. Upvote if you have memory-optimized, durable tables.
A fix has been published for SQL Server 2012 (SP2 and SP3), and one is coming for SQL Server 2014 (RTM and SP1) – though in this most recent update, SQL Server 2014 was removed from the list of affected versions (not sure it was accurate to remove those). I will update this space once the next 2014 CUs are released, but for now you should plan to rebuild tables after dropping columns as a matter of course and, more importantly, as part of the same maintenance operation.
This is a good reason to stay up to date on CUs.
I suspected that there was some difference between the queries that failed and the ones that were successful in SSMS. It ran the query they gave me, and I got the same error. I got disconnected and no further error info was returned. I also verified that the same query was successful on the otehr two tables mentioned. No errors on the other tables.
I wanted to know what error was causing the connection to be terminated, so I checked the SQL log and discovered that every time it failed, it was generating a stack dump. Before I was done investigating, it had generated 21 stack dumps. The key user-usable error info in the log was:
* Exception Address = 00007FF93BCF7E08 Module(sqlmin+00000000001E7E08)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address 0000000000000000
It turns out that CHECKDB & CHECKTABLE do not look at statistics. If you find yourself in this situation, it’s not a bad idea to see if this is the cause.
An interesting corruption problem cropped up on the MCM distribution list yesterday and after I figured it out, I thought it would make a good blog post in case anyone hits a similar problem.
In a nutshell, the problem was corruption such that a simple SELECT * query failed, but a SELECT * query with an ORDER BY clause worked.
This is interesting, but hopefully you don’t encounter these types of problems very often in your environment.