Category: Corruption

Before we dive into fixing database corruption it is important to know what not to do. There are things we can do which will make a bad situation even worse.

Don’t Panic – most sysadmins are used to having their cages rattled and can keep cool under duress. Jumping to action without a plan is not wise. Now is not the time to start trying things and performing thoughtless actions.
Do NOT Detach the Database – we may never get it back again as it may be in a recovery pending state.
No restarting SQL Services – databases may never get started up again just like the above.
Don’t Reboot the Machine – same as above 2 points
Don’t Start by Trying to Repair the Corruption – root cause analysis is critical to preventative measures

Jeff then walks us through things that he does to discern the root cause and correct the issue (if possible).

Comments closed

Diagnosing Database Corruption

Published 2018-07-23 by Kevin Feasel

Jeff Mlakar has started a series on database corruption. His first post involves finding corruption:

The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.

Database corruption cannot entirely be prevented. It is not a matter of if but rather when

Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.

Click through for a few ways to find potential corruption.

Comments closed

Why CHECKDB Repair Invalidates Replication

Published 2018-04-05 by Kevin Feasel

Paul Randal explains why running DBCC CHECKDB on a published article will cause subscriptions to become invalidated:

Whenever I’m teaching and recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:

Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)

Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)

Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)

Reevaluate your HA/DR strategy so you don’t have to use repair again in future

One question I’m often asked is about why replication can be broken by running repair.

Read on for the answer.

Comments closed

Corrupting Managed Instances

Published 2018-03-20 by Kevin Feasel

Brent Ozar has found a bug with Azure SQL Database Managed Instances:

Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.

So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability.) I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB. I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.

It’s a good post, so check it out.

Comments closed

A Story Of Database Corruption

Published 2018-01-19 by Kevin Feasel

Jason Brimhall tells a tale of a corrupt database:

Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:

Msg 823, Level 24, State 2, Line 1

The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?

Definitely worth a read.

Comments closed

CHECKDB For Read-Only Databases?

Published 2017-03-24 by Kevin Feasel

Erik Darling answers a reader’s question:

So, can you run DBCC CHECKDB on a read only database? Should you run DBCC CHECKDB on a read only database?

tl;dr: YES AND YES!

Here’s why:
Many forms of corruption that I’ve seen have come from storage. Sure, there have been bugs that were to blame, but yeah. Most of the time, it’s the storage going all yucky.

Erik also explains some gotchas, so read the whole thing.

Comments closed

Getting Ahead Of Corruption

Published 2017-01-26 by Kevin Feasel

Mike Walsh has some recommendations before you have corrupt databases:

So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing the findings that together mean you aren’t prepared for corruption is a huge red flag. Partially it is because as a consultant, I end up seeing corruption a lot – and it is always “after the fact” and usually from clients who either chose or, more likely didn’t realize they were choosing, the option with less preparation.

So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow on post will talk about some initial steps to do if you get a report of corruption.

If you already know how you’ll solve the problem (and ideally, have a step-by-step runbook so you don’t miss anything), corruption is more of an annoyance than a catastrophe.

Comments closed

Understanding HDFS Disk Checks

Published 2016-12-21 by Kevin Feasel

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in awhile, at a relatively lower frequency, without significant I/O usage.

This is a nice article for operations folks who own Hadoop clusters.

Comments closed

DBCC CHECKDB’s NOINDEX Option

Published 2016-08-17 by Kevin Feasel

Kendra Little explains the NOINDEX option of DBCC CHECKDB and how it relates to PHYSICAL_ONLY:

PHYSICAL_ONLY looks at page structures and makes sure something isn’t obviously wrong. If you have the database’s page verification option set to CHECKSUM (the default since SQL Server 2005, and a very good thing), it runs a checksum on the page and compares it to a checksum recorded in the header when the page was last modified.

When you use the PHYSICAL_ONLY option, you’re telling SQL Server to skip logical checks. It won’t do things like make sure the data in a clustered and nonclustered index on a table is consistent with one another.

I like the analogy, as well as the how-to guide on messing up your database. But try not to do that to a production database…

Comments closed

DBCC Checking Memory-Optimized Tables

Published 2016-04-14 by Kevin Feasel

Brent Ozar shows us how to run a DBCC check against a memory-optimized table. The answer is not trivial:

THE EASY FIX: RUN FULL NATIVE BACKUPS EVERY DAY, AND FREAK OUT WHEN THEY FAIL.

Backup failures aren’t normally a big deal, but if you use in-memory OLTP on a standalone server or a failover clustered instance, backup failures are all-out emergencies. You need to immediately find out if the backup just ran out of drive space or lost its network connection, or if you have game-over Hekaton corruption.

Note that you can’t use SAN snapshot backups here. SQL Server won’t read the In-Memory OLTP pages during a snapshot backup, which means they can still be totally corrupt.

This works fine for shops with relatively small databases, say under 500GB.

Brent also has a Connect item to fix this. Upvote if you have memory-optimized, durable tables.

Comments closed