Eric Blinn has started a new series on database corruption. In part one, he shows us how to corrupt a database (probably by letting it listen to Alice Cooper and Frank Zappa):
I’m going to start a series of posts regarding corruption detection, prevention, and correction. In order to detect corruption we must first have a corrupt database. This post is going to show how to purposely corrupt a database. I’ll refer back to this post often as it will be the basis of most of the future posts in this series which will pick up where this leaves off.
Step 1. Create a new database. You didn’t think I was going to ask you to corrupt a real database did you? We will create one table with some names in it. The table is abnormally wide on purpose. The goal is to get fewer rows on any one page of data so that we can get several pages of data without needing many hundreds of rows.
Page verification is a process in SQL Server where the engine writes extra data to the header of the page while writing it to disk. This extra data can be used to verify that the data later read from that page is what was expected. There are 3 options for the setting. They are NONE, CHECKSUM, and TORN_PAGE_DETECTION. The torn page option is deprecated. It should not be used and will not be covered in this series.
When set to CHECKSUM SQL Server will quickly determine a checksum for the page of data and write it to the header of the data page. Any time from that point forward when the page is read from disk SQL Server will perform the same checksum calculation and compare that to to the stored value in the header of the page. If the value matches that would indicate there is probably not corruption. If the values do not match that would indicate there is almost certainly some kind of corruption.
Corruption is one of the scariest things that can happen to a database; knowing how to respond to an incident of corruption is critical even as the incidence of corruption is quite low in practice (unless you’re running on awful hardware).
I’m seeing reports from a few people of DBCC CHECKDB reporting PFS corruption after an upgrade from SQL Server 2014 to SQL Server 2016 or later. The symptoms are that you run DBCC CHECKDB after the upgrade and get output similar to this:
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3863) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3864) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object.
CHECKDB found 2 allocation errors and 0 consistency errors in database 'MyProdDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MyProdDB).
I’ve discussed with the SQL Server team and this is a known bug in SQL Server 2014.
Read on for the fix and additional good advice.
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:2806320; actual 0:0). It occurred during a read of page (1:xxxxx) in database ID 5 at offset 0x00000xxxxx0000 in file ‘C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\YourDatabaseName.mdf’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Since this is one of those things that I regularly work with, I thought I would see what other people are saying about this error message, and boy oh boy did I found some crazy and outright damaging suggestions
Steve puts together a bunch of really bad advice and explains why you shouldn’t follow it. Read the whole thing and listen to Steve’s advice, not the bad advice.
Before we dive into fixing database corruption it is important to know what not to do. There are things we can do which will make a bad situation even worse.
Don’t Panic – most sysadmins are used to having their cages rattled and can keep cool under duress. Jumping to action without a plan is not wise. Now is not the time to start trying things and performing thoughtless actions.
Do NOT Detach the Database – we may never get it back again as it may be in a recovery pending state.
No restarting SQL Services – databases may never get started up again just like the above.
Don’t Reboot the Machine – same as above 2 points
Don’t Start by Trying to Repair the Corruption – root cause analysis is critical to preventative measures
Jeff then walks us through things that he does to discern the root cause and correct the issue (if possible).
Jeff Mlakar has started a series on database corruption. His first post involves finding corruption:
The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.
Database corruption cannot entirely be prevented. It is not a matter of if but rather when
Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.
Click through for a few ways to find potential corruption.
Whenever I’m teaching and recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:
- Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)
- Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
- Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
- Reevaluate your HA/DR strategy so you don’t have to use repair again in future
One question I’m often asked is about why replication can be broken by running repair.
Read on for the answer.
Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.
So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability.) I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB. I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.
It’s a good post, so check it out.
Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:
Msg 823, Level 24, State 2, Line 1
The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?
Definitely worth a read.
So, can you run DBCC CHECKDB on a read only database? Should you run DBCC CHECKDB on a read only database?
tl;dr: YES AND YES!
Many forms of corruption that I’ve seen have come from storage. Sure, there have been bugs that were to blame, but yeah. Most of the time, it’s the storage going all yucky.
Erik also explains some gotchas, so read the whole thing.
So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing the findings that together mean you aren’t prepared for corruption is a huge red flag. Partially it is because as a consultant, I end up seeing corruption a lot – and it is always “after the fact” and usually from clients who either chose or, more likely didn’t realize they were choosing, the option with less preparation.
So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow on post will talk about some initial steps to do if you get a report of corruption.
If you already know how you’ll solve the problem (and ideally, have a step-by-step runbook so you don’t miss anything), corruption is more of an annoyance than a catastrophe.