I’m seeing reports from a few people that DBCC CHECKDB is reporting PFS corruption after an upgrade from SQL Server 2014 to SQL Server 2016 or later. The symptom is that you run DBCC CHECKDB after the upgrade and get output similar to this:
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3863) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3864) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object.
CHECKDB found 2 allocation errors and 0 consistency errors in database 'MyProdDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MyProdDB).
I’ve discussed this with the SQL Server team, and it’s a known bug in SQL Server 2014.
Read on for the fix and additional good advice.
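If you want to confirm what you’re seeing before going any further, the check itself is just a full DBCC CHECKDB; a minimal example, using the database name from the output above:

```sql
-- Full consistency check: suppress informational messages, show every error.
DBCC CHECKDB (N'MyProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```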
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:2806320; actual 0:0). It occurred during a read of page (1:xxxxx) in database ID 5 at offset 0x00000xxxxx0000 in file ‘C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\YourDatabaseName.mdf’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Since this is one of those things that I regularly work with, I thought I would see what other people are saying about this error message, and boy oh boy did I find some crazy and outright damaging suggestions.
Steve puts together a bunch of really bad advice and explains why you shouldn’t follow it. Read the whole thing and listen to Steve’s advice, not the bad advice.
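Before acting on an 824 like the one above, it’s also worth seeing what the server has already recorded on its own. A quick look at msdb’s suspect_pages table (which SQL Server populates when it hits 823/824 and checksum errors):

```sql
-- Pages that have hit 823/824 or checksum failures, most recent first.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```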
Before we dive into fixing database corruption, it is important to know what not to do. There are things we can do that will make a bad situation even worse.
Don’t Panic – most sysadmins are used to having their cages rattled and can keep cool under duress. Jumping to action without a plan is not wise. Now is not the time to start trying things and performing thoughtless actions.
Do NOT Detach the Database – we may never get it back again, as it may end up in a recovery pending state.
No Restarting SQL Services – just like the above, the database may never start up again.
Don’t Reboot the Machine – same as the above two points.
Don’t Start by Trying to Repair the Corruption – root cause analysis is critical to preventative measures.
Jeff then walks us through things that he does to discern the root cause and correct the issue (if possible).
Jeff Mlakar has started a series on database corruption. His first post involves finding corruption:
The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O subsystem. Any component in the I/O subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switches, network cables, SAN, etc.
Database corruption cannot entirely be prevented. It is not a matter of if but rather when.
Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.
Click through for a few ways to find potential corruption.
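One concrete piece of mitigation: since you can’t stop the I/O subsystem from misbehaving, make sure SQL Server detects damage as soon as it reads a bad page. A sketch for checking and enabling CHECKSUM page verification (MyProdDB is a placeholder name):

```sql
-- Find databases not using CHECKSUM page verification.
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> N'CHECKSUM';

-- Turn it on; note that only pages written after this point get checksums.
ALTER DATABASE MyProdDB SET PAGE_VERIFY CHECKSUM;
```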
Whenever I’m teaching about recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:
- Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB’s access to a portion of the database)
- Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
- Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
- Reevaluate your HA/DR strategy so you don’t have to use repair again in the future
One question I’m often asked is why replication can be broken by running repair.
Read on for the answer.
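For reference, the repair-then-verify sequence from the list above looks roughly like this in T-SQL; a sketch only, with a hypothetical table name, since repair requires single-user mode:

```sql
-- Repair can only run in single-user mode.
ALTER DATABASE MyProdDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'MyProdDB', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE MyProdDB SET MULTI_USER;

-- Check again to make sure repair fixed everything and exposed nothing new.
DBCC CHECKDB (N'MyProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Recheck constraints on any repaired table (dbo.Orders is a placeholder).
DBCC CHECKCONSTRAINTS (N'dbo.Orders') WITH ALL_CONSTRAINTS;
```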
Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.
So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability). I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB). I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.
It’s a good post, so check it out.
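If you’re curious what the corruption step itself looks like, it’s built on the undocumented DBCC WRITEPAGE command; a sketch with made-up database and page numbers, strictly for throwaway test instances (the direct-I/O form requires single-user mode):

```sql
-- DANGER: deliberately corrupts a page. Throwaway test databases only.
ALTER DATABASE CorruptMe SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Overwrite 2 bytes at offset 0 of page (1:256), bypassing the buffer pool,
-- so the page's checksum no longer matches and the next read raises an 824.
DBCC WRITEPAGE (N'CorruptMe', 1, 256, 0, 2, 0x0000, 1);
```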
Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:
Msg 823, Level 24, State 2, Line 1
The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?
Definitely worth a read.
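If you’d rather not trawl the log files by hand for messages like that, the (undocumented but widely used) xp_readerrorlog procedure can search them for you; a sketch against the current log:

```sql
-- Search the current SQL Server error log (log 0, type 1) for 823 messages.
EXEC master.dbo.xp_readerrorlog 0, 1, N'823';
```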
So, can you run DBCC CHECKDB on a read only database? Should you run DBCC CHECKDB on a read only database?
tl;dr: YES AND YES!
Many forms of corruption that I’ve seen have come from storage. Sure, there have been bugs that were to blame, but yeah. Most of the time, it’s the storage going all yucky.
Erik also explains some gotchas, so read the whole thing.
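There’s nothing special about actually doing it, either; a minimal demo with a placeholder database name:

```sql
-- Make the database read-only, then check it exactly as usual.
ALTER DATABASE MyProdDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'MyProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Put it back to read-write afterwards, if that's its normal state.
ALTER DATABASE MyProdDB SET READ_WRITE;
```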
So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing a set of findings that together mean you aren’t prepared for corruption is a huge red flag. Partly that is because, as a consultant, I end up seeing corruption a lot, and it is always "after the fact", usually from clients who either chose, or more likely didn’t realize they were choosing, the option with less preparation.
So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow-on post will talk about some initial steps to take if you get a report of corruption.
If you already know how you’ll solve the problem (and ideally, have a step-by-step runbook so you don’t miss anything), corruption is more of an annoyance than a catastrophe.
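Two cheap pieces of that preparedness that belong in any runbook: back up with checksums so a corrupt page fails the backup instead of silently propagating into it, and verify the backup afterwards. A sketch with placeholder names and paths:

```sql
-- Fail the backup if any page with a checksum turns out to be corrupt.
BACKUP DATABASE MyProdDB
  TO DISK = N'X:\Backups\MyProdDB.bak'
  WITH CHECKSUM, INIT;

-- Confirm the backup is readable and its checksums are valid.
RESTORE VERIFYONLY
  FROM DISK = N'X:\Backups\MyProdDB.bak'
  WITH CHECKSUM;
```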
The function of the block scanner is to scan block data to detect possible corruption. Since data corruption may happen at any time, on any block, on any DataNode, it is important to identify those errors in a timely manner. That way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to use too many resources, so that disk I/O can still serve actual requests.
Therefore, the block scanner needs to make sure that suspicious blocks are scanned relatively quickly, while other blocks are scanned every once in a while, at a lower frequency, without significant I/O usage.
This is a nice article for operations folks who own Hadoop clusters.
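If you want to see where those knobs live, both the scan interval and the I/O throttle are exposed in hdfs-site.xml; the values below are just the defaults on recent HDFS releases, shown for illustration:

```xml
<!-- hdfs-site.xml: block scanner tuning (illustrative values) -->
<property>
  <!-- How often every block should be scanned, in hours (504 hours = 3 weeks). -->
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>
<property>
  <!-- Throttle: bytes per second per volume the scanner may read (1 MiB/s). -->
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>1048576</value>
</property>
```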