In April 2015, we discussed stale and dirty reads in MongoDB 2.6.7. However, writes appeared to be safe; update-only workloads with majority write concern were linearizable. This conclusion was not entirely correct. In this Jepsen analysis, we develop new tests which show the MongoDB v0 replication protocol is intrinsically unsafe, allowing the loss of majority-committed documents. In addition, we show that the new v1 replication protocol has multiple bugs, allowing data loss in all versions up to MongoDB 3.2.11 and 3.4.0-rc4. While the v0 protocol remains broken, patches for v1 are available in MongoDB 3.2.12 and 3.4.0, and now pass the expanded Jepsen test suite. This work was funded by MongoDB, and conducted in accordance with the Jepsen ethics policy.
Mongo has grown up when it comes to data integrity, though be sure you’re using the v1 replication protocol.
During development and initial testing on our own hardware, we had the migration at the time running at ~25minutes for around 600 packages (ie. tables) covering (what we termed) RawSource–>Source–>Staging which was well within the performance requirements for the stage that development was at and for what was initially set out. The rest of this blog post will hone in specifically on Source–>Staging only.
However, once we transferred the solution to the clients development environment things took a turn for the worse. In our environment we were running VMs with 8 cores, 16GB RAM and utlising SSDs. The client environment was running SQL Server 2016 Enterprise on VMWare vSphere 5.5, 8 vCPUs, 32GB RAM (for Integration, Development was half this) but the infrastructure team have done everything in their power to force all VMs onto the lower tier (ie. slow disks) of their 3-PAR SAN and throttle them in every way possible, just to make things more of a challenge. Even though the VM’s themselves were throttled we were confident that we wouldn’t see too much of a performance impact, especially as this was only a subset of the processing to be done so we needed it to be quick and it will only ever get longer and longer.
Chris walks through the hallmarks of when Delayed Durability might work, and the big one for me is the way data migration works: full reloads. The important thing is to have a durable source and a process to repeat data loads when things get missed; in this case, it’s a full reload, but in other cases it could be watchdog applications which compare data sets on each side.
Recently, while considering possible use cases for Delayed Durability, it occurred to me that data loss might be entirely acceptable in cases where the data would not truly be lost. I have worked with a number of applications that have processes that purge old information from the database. If a purge process failed in these applications, data would simply live a little bit longer, and be purged the next time the process was successful – they have a recovery mechanism built in as it is. I decided to test Delayed Durability in a database with a long-running purge to observe the potential performance impact. I chose a process that was clearly contributing to transaction log waits, because that is where the real performance impact comes from when delaying durability. If you do not have notable waits or some level of a bottleneck there, you are not likely to improve anything simply by turning on this feature.
I was not aware that you could set durability at the transaction level; I was under the mistaken impression that once you flipped the switch, all transactions were subject to Delayed Durability. Disk-heavy operations (like large batches of deletions) does seem like a good use case for this.