The Kafka community has a culture of deep and extensive code review that tries to proactively find correctness and performance issues. Code review is, of course, a pretty common practice in software engineering but it is often cursory check of style and high-level design. We’ve found a deeper investment of time in code review really pays off.
The failures in distributed systems often have to do with error conditions, often in combinations and states that can be difficult to trigger in a targeted test. There is simply no substitute for a deeply paranoid individual going through new code line-by-line and spending significant time trying to think of everything that could go wrong. This often helps to find the kind of rare problem that can be hard to trigger in a test.
Testing data processing engines is difficult, particularly distributed systems where things like network partitions and transient errors are hard to reproduce in a test environment.