This problem is easy to solve, right? You can write scripts that run jobs in sequence, and use the output of one program as the input to another—no problem. But what if your workflow is complex and requires specific triggers, such as specific data volumes or resource constraints, or must meet strict SLAs? What if parts of your workflow don’t depend on each other and can be run in parallel?
Building your own infrastructure around this problem can seem like an attractive idea, but doing so can quickly become laborious. If, or rather when, those requirements change, modifying such a tool isn’t easy . And what if you need monitoring around these jobs? Monitoring requires another set of tools and headaches.
This is a pretty detailed look at the basics of Oozie.