Building an ETL Pipeline with Airflow and Containers

Nikita Vasilev needs to move some data:

Obviously, we can use one of the many ready-made ETL systems that load information into a corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, and Pentaho are just a sliver of the off-the-shelf solutions. However, when it comes to large volumes of data at high speeds, with Big Data infrastructure already in place, boxed solutions fall short of your needs.

Therefore, Big Data pipelines call for something like Apache Airflow. It’s an open-source set of libraries for developing, scheduling, and monitoring workflows. Airflow is written in Python and lets you define and configure task chains in Python code, then inspect and manage them visually through a clear web GUI.
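
To make the idea of a "task chain" concrete, here is a minimal sketch in plain Python. It is not Airflow's API: it only uses the standard library's graphlib module to run three hypothetical tasks (extract, transform, load) in dependency order, which is roughly what a workflow scheduler does for you. The task names, the sample rows, and the run_pipeline helper are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory source rows; a real pipeline would read from a database or files.
SOURCE_ROWS = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]


def extract(context):
    # Copy the raw rows into the shared context.
    context["raw"] = list(SOURCE_ROWS)


def transform(context):
    # Convert the string amounts to floats.
    context["clean"] = [
        {"id": row["id"], "amount": float(row["amount"])} for row in context["raw"]
    ]


def load(context):
    # Stand-in for a warehouse write: store the total of the cleaned amounts.
    context["warehouse_total"] = sum(row["amount"] for row in context["clean"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it can run.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline():
    """Run every task once, in an order that respects DEPENDENCIES."""
    context = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](context)
    return order, context


if __name__ == "__main__":
    order, context = run_pipeline()
    print(order)
    print(context["warehouse_total"])
```

In Airflow itself, the same chain would be declared as tasks inside a DAG with dependencies expressed between them, and the scheduler, retries, and monitoring UI come with the platform rather than being hand-rolled as above.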

Click through for an example using Airflow with AWS’s Elastic Container Service.