In this article I’m going to help you start your journey with Apache Airflow
- the heart of modern Big Data platforms. I’ll explain how to run Airflow with
the Celery executor in Docker.
Apache Airflow is an open-source tool that helps you manage, run and monitor jobs based on cron schedules or external triggers. In the platforms we build at Datumo, Airflow is a key component. We use it to perform all kinds of hourly, daily and weekly operations automatically. Each pipeline in Airflow is defined as a directed acyclic graph (DAG). Let’s look at an example of a daily DAG we run on the Google Cloud Platform (GCP):
This DAG performs a set of operations every day: first it checks whether the files containing the data are already present on Google Cloud Storage, then it loads them into BigQuery, prepares a Dataproc cluster, runs a Spark enrichment job, ingests the data into Druid and shuts the cluster down. When something in the pipeline goes wrong, we receive a Slack alert.
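Below is a minimal sketch of how such a pipeline is wired up as a DAG, assuming Airflow 1.10.x-style imports. The DAG and task names are made up, and placeholder operators stand in for the real GCS, BigQuery, Dataproc and Druid operators; the Slack alert is hinted at through an on_failure_callback.

```python
# Rough sketch of the daily pipeline; DummyOperators replace the real
# GCS sensor, BigQuery load, Dataproc and Druid operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def notify_slack(context):
    # In the real pipeline this sends an alert, e.g. with SlackWebhookOperator.
    pass


with DAG(
    dag_id="daily_enrichment",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={"on_failure_callback": notify_slack},
) as dag:
    check_gcs_files = DummyOperator(task_id="check_gcs_files")    # GCS sensor
    load_to_bigquery = DummyOperator(task_id="load_to_bigquery")  # GCS -> BigQuery
    create_cluster = DummyOperator(task_id="create_cluster")      # Dataproc create
    spark_enrichment = DummyOperator(task_id="spark_enrichment")  # Spark job
    ingest_to_druid = DummyOperator(task_id="ingest_to_druid")    # Druid ingestion
    delete_cluster = DummyOperator(task_id="delete_cluster")      # Dataproc delete

    (check_gcs_files >> load_to_bigquery >> create_cluster
     >> spark_enrichment >> ingest_to_druid >> delete_cluster)
```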
Airflow with Celery executor architecture
Thanks to Airflow’s modularity, every component can be installed on a separate host or in a separate container. The diagram below shows the Airflow architecture with the Celery executor:
Starting from Airflow version 1.10.7, the webserver can be stateless. This means DAGs are fetched from the metadata database instead of being parsed by the webserver, as they were before. Starting from Airflow 2.0, the webserver works only in stateless mode.
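On 1.10.x this behaviour is opt-in and relies on DAG serialization, which can be enabled through configuration, for example with environment variables (a minimal sketch; the interval value is just an example):

```
AIRFLOW__CORE__STORE_SERIALIZED_DAGS=True
AIRFLOW__CORE__MIN_SERIALIZED_DAG_UPDATE_INTERVAL=30
```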
Airflow with Celery Workers
Running Airflow with the Celery executor has several major advantages:
- speed - workers are always up and ready to pick up tasks immediately,
- horizontal scalability - new Airflow workers can be added at any time,
- prioritization - you can give priority to your critical tasks by routing them to dedicated queues (see the sketch after this list).
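As a sketch of the prioritization point, a task can be pinned to its own Celery queue and given a higher priority weight (the DAG, task and queue names below are made up):

```python
# Pin a critical task to a dedicated queue with a higher priority_weight.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="critical_pipeline",
    schedule_interval="@hourly",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    critical_task = BashOperator(
        task_id="critical_task",
        bash_command="echo 'important work'",
        queue="critical",     # only workers listening on this queue pick it up
        priority_weight=10,   # scheduled ahead of lower-weight tasks
    )
```

A dedicated worker then consumes only that queue: `airflow worker -q critical` in 1.10.x, or `airflow celery worker -q critical` in Airflow 2.0.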
Nothing is perfect, though, so let’s have a look at the limitations of the Celery executor:
- autoscaling - there is no out-of-the-box mechanism to scale Celery workers based on task queue usage,
- waste of resources - Celery workers are created on start-up and keep running even when idle. Maybe it would be better to give these resources to other parts of the system?
- library management - you need to install all your dependencies (like Java) beforehand on all of your workers. In a big company, most of the time you are only a user of Airflow, not its administrator, and nobody will allow you to install your favorite library without going through a corporate process.
- resource allocation - you can’t configure the resources available to each Airflow task. Some tasks, like SparkSubmitOperator, need very little, as they only watch a process running on an external system. On the other hand, you can run machine learning model training as an Airflow task. It would be nice if these two kinds of jobs got different amounts of resources.
All of the problems mentioned above can be solved by running Airflow with the Kubernetes executor. It’s worth mentioning that Airflow with the Celery executor can itself be deployed on Kubernetes; thanks to K8s you can leverage KEDA to autoscale Celery workers. But this is not the end! Airflow 2.0 added a new executor
- CeleryKubernetes. Now you can combine the strengths of both executors, as sketched below.
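A minimal configuration sketch, assuming Airflow 2.0 (the queue value shown is the executor’s default):

```
AIRFLOW__CORE__EXECUTOR=CeleryKubernetesExecutor
# Tasks run on Celery workers by default; tasks whose `queue` matches the
# setting below are launched in their own Kubernetes pods instead.
AIRFLOW__CELERY_KUBERNETES_EXECUTOR__KUBERNETES_QUEUE=kubernetes
```

In a DAG you then set queue="kubernetes" on the tasks that should get their own pod, while everything else keeps running on the always-on Celery workers.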
Airflow deployment on a single host with Docker
To run Airflow in Docker we need an Airflow image. From the several images available on Docker Hub, we chose puckel/docker-airflow. This image contains a pre-installed Celery library, which we plan to use anyway. If you need to install any other libraries, you can do it in the Dockerfile, as shown below. You can create a different Dockerfile for every service to avoid installing unnecessary libraries, since you will generally only need them on the workers.
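A minimal worker Dockerfile might look like this (the base tag, the Java package and the extra Python libraries are just examples, adjust them to your needs):

```dockerfile
FROM puckel/docker-airflow:1.10.9

# The base image runs as the airflow user, so switch to root to install packages.
USER root

# System dependencies, e.g. Java for Spark-related operators.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Extra Python dependencies.
RUN pip install --no-cache-dir \
    apache-airflow[gcp]==1.10.9 \
    pyspark==2.4.7

# Drop back to the unprivileged user the entrypoint expects.
USER airflow
```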
Take a closer look at the “&&” mark in the RUN command. It’s good practice to chain commands together like this, as it reduces the number of image layers and therefore the image size, since Docker caches every layer. To keep such commands readable and maintainable, it’s recommended to split them over separate lines using backslashes.
To deploy each Airflow service in a separate container we will use docker-compose. If you are not familiar with docker-compose, please read this article first. In docker-compose.yaml we need to specify three Airflow services: webserver, scheduler and worker. If you don’t already have Redis and MySQL running externally, you can add them here as well. Every service specified in docker-compose will run in a separate container.
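A trimmed-down docker-compose.yaml could look roughly like this (the image name, credentials and paths are placeholders; Redis and MySQL are included for completeness):

```yaml
version: "3.7"

services:
  redis:
    image: redis:5

  mysql:
    image: mysql:5.7
    command: --explicit_defaults_for_timestamp=1   # required by Airflow on MySQL
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_DATABASE: airflow
      MYSQL_USER: airflow
      MYSQL_PASSWORD: airflow
    volumes:
      - mysql-data:/var/lib/mysql

  webserver:
    image: our-registry/airflow:latest   # built from the Dockerfile above
    command: webserver
    env_file: airflow.env
    depends_on: [mysql, redis]
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/usr/local/airflow/dags

  scheduler:
    image: our-registry/airflow:latest
    command: scheduler
    env_file: airflow.env
    depends_on: [webserver]
    volumes:
      - ./dags:/usr/local/airflow/dags

  worker:
    image: our-registry/airflow:latest
    command: worker
    env_file: airflow.env
    depends_on: [scheduler]
    volumes:
      - ./dags:/usr/local/airflow/dags

volumes:
  mysql-data:
```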
To configure Airflow to use Celery, we need to set a handful of configuration variables. A simple and convenient way to do this is to point the env_file field in docker-compose at a file containing them. You could instead put these settings in airflow.cfg and mount that file, but Airflow has a configuration hierarchy: environment variables take precedence over values from airflow.cfg, so the defaults set by the image would silently override your mounted file. To avoid splitting the configuration between a cfg file and an env file, I’m going to use an env file only, since I would need one anyway for the variables listed below.
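A minimal airflow.env referenced from env_file might contain the following (the hostnames match the compose service names; the credentials and the Fernet key are placeholders):

```
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql://airflow:airflow@mysql:3306/airflow
AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/1
AIRFLOW__CELERY__RESULT_BACKEND=db+mysql://airflow:airflow@mysql:3306/airflow
AIRFLOW__CORE__FERNET_KEY=<generated-fernet-key>
AIRFLOW__CORE__LOAD_EXAMPLES=False
```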
We can also create a systemd service for Airflow. This allows us to manage the whole stack easily, without having to remember the location of the docker-compose.yml file.
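An example unit file, assuming the compose project lives in /opt/airflow (the paths and the unit name are assumptions):

```ini
# /etc/systemd/system/airflow.service
[Unit]
Description=Apache Airflow (docker-compose)
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/airflow
ExecStart=/usr/local/bin/docker-compose up -d
ExecStop=/usr/local/bin/docker-compose down
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
```

After a `systemctl daemon-reload` you can manage the whole stack with `systemctl start airflow` and `systemctl stop airflow`, and enable it on boot with `systemctl enable airflow`.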
DAG deployment
Adding DAG files is very simple thanks to Docker volumes: we can mount a local folder on our machine to the default DAG folder in the container. Yet this is also the biggest drawback of this solution. Since every component of Airflow needs access to the DAG files, they have to be copied onto every machine running Airflow. You can get around this issue by using shared storage, such as an NFS server.
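With docker-compose, such shared storage can be mounted as a named volume backed by NFS instead of the local bind mount (the server address and export path below are assumptions):

```yaml
volumes:
  dags:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=nfs-server.internal,ro,nfsvers=4"
      device: ":/exports/airflow-dags"
```

Each Airflow service then mounts `dags:/usr/local/airflow/dags` instead of `./dags`.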
Airflow deployment on multiple hosts
If you want to install Redis or MySQL, or run Airflow workers, on a different host, you need to create a connection between the containers on those hosts. This problem is solved out of the box by Kubernetes, where pods can communicate with each other freely regardless of the node they run on. By default, containers created from one docker-compose file run on a single network. To connect containers across hosts you can use Docker Swarm, a built-in Docker tool intended for exactly this purpose. Setting up such a network is described in detail on the official Docker site, along with how to get containers to join it automatically; the rough outline looks like this:
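A sketch of the Swarm setup (the IP address, join token and network name are placeholders):

```bash
# On the manager host: initialize the swarm and create an attachable overlay network.
docker swarm init --advertise-addr 10.0.0.1
docker network create --driver overlay --attachable airflow-net

# On every other host: join the swarm with the token printed by `docker swarm init`.
docker swarm join --token <manager-token> 10.0.0.1:2377
```

In each docker-compose file the services then reference this network as an external one, so containers on different hosts can reach each other by service name.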
Summary
In this article I showed how to quickstart your journey with Airflow in Docker. It provides a scalable and reliable solution. Moreover, it offers a good starting point for a later move to Kubernetes, which still has some advantages over this setup: the Kubernetes executor can dynamically spawn workers, so capacity is theoretically endless and you don’t need workers running all the time, which lowers resource costs. Starting from Airflow 2.0 there is also the brand new CeleryKubernetes executor, which allows you to build an almost perfect scheduling system.
Special thanks to Michał Misiewicz.