Turning Airflow into a full self-service Data Platform


Apache Airflow has become one of the most widely used data engineering tools on the market. That’s no coincidence, as it is an awesome tool: easy to install, scalable, compatible with pretty much everything on the market and very easy to use.

Even though Airflow is easy to get the hang of, it is yet another tool to learn. For teams that will not work with it on a daily basis, that is not so trivial. The result can be an overloaded data engineering team, working overtime to write DAGs by hand that could easily be generated automatically.

For example: a Data Scientist has some Spark code that retrains a model, and needs it to run weekly. So, on top of building that, you are going to make this person write an Airflow DAG for it? Come on, that’s not cool. Let’s automate that process!

That’s where your Data Platform engineer comes in. And here’s the cool thing about Airflow: DAGs are defined in code, not in a fancy UI. That means you can define many DAGs in a loop, using a config file for their intricacies. Something like:

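# Job names and the "args" key are illustrative; use whatever keys your DAG generator expects.
retrain_model:
  schedule: "0 0 * * 1"
  type: spark
  args:
    - retrain_model.py
pandas_etl:
  schedule: "0 * * * *"
  type: python
  args:
    - pandas_etl.process
bash_script:
  schedule: "0 * * * *"
  type: bash
  args:
    - bash_script.sh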

Then it is all a matter of reading the config and creating the DAGs accordingly. Starting from the basic pipeline definition in Airflow’s tutorial, we’ll add a step that:

  • reads the YAML config file and creates one DAG per entry
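
Here is a minimal sketch of the idea (not the exact file from my GitHub). It assumes Airflow 2.x import paths and a parsed config shaped as a dict of job name to an entry with `schedule`, `type` and a list of targets under a hypothetical `args` key. The function names `resolve_callable`, `make_task` and `create_dags` are made up for illustration, and the Spark job is simply handed to `spark-submit` through a `BashOperator` to keep things short; the `SparkSubmitOperator` from the Apache Spark provider package is another option.

```python
import importlib
import shlex
from datetime import datetime


def resolve_callable(dotted_path):
    """Import 'module.function' and return the function object."""
    module_name, _, func_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_path!r}")
    return getattr(importlib.import_module(module_name), func_name)


def make_task(job, dag):
    """Create the single task of a job (Airflow 2.x import paths assumed)."""
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    target = job["args"][0]  # "args" is a hypothetical config key
    if job["type"] == "python":
        # The module must be importable by Airflow, e.g. placed in the DAG folder.
        return PythonOperator(
            task_id="run", python_callable=resolve_callable(target), dag=dag
        )
    if job["type"] == "spark":
        command = f"spark-submit {shlex.quote(target)}"
        return BashOperator(task_id="run", bash_command=command, dag=dag)
    if job["type"] == "bash":
        # The trailing space stops Airflow from treating a command that
        # ends in ".sh" as the path of a Jinja template file.
        command = f"bash {shlex.quote(target)} "
        return BashOperator(task_id="run", bash_command=command, dag=dag)
    raise ValueError(f"unsupported job type: {job['type']!r}")


def create_dags(config, namespace):
    """Create one DAG per config entry and expose it in `namespace`."""
    from airflow import DAG

    for name, job in config.items():
        dag = DAG(
            dag_id=name,
            schedule_interval=job["schedule"],
            start_date=datetime(2021, 1, 1),  # arbitrary start date for the sketch
            catchup=False,
        )
        make_task(job, dag)
        # Airflow only picks up DAG objects found at module level.
        namespace[name] = dag


# In the DAG file itself (assuming PyYAML is installed and the config
# sits next to the DAG file as a hypothetical "config.yaml"):
#
#     import yaml
#     with open("config.yaml") as f:
#         create_dags(yaml.safe_load(f), globals())
```

Passing `globals()` as the namespace is what makes the generated DAGs visible to the scheduler, since Airflow looks for DAG objects among the module-level names of each file in the DAG folder.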

Voilà. Now, your Data Scientist only needs to provide their own code and a cron expression. They don’t even need to know you’re using Airflow or anything else. It also makes it easier to move to another data pipeline tool: all you need to do is write a little code that creates your jobs on the new tool from the same config.

This is a basic template. You can add many more options if you need, like allowing your config to change the default_args, adding params to your operators, etc. Heck, you can have it so every aspect of your DAG is configurable, like descriptions and dependencies.
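
As a rough sketch of that kind of extension, the snippet below merges optional per-job overrides into platform-wide defaults. The `BASE_DEFAULT_ARGS` values, the `dag_settings` helper and the optional `default_args` and `description` config keys are all assumptions for illustration; the resulting dict uses Airflow 2.x `DAG` keyword names.

```python
from datetime import timedelta

# Hypothetical platform-wide defaults; the keys follow Airflow's default_args.
BASE_DEFAULT_ARGS = {
    "owner": "data-platform",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def dag_settings(name, job):
    """Merge per-job overrides from the config into the platform defaults."""
    return {
        "dag_id": name,
        "schedule_interval": job["schedule"],
        "description": job.get("description", f"Generated from config entry '{name}'"),
        # Values from the job's config entry win over the platform defaults.
        "default_args": {**BASE_DEFAULT_ARGS, **job.get("default_args", {})},
    }


job = {"schedule": "0 0 * * 1", "type": "spark", "default_args": {"retries": 3}}
settings = dag_settings("retrain_model", job)
print(settings["default_args"]["retries"])  # 3
print(settings["default_args"]["owner"])    # data-platform
```

The dict can then be unpacked into the DAG constructor, along with a `start_date`, so a job only has to mention the settings it wants to change.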

This Python file is available on my GitHub.

DISCLAIMER: this is a quickly written piece of code. I advise using it as a reference only. Let me know if you find any mistakes. Also, feel free to copy it at your own risk.

Shout out to will, where an amazing team is doing amazing work like this and much more! If you think this is cool, check out the positions available right now, and come have a chat about all kinds of data-related things.

Don’t know us yet? Check out will bank and me on LinkedIn.

[Update: I tested the Python job and it works fine, with all files in the root of the DAG directory.]

Analytics/Data Engineering/Software Developer