Airflow DAGs in ML

Apache Airflow is a workflow scheduler and pipeline manager that lets you define, schedule, and monitor data processing tasks using Python.


In recommendation systems, it’s used to automate complex pipelines like:
-data ingestion
-feature extraction
-model training
-evaluation
-deployment
-logging & feedback loops

Core Concepts in Airflow


DAG (Directed Acyclic Graph)

-a DAG is a Python file that defines the tasks (units of work) and the order of execution (dependencies)

-each DAG runs on a schedule, or can be triggered manually

Real-World Use in RecSys:

[Extract logs]→[Preprocess features]→[Train model]→[Evaluate AUC/NDCG]→[Deploy model]

DAGs

A DAG is a model that encapsulates everything needed to execute a workflow. Some DAG attributes include the following:

-Schedule: when the workflow should run

-Tasks: tasks are discrete units of work that are run on workers

-Task Dependencies: the order and conditions under which tasks execute

-Callbacks: actions to take when the entire workflow completes

-Additional Parameters: and many other operational details

The DAG itself doesn’t care about what is happening inside the tasks; it is merely concerned with how to execute them: the order to run them in, how many times to retry them, whether they have timeouts, and so on.
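As a rough sketch (not taken from any particular production setup), these operational details can be expressed as DAG-level arguments; the retry, timeout, and callback values below are illustrative assumptions:

import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_on_failure(context):
	# Hypothetical callback; in practice this might post to Slack or page an on-call rotation
	print(f"DAG run failed: {context['dag_run']}")


with DAG(
	dag_id="operational_settings_example",
	start_date=datetime.datetime(2021, 1, 1),
	schedule="@daily",
	default_args={
		"retries": 2,  # retry each task up to twice
		"retry_delay": datetime.timedelta(minutes=5),  # wait 5 minutes between retries
		"execution_timeout": datetime.timedelta(hours=1),  # fail a task that runs longer than an hour
	},
	on_failure_callback=notify_on_failure,  # action to take when a DAG run fails
):
	EmptyOperator(task_id="placeholder_task")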

Declaring a DAG

There are three ways to declare a DAG. Here we focus on the first, which uses a with statement (context manager): anything defined inside the block is implicitly added to the DAG.

import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
	dag_id="my_dag_name",
	start_date=datetime.datetime(2021, 1, 1),
	schedule="@daily",
):
	EmptyOperator(task_id="task")



Task Dependencies

A Task/Operator does not usually live alone; it has dependencies on other tasks (those upstream of it), and other tasks depend on it (those downstream of it). Declaring these dependencies between tasks is what makes up the DAG structure (the edges of the directed acyclic graph).

There are two main ways to declare individual task dependencies. The recommended one is to use the >> and << operators:

first_task >> [second_task, third_task]

third_task << fourth_task
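Putting this together with the recommendation-system pipeline sketched earlier, the dependencies might look like the following; the task names and EmptyOperator placeholders are assumptions standing in for real work:

import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
	dag_id="recsys_training_pipeline",
	start_date=datetime.datetime(2021, 1, 1),
	schedule="@daily",
):
	# Placeholder tasks standing in for the real work at each stage
	extract_logs = EmptyOperator(task_id="extract_logs")
	preprocess_features = EmptyOperator(task_id="preprocess_features")
	train_model = EmptyOperator(task_id="train_model")
	evaluate_model = EmptyOperator(task_id="evaluate_model")
	deploy_model = EmptyOperator(task_id="deploy_model")

	# The >> operator declares the edges of the DAG
	extract_logs >> preprocess_features >> train_model >> evaluate_model >> deploy_model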

Adding an EMR Pipeline to Your DAG

EMR (Elastic MapReduce) is Amazon’s service for running big data processing jobs, often using Spark, Hive, or Presto/Trino. It’s typically used for offline computation such as generating training data, feature engineering, and log processing.

Adding an EMR pipeline to a DAG means adding a task (or series of tasks) to your Airflow DAG that launches a job on an EMR cluster.

A common pattern for an EMR pipeline is the following four stages (sketched in code after the list):

Create EMR cluster

-spins up a new EMR cluster with your desired config (number of nodes, Spark version, etc.)

-in recommendation systems, this could involve choosing instance types, logging setup, etc.

Add step to the cluster

-adds a step to the running cluster like a Spark job, Hive query, or Presto job

-references the cluster ID from step 1

-this could be a Spark job that reads raw logs, parses them, joins them with content metadata, and writes an aggregated dataset to a new S3 location

-this is the core work: feature computation, candidate generation, or model training

Wait for Step to finish

-waits for a step to complete

-polls EMR periodically

-ensures downstream tasks don’t proceed until the job is done

-main purpose is monitoring

Terminate the EMR cluster

-shuts down the EMR cluster to avoid wasting resources and cost
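
A minimal sketch of this four-stage pattern using the EMR operators from the Amazon provider package (apache-airflow-providers-amazon); the cluster configuration, Spark step, and S3 paths are placeholder assumptions, not a runnable production job:

import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
	EmrAddStepsOperator,
	EmrCreateJobFlowOperator,
	EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder cluster config; real values depend on your account, roles, and workload
JOB_FLOW_OVERRIDES = {
	"Name": "recsys-feature-generation",
	"ReleaseLabel": "emr-6.15.0",
	"Instances": {
		"InstanceGroups": [
			{"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
			{"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
		],
		"KeepJobFlowAliveWhenNoSteps": True,
	},
}

# Placeholder Spark step; the script path is hypothetical
SPARK_STEPS = [
	{
		"Name": "generate_training_data",
		"ActionOnFailure": "CANCEL_AND_WAIT",
		"HadoopJarStep": {
			"Jar": "command-runner.jar",
			"Args": ["spark-submit", "s3://my-bucket/jobs/generate_training_data.py"],
		},
	}
]

with DAG(
	dag_id="emr_feature_pipeline",
	start_date=datetime.datetime(2021, 1, 1),
	schedule="@daily",
):
	# 1. Create EMR cluster
	create_cluster = EmrCreateJobFlowOperator(
		task_id="create_cluster",
		job_flow_overrides=JOB_FLOW_OVERRIDES,
	)

	# 2. Add a step to the cluster, referencing the cluster ID from step 1 via XCom
	add_steps = EmrAddStepsOperator(
		task_id="add_steps",
		job_flow_id=create_cluster.output,
		steps=SPARK_STEPS,
	)

	# 3. Wait for the step to finish by polling EMR
	wait_for_step = EmrStepSensor(
		task_id="wait_for_step",
		job_flow_id=create_cluster.output,
		step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
	)

	# 4. Terminate the cluster to stop paying for it
	terminate_cluster = EmrTerminateJobFlowOperator(
		task_id="terminate_cluster",
		job_flow_id=create_cluster.output,
		trigger_rule="all_done",  # shut down even if the step failed
	)

	create_cluster >> add_steps >> wait_for_step >> terminate_cluster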