Multi-Intervention Sequential Decision Modeling

Overview — I built an end-to-end experimental pipeline to study reinforcement learning in sequential decision settings where multiple interacting interventions must be selected in combination at each timestep. The pipeline integrates synthetic data generation, model training, controlled experimentation, and evaluation. At its core is a graph-structured reinforcement learning architecture that produces intervention-level value estimates, enabling scalable decision-making without exhaustively evaluating exponentially many intervention combinations.

Setting — Columbia IEOR PhD program, advised by Prof. Lily Xu.

Github — graph-structured-rl.

Language — Python.

Tech Stack — PyTorch, PyTorch Geometric, Gymnasium, Optuna, NetworkX, scikit-learn, Gurobi.

Sequential Decision Model

I built an end-to-end experimental pipeline to improve how reinforcement learning agents make decisions when multiple interventions must be chosen in combination at each timestep. The motivation came from clinical-style decision settings where a combination of treatments (e.g., drug type, intensity, frequency) must be selected at each step, often with limited data.

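Standard reinforcement learning struggles in these environments. The number of joint actions grows exponentially with the number of interventions: if each of n interventions has k options, there are k^n possible combinations. Enumerating or learning a value for every combination quickly becomes impractical, making learning unstable and sample-inefficient.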
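
To make the blow-up concrete, here is a small sketch that counts joint actions when one option is picked per intervention. The option counts are made-up illustrative numbers, not values from the project.

```python
from itertools import product
from math import prod


def joint_action_count(options_per_intervention):
    """Number of distinct combinations when one option is picked per intervention."""
    return prod(options_per_intervention)


# Hypothetical example: 3 interventions (say drug type, intensity, frequency)
# with 4, 3, and 5 options respectively.
small = [4, 3, 5]
print(joint_action_count(small))  # 60

# Enumerating the combinations explicitly gives the same count.
assert len(list(product(*(range(k) for k in small)))) == joint_action_count(small)

# With 10 interventions of 4 options each, the joint space has 4**10 actions.
print(joint_action_count([4] * 10))  # 1048576
```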

We can visualize the intervention options and the state variables as nodes in a graph.

graph-rl-drawing-1

The core idea is to exploit structure. First, we estimate a dependency graph between interventions and system state variables using regression-based methods. This produces an approximate map of how individual interventions influence the system.
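
The write-up does not say which regression method is used, so the following is only a minimal sketch of the general idea, in plain Python. It assumes binary intervention indicators, regresses each state variable's per-step change on each intervention separately, and keeps an edge when the fitted slope exceeds a threshold. The function name, the threshold, and the synthetic ground-truth effects are all hypothetical.

```python
import random


def estimate_dependency_edges(actions, deltas, threshold=0.2):
    """Return (intervention, state_var, slope) triples with |slope| > threshold.

    actions: list of rows, each a list of 0/1 intervention indicators
    deltas:  list of rows, each a list of per-step changes in the state variables
    """
    n = len(actions)
    edges = []
    for i in range(len(actions[0])):
        x = [row[i] for row in actions]
        x_mean = sum(x) / n
        x_var = sum((v - x_mean) ** 2 for v in x)
        if x_var == 0:
            continue  # intervention never varied, so no slope can be estimated
        for j in range(len(deltas[0])):
            y = [row[j] for row in deltas]
            y_mean = sum(y) / n
            cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
            slope = cov / x_var  # ordinary least squares slope of y on x
            if abs(slope) > threshold:
                edges.append((i, j, slope))
    return edges


# Synthetic data from a made-up ground truth: intervention 0 raises state
# variable 0, intervention 1 lowers state variable 1, intervention 2 does nothing.
rng = random.Random(0)
true_effects = [[1.0, 0.0], [0.0, -0.8], [0.0, 0.0]]
actions, deltas = [], []
for _ in range(2000):
    a = [rng.randint(0, 1) for _ in range(3)]
    d = [sum(a[i] * true_effects[i][j] for i in range(3)) + rng.gauss(0, 0.3)
         for j in range(2)]
    actions.append(a)
    deltas.append(d)

edges = estimate_dependency_edges(actions, deltas)
print(sorted((i, j) for i, j, _ in edges))
```

On this synthetic data the two true dependencies should be recovered and the inert intervention should get no edges. The resulting edge list is the kind of structure that could then be loaded into a graph library such as NetworkX and passed to the model.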

graph-rl-drawing-2

We then embed this structure into a graph neural network-based Q-learning architecture. Rather than evaluating every possible intervention combination, the model incrementally constructs an action by estimating the marginal value of individual interventions conditioned on the current state and prior selections.

graph-rl-drawing-3

We repeat this process until a choice has been made for every intervention, which concludes decision-making for one timestep.
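
The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes a fixed intervention order and greedy selection, and `toy_marginal_q` is a hand-written stand-in for the learned graph neural network, with a hypothetical signature.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: (state, partial_action, intervention, option) -> value
MarginalQ = Callable[[Sequence[float], Dict[int, int], int, int], float]


def build_action(
    state: Sequence[float],
    options_per_intervention: List[int],
    marginal_q: MarginalQ,
) -> Tuple[Dict[int, int], int]:
    """Greedily pick one option per intervention, conditioning on earlier picks."""
    partial: Dict[int, int] = {}
    evaluations = 0
    for intervention, n_options in enumerate(options_per_intervention):
        scores = []
        for option in range(n_options):
            scores.append(marginal_q(state, partial, intervention, option))
            evaluations += 1
        # Commit to the best-scoring option before moving to the next intervention.
        partial[intervention] = max(range(n_options), key=scores.__getitem__)
    return partial, evaluations


def toy_marginal_q(state, partial, intervention, option):
    """Toy stand-in for the learned model (assumption, for illustration only)."""
    # Base preference: the option index closest to this intervention's state value.
    score = -abs(option - state[intervention])
    # Interaction: penalize repeating the option chosen for the previous intervention.
    if partial.get(intervention - 1) == option:
        score -= 1.5
    return score


state = [2.0, 2.0, 0.0]
action, evals = build_action(state, [4, 3, 5], toy_marginal_q)
print(action, evals)  # {0: 2, 1: 1, 2: 0} 12
```

In this toy run the second intervention avoids option 2 because the first intervention already chose it, which shows a later pick depending on an earlier one. Only 4 + 3 + 5 = 12 value estimates are needed, compared with 4 * 3 * 5 = 60 joint combinations.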

This reframing avoids exhaustive evaluation of combinations and lets each partial decision incorporate structured information about system dynamics. To support this, I built a full experimental pipeline covering synthetic data generation, model training, and controlled experimentation to evaluate performance against the standard approach.

Presentation

Department presentation of this project: