REACT: Revealing Evolutionary Action Consequence Trajectories

REACT optimizes interpretable demonstrations for trained reinforcement learning policies by evolving trajectory encodings against surrogate fitness terms (global diversity, local diversity, and certainty). The figure above summarizes the training, evolutionary optimization, and evaluation pipeline used throughout the paper.
REACT Demonstrations
FlatGrid11 (PPO trained for 35k steps)
*(Image grid not included: FlatGrid11 demonstrations for REACT, Fidelity, Random, and Train.)*
HoleyGrid11 (PPO trained for 150k steps)
*(Image grid not included: HoleyGrid11 demonstrations for REACT, Fidelity, Random, and Train, including a HoleyGrid REACT heatmap.)*
Continuous Robot Control
*(Image grid not included: demonstrations and trajectory plots for SAC policies trained for 50k, 100k, and 150k steps, comparing REACT, Fidelity, Random, and Train.)*
For further evaluation results regarding the resulting demonstration fidelity, the reward optimality gap, and the influence of the individual fitness components, please refer to the full paper.
Reproduce Results
Setup
Clone this repository and run `pip install -e .` to install the project in editable mode.
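For example, a fresh setup could look like the following (the clone URL below is a placeholder for this repository's URL):

```bash
# Clone the repository (placeholder URL) and install it in editable mode.
git clone <this-repository-url> react
cd react
pip install -e .
```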
Parameters
| env_name | steps | trainseed | model | pop_size | iterations | encoding_length |
|---|---|---|---|---|---|---|
| FlatGrid11 | 35000 | 42 | PPO | 10 | 40 | 6 |
| HoleyGrid11 | 150000 | 33 | PPO | 10 | 40 | 6 |
| FetchReach | 50000 | 42 | SAC | 10 | 40 | 9 |
| FetchReach | 100000 | 42 | SAC | 10 | 40 | 9 |
| FetchReach | 150000 | 42 | SAC | 10 | 40 | 9 |
Train the evaluated policy
```bash
react train --env-name {{env_name}} --name train --model {{model}} --steps {{steps}} --env-seed {{trainseed}}
```
Models and videos are saved to `experiments/model/<env_name>`.
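For example, substituting the FlatGrid11 row of the parameter table into this template gives:

```bash
react train --env-name FlatGrid11 --name train --model ppo --steps 35000 --env-seed 42
```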
Evaluate the resulting policy
```bash
react run --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --seed {{seed}}
```
Videos are saved to `experiments/videos`.
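For instance, to evaluate the FlatGrid11 policy trained above with the first evaluation seed:

```bash
react run --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --seed 42
```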
Optimize REACT demonstrations
```bash
react evo --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --name {{env_name}}-0 --seed {{seed}} --pop-size {{pop_size}} --iterations {{iterations}} --encoding-length {{encoding_length}} --plot-frequency {{plot_frequency}} --is-elitist --crossover 0.75 --mutation 0.5
```
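For example, the FlatGrid11 configuration from the parameter table (with a plot frequency of 5, as in the CLI reference below) becomes:

```bash
react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --name FlatGrid11-0 --seed 42 \
  --pop-size 10 --iterations 40 --encoding-length 6 --plot-frequency 5 \
  --is-elitist --crossover 0.75 --mutation 0.5
```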
To view the resulting demonstrations run:
```bash
react plot --env-name {{env_name}} --exp-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --render
```
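For the FlatGrid11 run above, this becomes:

```bash
react plot --env-name FlatGrid11 --exp-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --render
```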
To compare the results with random search run:
```bash
react baseline1 --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --pop-size {{pop_size}} --iterations 1 --encoding-length {{encoding_length}} --name FlatGrid11 --seed {{trainseed}} --plot
```
Random Seeds
For evaluation, we used the following seeds: 42, 13, 24, 18, 46, 19, 28, 32, 91, 12.
To reproduce all seeds, run:
```bash
./scripts/{{env_name}}/run_train.sh
./scripts/{{env_name}}/run_react.sh
./scripts/{{env_name}}/run_ablations.sh
```
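These scripts are part of the repository; as a rough sketch only (not the scripts' actual contents), a seed sweep over the evaluation seeds listed above could look like this:

```bash
#!/usr/bin/env bash
# Illustrative sketch: one REACT optimization run per evaluation seed for FlatGrid11.
# The experiment naming scheme (FlatGrid11-<seed>) is an assumption, not the repository's convention.
for seed in 42 13 24 18 46 19 28 32 91 12; do
  react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 \
    --name "FlatGrid11-${seed}" --seed "${seed}" --pop-size 10 --iterations 40 \
    --encoding-length 6 --plot-frequency 0 --is-elitist --crossover 0.75 --mutation 0.5
done
```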
To reproduce the plots, use:
```bash
react plot --env-name FlatGrid11 --render
react plot --env-name HoleyGrid11 --render
react plot --env-name Fetch50k --render
react plot --env-name Fetch100k --render
react plot --env-name Fetch150k --render --training
```
CLI Reference
Training
```bash
react train \                # Train an RL policy.
  --env-name FlatGrid11 \    # Environment: FlatGrid11, HoleyGrid11, or FetchReach.
  --name train \             # Run/model name used for logs and saved artifacts.
  --model ppo \              # RL algorithm (e.g., ppo, sac).
  --steps 35000 \            # Number of training steps.
  --save-freq 5000 \         # Optional checkpoint interval (omit for no checkpoints).
  --env-seed 42 \            # Optional env layout seed (relevant for HoleyGrid).
  --render                   # Optional real-time environment rendering.
```
Running a trained model
```bash
react run \                  # Run inference with a trained policy.
  --env-name FlatGrid11 \    # Environment to evaluate.
  --saved-model train_ppo \  # Saved model name (e.g., train_ppo).
  --nr-episodes 5 \          # Number of evaluation episodes.
  --checkpoint 35000 \       # Optional checkpoint step to load.
  --env-seed 42              # Optional env layout seed for HoleyGrid.
```
Run REACT
```bash
react evo \                  # Run evolutionary optimization for REACT demonstrations.
  --env-name FlatGrid11 \    # Environment to optimize on.
  --saved-model train_ppo \  # Trained base policy.
  --checkpoint 35000 \       # Optional checkpoint step to load.
  --name FlatGrid11-0 \      # Experiment name for outputs and plots.
  --pop-size 10 \            # Population size.
  --iterations 40 \          # Evolution iterations (0 behaves like random search).
  --encoding-length 6 \      # Genotype/trajectory encoding length.
  --plot-frequency 5 \       # Plot update interval (0 disables plotting).
  --crossover 0.75 \         # Crossover probability.
  --mutation 0.5 \           # Mutation probability.
  --is-elitist \             # Keep strongest individuals between generations.
  --env-seed 42 \            # Optional env layout seed for HoleyGrid.
  --seed 42 \                # Random seed for the optimizer.
  --w1 1 --w2 1 --w3 1 --w4 1 \  # Fitness weights: global, local, certainty, min-distance.
  --render                   # Optional rendering during optimization.
```
These weights (w1, w2, w3, w4) allow configuring the various ablation variants:
```bash
w1=1 w2=1 w3=1 w4=1   # REACT:   full objective.
w1=1 w2=0 w3=0 w4=0   # REACT_G: global diversity only.
w1=0 w2=1 w3=1 w4=1   # REACT_D: local distance-focused variant.
w1=1 w2=1 w3=1 w4=0   # REACT_P: simple sum without distance term.
w1=0 w2=1 w3=0 w4=0   # REACT_L: local diversity only.
w1=0 w2=0 w3=1 w4=0   # REACT_C: certainty only.
w1=0 w2=0 w3=0 w4=0   # REACT_F: fidelity-only baseline.
```
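As a minimal sketch (not part of the repository), the ablation configurations above could be swept for FlatGrid11 like this; the names appended to --name are purely illustrative:

```bash
#!/usr/bin/env bash
# Illustrative sketch: one react evo run per ablation weight configuration.
declare -A ablations=(
  [REACT]="1 1 1 1"   [REACT_G]="1 0 0 0" [REACT_D]="0 1 1 1" [REACT_P]="1 1 1 0"
  [REACT_L]="0 1 0 0" [REACT_C]="0 0 1 0" [REACT_F]="0 0 0 0"
)
for name in "${!ablations[@]}"; do
  read -r w1 w2 w3 w4 <<< "${ablations[$name]}"
  react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 \
    --name "FlatGrid11-${name}" --seed 42 --pop-size 10 --iterations 40 --encoding-length 6 \
    --plot-frequency 0 --is-elitist --crossover 0.75 --mutation 0.5 \
    --w1 "$w1" --w2 "$w2" --w3 "$w3" --w4 "$w4"
done
```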
Plot results
```bash
react plot \                 # Generate paper-style plots and optional rollouts.
  --env-name FlatGrid11 \    # Experiment id: FlatGrid11, HoleyGrid11, Fetch50k/100k/150k.
  --render                   # Optional trajectory rendering (REACT, REACT_F, Random).
```
Acknowledgements
This work was partially funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems, and is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus. An earlier version of this work was presented at the International Conference on Evolutionary Computation Theory and Applications (ECTA 2024) [1]. This work extends our conference paper with a thorough hyperparameter analysis, ablation studies investigating the impact of partial- and fidelity-based rewards, and a more robust assessment of the generated trajectories in terms of their optimality gap and demonstration fidelity.
[1] Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz, Claudia Linnhoff-Popien, and Thomas Gabor, "REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning", in Proceedings of the 16th International Joint Conference on Computational Intelligence, IJCCI '24, pp. 127-138, 2024.
Citation
When using this repository, please cite it as:
```bibtex
@inproceedings{altmann2024react,
  title     = {REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning},
  author    = {Philipp Altmann and Céline Davignon and Maximilian Zorn and Fabian Ritz and Claudia Linnhoff-Popien and Thomas Gabor},
  booktitle = {Proceedings of the 16th International Joint Conference on Computational Intelligence},
  series    = {IJCCI '24},
  year      = {2024},
  pages     = {127--138},
  publisher = {SciTePress},
  location  = {Porto, Portugal},
  doi       = {10.5220/0013005900003837},
}
```