Philipp Altmann

Surrogate Fitness Metrics for Interpretable Reinforcement Learning

Implementations and video demonstrations accompanying research on revealing evolutionary action consequence trajectories (REACT) for interpretable reinforcement learning.

Revealing Evolutionary Action Consequence Trajectories

REACT

REACT optimizes interpretable demonstrations for trained reinforcement learning policies by evolving trajectory encodings against surrogate fitness terms (global diversity, local diversity, and certainty). The figure above summarizes the training, evolutionary optimization, and evaluation pipeline used throughout the paper.

REACT Demonstrations

FlatGrid11 (PPO trained for 35k steps)

[FlatGrid11 demonstration videos and heatmaps for the REACT, Fidelity, Random, and Train policies (no heatmap for Train)]

HoleyGrid11 (PPO trained for 150k steps)

[HoleyGrid11 demonstration videos and heatmaps for the REACT, Fidelity, Random, and Train policies (no heatmap for Train)]

Continuous Robot Control

[FetchReach training videos, demonstration videos (REACT, Fidelity, Random, Train), and trajectory plots for SAC policies trained for 50k, 100k, and 150k steps]

For further evaluation results regarding the resulting demonstration fidelity, the reward optimality gap, and comparisons of the different fitness terms and their influence, please refer to the full paper.

Reproduce Results

Setup

Clone this repository and run pip install -e . to install this project in editable mode.
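
A minimal setup sketch (the repository URL and directory name below are placeholders, not specified on this page):

git clone <repository-url>        # replace with this repository's URL
cd <repository-directory>         # the cloned folder
pip install -e .                  # installs the react CLI in editable mode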

Parameters

env_name      steps    trainseed  model  pop_size  iterations  encoding_length
FlatGrid11    35000    42         PPO    10        40          6
HoleyGrid11   150000   33         PPO    10        40          6
FetchReach    50000    42         SAC    10        40          9
FetchReach    100000   42         SAC    10        40          9
FetchReach    150000   42         SAC    10        40          9

Train the evaluated policy

react train --env-name {{env_name}} --name train --model {{model}} --steps {{steps}} --env-seed {{trainseed}}

Models and videos are saved to experiments/model/<env_name>.
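
As an illustration, the FlatGrid11 row of the parameter table above would translate to the following call (model name written lowercase, as in the CLI reference below):

react train --env-name FlatGrid11 --name train --model ppo --steps 35000 --env-seed 42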

Evaluate the resulting policy

react run --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --seed {{seed}} 

Videos are saved to experiments/videos.
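
For example, the FlatGrid11 policy trained above could be evaluated with one of the evaluation seeds listed below (an illustrative invocation):

react run --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --seed 42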

Optimize REACT demonstrations

react evo --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --name {{env_name}}-0 --seed {{seed}} --pop-size {{pop_size}} --iterations {{iterations}} --encoding-length {{encoding_length}} --plot-frequency {{plot_frequency}} --is-elitist --crossover 0.75 --mutation 0.5
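
Filled in with the FlatGrid11 parameters from the table above and the plot frequency used in the CLI reference, an illustrative call would be:

react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --name FlatGrid11-0 --seed 42 --pop-size 10 --iterations 40 --encoding-length 6 --plot-frequency 5 --is-elitist --crossover 0.75 --mutation 0.5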

To view the resulting demonstrations run:

react plot --env-name {{env_name}} --exp-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --render
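
For the FlatGrid11 experiment this would be, for instance:

react plot --env-name FlatGrid11 --exp-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --render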

To compare the results with random search run:

react baseline1 --env-name {{env_name}} --saved-model train1_{{model}} --checkpoint {{steps}} --pop-size {{pop_size}} --iterations 1 --encoding-length {{encoding_length}} --name FlatGrid11 --seed {{trainseed}} --plot

Random Seeds

For evaluation, we used the following seeds: 42, 13, 24, 18, 46, 19, 28, 32, 91, 12.

To reproduce the results for all seeds, run:

./scripts/{{env_name}}/run_train.sh
./scripts/{{env_name}}/run_react.sh
./scripts/{{env_name}}/run_ablations.sh

To reproduce the plots, use:

react plot --env-name FlatGrid11 --render 
react plot --env-name HoleyGrid11 --render 
react plot --env-name Fetch50k --render 
react plot --env-name Fetch100k --render 
react plot --env-name Fetch150k --render --training

CLI Reference

Training

react train \                                  # Train an RL policy.
  --env-name FlatGrid11 \                      # Environment: FlatGrid11, HoleyGrid11, or FetchReach.
  --name train \                               # Run/model name used for logs and saved artifacts.
  --model ppo \                                # RL algorithm (e.g., ppo, sac).
  --steps 35000 \                              # Number of training steps.
  --save-freq 5000 \                           # Optional checkpoint interval (omit for no checkpoints).
  --env-seed 42 \                              # Optional env layout seed (relevant for HoleyGrid).
  --render                                     # Optional real-time environment rendering.

Running a trained model

react run \                                    # Run inference with a trained policy.
  --env-name FlatGrid11 \                      # Environment to evaluate.
  --saved-model train_ppo \                    # Saved model name (e.g., train_ppo).
  --nr-episodes 5 \                            # Number of evaluation episodes.
  --checkpoint 35000 \                         # Optional checkpoint step to load.
  --env-seed 42                                # Optional env layout seed for HoleyGrid.

Run REACT

react evo \                                    # Run evolutionary optimization for REACT demonstrations.
  --env-name FlatGrid11 \                      # Environment to optimize on.
  --saved-model train_ppo \                    # Trained base policy.
  --checkpoint 35000 \                         # Optional checkpoint step to load.
  --name FlatGrid11-0 \                        # Experiment name for outputs and plots.
  --pop-size 10 \                              # Population size.
  --iterations 40 \                            # Evolution iterations (0 behaves like random search).
  --encoding-length 6 \                        # Genotype/trajectory encoding length.
  --plot-frequency 5 \                         # Plot update interval (0 disables plotting).
  --crossover 0.75 \                           # Crossover probability.
  --mutation 0.5 \                             # Mutation probability.
  --is-elitist \                               # Keep strongest individuals between generations.
  --env-seed 42 \                              # Optional env layout seed for HoleyGrid.
  --seed 42 \                                  # Random seed for the optimizer.
  --w1 1 --w2 1 --w3 1 --w4 1 \                # Fitness weights: global, local, certainty, min-distance.
  --render                                     # Optional rendering during optimization.

The weights (w1, w2, w3, w4) configure the ablation variants evaluated in the paper; an example invocation follows the list below:

w1=1 w2=1 w3=1 w4=1            # REACT: full objective.
w1=1 w2=0 w3=0 w4=0            # REACT_G: global diversity only.
w1=0 w2=1 w3=1 w4=1            # REACT_D: local distance-focused variant.
w1=1 w2=1 w3=1 w4=0            # REACT_P: simple sum without distance term.
w1=0 w2=1 w3=0 w4=0            # REACT_L: local diversity only.
w1=0 w2=0 w3=1 w4=0            # REACT_C: certainty only.
w1=0 w2=0 w3=0 w4=0            # REACT_F: fidelity-only baseline.
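
For instance, the REACT_G ablation (global diversity only) could be run by overriding the default weights; the experiment name below is illustrative:

react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --name FlatGrid11-G --seed 42 --pop-size 10 --iterations 40 --encoding-length 6 --is-elitist --crossover 0.75 --mutation 0.5 --w1 1 --w2 0 --w3 0 --w4 0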

Plot results

react plot \                                   # Generate paper-style plots and optional rollouts.
  --env-name FlatGrid11 \                      # Experiment id: FlatGrid11, HoleyGrid11, Fetch50k/100k/150k.
  --render                                     # Optional trajectory rendering (REACT, REACT_F, Random).

Acknowledgements

This work was partially funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems, and is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus. An earlier version of this work was presented at the International Conference on Evolutionary Computation Theory and Applications (ECTA 2024) [1]. This work extends our conference paper with a thorough hyperparameter analysis, ablation studies investigating the impact of partial- and fidelity-based rewards, and a more robust assessment of the generated trajectories in terms of their optimality gap and demonstration fidelity.

[1] Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz, Claudia Linnhoff-Popien, and Thomas Gabor, "REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning", in Proceedings of the 16th International Joint Conference on Computational Intelligence, IJCCI '24, pp. 127-138, 2024.

Citation

When using this repository you can cite it as:

@inproceedings{altmann2024react,
  title = {REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning},
  author = {Philipp Altmann and Céline Davignon and Maximilian Zorn and Fabian Ritz and Claudia Linnhoff-Popien and Thomas Gabor},
  booktitle = {Proceedings of the 16th International Joint Conference on Computational Intelligence},
  series = {IJCCI '24},
  year = {2024},
  pages = {127--138},
  publisher = {SciTePress},
  location = {Porto, Portugal},
  doi = {10.5220/0013005900003837},
}