Philipp Altmann

Surrogate Fitness Metrics for Interpretable Reinforcement Learning

Implementations and video demonstrations accompanying research on revealing evolutionary action consequence trajectories (REACT) for interpretable reinforcement learning.

Revealing Evolutionary Action Consequence Trajectories

REACT

REACT optimizes interpretable demonstrations for trained reinforcement learning policies by evolving trajectory encodings against surrogate fitness terms (global diversity, local diversity, and certainty). The figure above summarizes the training, evolutionary optimization, and evaluation pipeline used throughout the paper.

REACT Demonstrations

FlatGrid11 (PPO trained for 35k steps)

[FlatGrid11 demonstration videos and heatmaps for the REACT, Fidelity, Random, and Train policies (no heatmap for Train)]

HoleyGrid11 (PPO trained for 150k steps)

[HoleyGrid11 demonstration videos and heatmaps for the REACT, Fidelity, Random, and Train policies (no heatmap for Train)]

Continuous Robot Control

[FetchReach training videos, demonstration videos (REACT, Fidelity, Random, Train), and trajectory plots for SAC policies trained for 50k, 100k, and 150k steps]

For further evaluation results regarding the resulting demonstration fidelity, the reward optimality gap, and comparisons of the different fitness terms and their influence, please refer to the full paper.

Reproduce Results

Setup

Clone this repository and run pip install -e . to install this project in editable mode.
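
A minimal setup sketch (the repository URL and directory name below are placeholders, not specified on this page):

git clone <repository-url>        # replace with this repository's URL
cd <repository-directory>         # the cloned folder
pip install -e .                  # installs the react CLI in editable mode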

Parameters

env_name      steps    trainseed  model  pop_size  iterations  encoding_length
FlatGrid11    35000    42         PPO    10        40          6
HoleyGrid11   150000   33         PPO    10        40          6
FetchReach    50000    42         SAC    10        40          9
FetchReach    100000   42         SAC    10        40          9
FetchReach    150000   42         SAC    10        40          9

Train the evaluated policy

react train --env-name {{env_name}} --name train --model {{model}} --steps {{steps}} --env-seed {{trainseed}}

Models and videos are saved to experiments/model/<env_name>.
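
As an illustration, the FlatGrid11 row of the parameter table above would translate to the following call (model name written lowercase, as in the CLI reference below):

react train --env-name FlatGrid11 --name train --model ppo --steps 35000 --env-seed 42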

Evaluate the resulting policy

react run --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --seed {{seed}} 

Videos are saved to experiments/videos.
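
For example, the FlatGrid11 policy trained above could be evaluated with one of the evaluation seeds listed below (an illustrative invocation):

react run --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --seed 42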

Optimize REACT demonstrations

react evo --env-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --name {{env_name}}-0 --seed {{seed}} --pop-size {{pop_size}} --iterations {{iterations}} --encoding-length {{encoding_length}} --plot-frequency {{plot_frequency}} --is-elitist --crossover 0.75 --mutation 0.5
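
Filled in with the FlatGrid11 parameters from the table above and the plot frequency used in the CLI reference, an illustrative call would be:

react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --name FlatGrid11-0 --seed 42 --pop-size 10 --iterations 40 --encoding-length 6 --plot-frequency 5 --is-elitist --crossover 0.75 --mutation 0.5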

To view the resulting demonstrations run:

react plot --env-name {{env_name}} --exp-name {{env_name}} --saved-model train_{{model}} --checkpoint {{steps}} --render
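
For the FlatGrid11 experiment this would be, for instance:

react plot --env-name FlatGrid11 --exp-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --render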

To compare the results with random search run:

react baseline1 --env-name {{env_name}} --saved-model train1_{{model}} --checkpoint {{steps}} --pop-size {{pop_size}} --iterations 1 --encoding-length {{encoding_length}} --name FlatGrid11 --seed {{trainseed}} --plot

Random Seeds

For evaluation, we used the following seeds: 42, 13, 24, 18, 46, 19, 28, 32, 91, 12.

To reproduce the results for all seeds, run:

./scripts/{{env_name}}/run_train.sh
./scripts/{{env_name}}/run_react.sh
./scripts/{{env_name}}/run_ablations.sh

To reproduce the plots, use:

react plot --env-name FlatGrid11 --render 
react plot --env-name HoleyGrid11 --render 
react plot --env-name Fetch50k --render 
react plot --env-name Fetch100k --render 
react plot --env-name Fetch150k --render --training

CLI Reference

Training

react train \                                  # Train an RL policy.
  --env-name FlatGrid11 \                      # Environment: FlatGrid11, HoleyGrid11, or FetchReach.
  --name train \                               # Run/model name used for logs and saved artifacts.
  --model ppo \                                # RL algorithm (e.g., ppo, sac).
  --steps 35000 \                              # Number of training steps.
  --save-freq 5000 \                           # Optional checkpoint interval (omit for no checkpoints).
  --env-seed 42 \                              # Optional env layout seed (relevant for HoleyGrid).
  --render                                     # Optional real-time environment rendering.

Running a trained model

react run \                                    # Run inference with a trained policy.
  --env-name FlatGrid11 \                      # Environment to evaluate.
  --saved-model train_ppo \                    # Saved model name (e.g., train_ppo).
  --nr-episodes 5 \                            # Number of evaluation episodes.
  --checkpoint 35000 \                         # Optional checkpoint step to load.
  --env-seed 42                                # Optional env layout seed for HoleyGrid.

Run REACT

react evo \                                    # Run evolutionary optimization for REACT demonstrations.
  --env-name FlatGrid11 \                      # Environment to optimize on.
  --saved-model train_ppo \                    # Trained base policy.
  --checkpoint 35000 \                         # Optional checkpoint step to load.
  --name FlatGrid11-0 \                        # Experiment name for outputs and plots.
  --pop-size 10 \                              # Population size.
  --iterations 40 \                            # Evolution iterations (0 behaves like random search).
  --encoding-length 6 \                        # Genotype/trajectory encoding length.
  --plot-frequency 5 \                         # Plot update interval (0 disables plotting).
  --crossover 0.75 \                           # Crossover probability.
  --mutation 0.5 \                             # Mutation probability.
  --is-elitist \                               # Keep strongest individuals between generations.
  --env-seed 42 \                              # Optional env layout seed for HoleyGrid.
  --seed 42 \                                  # Random seed for the optimizer.
  --w1 1 --w2 1 --w3 1 --w4 1 \                # Fitness weights: global, local, certainty, min-distance.
  --render                                     # Optional rendering during optimization.

The weights (w1, w2, w3, w4) configure the ablation variants evaluated in the paper; an example invocation follows the list below:

w1=1 w2=1 w3=1 w4=1            # REACT: full objective.
w1=1 w2=0 w3=0 w4=0            # REACT_G: global diversity only.
w1=0 w2=1 w3=1 w4=1            # REACT_D: local distance-focused variant.
w1=1 w2=1 w3=1 w4=0            # REACT_P: simple sum without distance term.
w1=0 w2=1 w3=0 w4=0            # REACT_L: local diversity only.
w1=0 w2=0 w3=1 w4=0            # REACT_C: certainty only.
w1=0 w2=0 w3=0 w4=0            # REACT_F: fidelity-only baseline.
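
For instance, the REACT_G ablation (global diversity only) could be run by overriding the default weights; the experiment name below is illustrative:

react evo --env-name FlatGrid11 --saved-model train_ppo --checkpoint 35000 --name FlatGrid11-G --seed 42 --pop-size 10 --iterations 40 --encoding-length 6 --is-elitist --crossover 0.75 --mutation 0.5 --w1 1 --w2 0 --w3 0 --w4 0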

Plot results

react plot \                                   # Generate paper-style plots and optional rollouts.
  --env-name FlatGrid11 \                      # Experiment id: FlatGrid11, HoleyGrid11, Fetch50k/100k/150k.
  --render                                     # Optional trajectory rendering (REACT, REACT_F, Random).

Acknowledgements

This work was partially funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems, and is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus. An earlier version of this work was presented at the International Conference on Evolutionary Computation Theory and Applications (ECTA 2024) [1]. This work extends our conference paper with a thorough hyperparameter analysis, ablation studies investigating the impact of partial- and fidelity-based rewards, and a more robust assessment of the generated trajectories in terms of their optimality gap and demonstration fidelity.

[1] Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz, Claudia Linnhoff-Popien, and Thomas Gabor, "REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning", in Proceedings of the 16th International Joint Conference on Computational Intelligence, IJCCI '24, pp. 127-138, 2024.

Citation

When using this repository you can cite it as:

@inproceedings{altmann2024react,
  title = {REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning},
  author = {Philipp Altmann and Céline Davignon and Maximilian Zorn and Fabian Ritz and Claudia Linnhoff-Popien and Thomas Gabor},
  booktitle = {Proceedings of the 16th International Joint Conference on Computational Intelligence},
  series = {IJCCI '24},
  year = {2024},
  pages = {127--138},
  publisher = {SciTePress},
  location = {Porto, Portugal},
  doi = {10.5220/0013005900003837},
}