TowerTamers Project Status

Project Summary

TowerTamers is a reinforcement learning (RL) project focused on training an agent to navigate the Obstacle Tower environment, a 3D procedurally generated tower with increasing difficulty. We implement a custom Proximal Policy Optimization (PPO) algorithm enhanced with frame stacking and reward shaping to improve exploration and movement, alongside a Stable-Baselines3 PPO baseline for comparison. Our goal is to enable the agent to climb floors effectively, adapting to the environment’s challenges using both tailored and off-the-shelf RL techniques.

Approach

Our implementation uses Proximal Policy Optimization (PPO), a policy gradient method that balances stability and sample efficiency. The algorithm consists of several key components:

Core Algorithm
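At its core, PPO limits each policy update by clipping the probability ratio between the new and old policies and taking the more pessimistic of the clipped and unclipped objectives. The snippet below is a minimal PyTorch sketch of that clipped surrogate loss; the coefficient values and tensor names are illustrative and are not taken from src/train.py.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped PPO surrogate loss (illustrative coefficients, not the project's exact values)."""
    # Probability ratio between the updated policy and the behavior policy.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: keep the pessimistic (minimum) term.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value-function regression toward empirical returns.
    value_loss = torch.nn.functional.mse_loss(values, returns)

    # Entropy bonus encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```

Taking the elementwise minimum keeps each update conservative whenever the ratio drifts outside [1 − ε, 1 + ε], which is what gives PPO its stability relative to plain policy gradients.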

Custom Implementation Details (src/train.py)
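The frame stacking mentioned in the summary can be pictured as a thin wrapper that keeps the last k observations and concatenates them along the channel axis so the policy can infer motion. The wrapper below is a hypothetical sketch under the classic gym API (reset returns only the observation); it is not the actual code from src/train.py.

```python
import collections

import gym
import numpy as np

class FrameStack(gym.Wrapper):
    """Keep the last `k` observations and return them stacked along the channel axis.

    Hypothetical sketch; the real src/train.py implementation may differ.
    """

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = collections.deque(maxlen=k)
        low = np.concatenate([env.observation_space.low] * k, axis=-1)
        high = np.concatenate([env.observation_space.high] * k, axis=-1)
        self.observation_space = gym.spaces.Box(
            low=low, high=high, dtype=env.observation_space.dtype)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Fill the buffer with the first frame so the stack is always full.
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames, axis=-1)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames, axis=-1), reward, done, info
```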

Network Architecture

Stable-Baselines3 Version (src/train2.py)
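The baseline wraps Obstacle Tower in its gym interface and hands it to Stable-Baselines3's PPO. The snippet below is a minimal sketch of that setup, not a copy of src/train2.py: the binary path, step count, and save name are placeholders, and MlpPolicy stands in for the current policy class since the switch to CnnPolicy is still listed under Planned Improvements.

```python
from obstacle_tower_env import ObstacleTowerEnv
from stable_baselines3 import PPO

# Retro mode exposes a single image observation and a flattened discrete action space.
# The path to the Obstacle Tower binary is a placeholder.
env = ObstacleTowerEnv("./ObstacleTower/obstacletower", retro=True, realtime_mode=False)

model = PPO("MlpPolicy", env, verbose=1)  # CnnPolicy is the planned switch
model.learn(total_timesteps=10_000)       # placeholder step budget
model.save("ppo_obstacle_tower")

env.close()
```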

Evaluation

We evaluated both implementations with the following results:

Custom PPO (src/train.py)

Figure 1: Episode rewards for custom PPO over ~10k steps, plateauing at 1.0 or 0.0.

Figure 2: Screenshot of the custom PPO agent, which often jumps without making progress.

Stable-Baselines3 PPO (src/train2.py)

Figure 3: Episode rewards for Stable-Baselines3 PPO, showing similarly low performance.

Remaining Goals and Challenges

Planned Improvements

  1. Reward Function Refinement
    • Add distance-based rewards for climbing (see the shaping sketch after this list)
    • Explore additional reward shaping strategies
  2. Implementation Updates
    • Switch to “CnnPolicy” in Stable-Baselines3
    • Extend training duration to 100k+ steps
    • Implement comprehensive success metrics
  3. Evaluation Expansion
    • Quantify success rate (floors climbed)
    • Compare custom vs. Stable-Baselines3 performance
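
As referenced in item 1, one concrete shaping strategy for climbing is to pay a bonus whenever the agent reaches a new floor. The wrapper below is a hypothetical sketch: the bonus size and the current_floor info key are assumptions, not confirmed details of the TowerTamers code.

```python
import gym

class FloorBonusWrapper(gym.Wrapper):
    """Add a shaping bonus each time the agent reaches a new floor.

    Hypothetical sketch: the bonus size and the `current_floor` info key
    are assumptions, not confirmed parts of the project.
    """

    def __init__(self, env, floor_bonus=1.0):
        super().__init__(env)
        self.floor_bonus = floor_bonus
        self.highest_floor = 0

    def reset(self, **kwargs):
        self.highest_floor = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        floor = info.get("current_floor", self.highest_floor)
        if floor > self.highest_floor:
            # Reward the first visit to each new floor on top of the env reward.
            reward += self.floor_bonus * (floor - self.highest_floor)
            self.highest_floor = floor
        return obs, reward, done, info
```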

Current Challenges

Technical Issues

Learning Performance

Time Constraints

Resources