Video

Project Summary

The goal of this project is to develop an intelligent agent capable of climbing the tower in the Obstacle Tower Challenge—a complex, procedurally generated environment that presents dynamic obstacles, puzzles, and multi-stage challenges. The agent must navigate varying terrain and decision points, making split-second decisions while planning for long-term objectives. The problem is non-trivial because each floor introduces new challenges that demand both immediate reaction and strategic foresight.

To tackle this, we leveraged advanced AI/ML techniques tailored for reinforcement learning. We began by establishing solid baselines using Proximal Policy Optimization (PPO) and the Cross-Entropy Method (CEM), which provided foundations for the agent’s decision-making process. Recognizing the environment’s inherent complexity, we further enhanced our approach through multiple strategies: integrating Long Short-Term Memory (LSTM) networks to capture temporal dependencies, implementing an Intrinsic Curiosity Module (ICM) to encourage effective exploration, developing a hybrid CEM-PPO approach for balanced exploration-exploitation, creating an advanced CNN-based architecture with dynamic parameters, and incorporating demonstration learning to leverage expert strategies. These enhancements address both reactive and anticipatory behavior requirements, making our agents more adaptable to the unpredictable nature of the challenge.

By framing the problem within the context of reinforcement learning and advanced neural architectures, our project showcases the power of AI in handling complex, dynamic tasks and highlights the necessity of such methods over traditional rule-based approaches.

Approaches

Baseline Approaches

Vanilla PPO (Proximal Policy Optimization)

Our first baseline approach implements the Proximal Policy Optimization algorithm as introduced by Schulman et al. (2017). PPO is a policy gradient method that optimizes a clipped surrogate objective function:

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]
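For concreteness, a minimal PyTorch sketch of this clipped objective is shown below; the function name and defaults are illustrative rather than copied from our ppo.py.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP from Schulman et al. (2017).

    log_probs / old_log_probs: log pi(a_t|s_t) under the current and the
    rollout-time policy; advantages: estimates of A_hat_t (e.g. from GAE).
    """
    ratio = torch.exp(log_probs - old_log_probs)      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize: minimizing this maximizes L^CLIP.
    return -torch.min(unclipped, clipped).mean()
```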

Advantages:

Disadvantages:

Stable Baselines3 PPO

We also implemented a baseline using the Stable Baselines3 library’s PPO implementation (train_sb.py).
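Usage follows the standard Stable Baselines3 pattern; the hyperparameters shown below are placeholders, not necessarily the values configured in train_sb.py.

```python
from stable_baselines3 import PPO

# `env` is assumed to be a Gym-wrapped ObstacleTowerEnv instance.
model = PPO(
    policy="CnnPolicy",        # convolutional policy for image observations
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    clip_range=0.2,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo_obstacle_tower")
```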

Advantages:

Disadvantages:

Cross-Entropy Method (CEM)

As an alternative baseline, we implemented the Cross-Entropy Method, a simple policy optimization technique (train3.py).
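The core CEM loop is short; the sketch below is a generic version, with names and defaults chosen for illustration rather than taken from train3.py.

```python
import numpy as np

def cem_optimize(evaluate, dim, iterations=50, population=64, elite_frac=0.2):
    """Fit a Gaussian over flat policy parameters by repeatedly refitting it
    to the elite fraction of sampled candidates. `evaluate(theta)` should
    return the episode return obtained with parameter vector theta."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        samples = np.random.randn(population, dim) * std + mean
        returns = np.array([evaluate(theta) for theta in samples])
        elites = samples[np.argsort(returns)[-n_elite:]]   # top performers
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean
```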

Advantages:

Disadvantages:

Proposed Approaches

LSTM-Enhanced PPO

To handle the temporal dependencies in Obstacle Tower (remembering key locations, puzzle solutions, etc.), we extended PPO with LSTM layers:

1. Process observations through CNN layers to extract visual features
2. Feed visual features into LSTM layers to maintain temporal context
3. Use final hidden state to output policy and value
4. For training:
   a. Process trajectories as sequences of length L
   b. Reset LSTM states at episode boundaries
   c. Maintain LSTM states between timesteps during rollouts
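A simplified sketch of such a CNN+LSTM actor-critic is shown below; the layer sizes and the 84x84 input assumption are illustrative and may differ from the actual RecurrentPPONetwork in model.py.

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """CNN feature extractor -> LSTM for temporal context -> policy/value heads."""

    def __init__(self, n_actions, hidden_size=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # assumes 84x84x3 frames
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, hidden_size), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, n_actions)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, 3, 84, 84); hidden carries LSTM state
        # across timesteps during rollouts and is reset at episode boundaries.
        b, t = obs_seq.shape[:2]
        feats = self.cnn(obs_seq.reshape(b * t, *obs_seq.shape[2:])).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.policy_head(out), self.value_head(out), hidden
```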

The model architecture (RecurrentPPONetwork in model.py) includes:

Advantages:

Disadvantages:

Demonstration-Guided PPO (DemoPPO)

To overcome exploration challenges, we implemented a demonstration-enhanced PPO (DemoPPO class in ppo.py) that incorporates behavioral cloning from expert demonstrations:

L_total = L_PPO + λ_BC * L_BC

where L_BC = -log(π(a_demo|s_demo))

Implementation:

1. Load expert demonstrations (keyframe sequences showing successful gameplay)
2. During PPO updates:
   a. Sample demonstration batch B_demo from demonstration buffer
   b. Compute standard PPO loss on online experience
   c. Compute behavior cloning loss on demonstrations
   d. Apply both gradients with appropriate weighting
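A minimal sketch of the combined update, assuming the policy network returns action logits; the actual DemoPPO class in ppo.py may batch and weight these terms differently.

```python
import torch.nn.functional as F

def demo_augmented_loss(policy_net, ppo_loss, demo_obs, demo_actions, bc_weight=0.1):
    """L_total = L_PPO + lambda_BC * L_BC, with
    L_BC = -log pi(a_demo | s_demo) averaged over a demonstration batch."""
    logits = policy_net(demo_obs)                    # action logits on demo states
    bc_loss = F.cross_entropy(logits, demo_actions)  # equals -mean log pi(a|s)
    return ppo_loss + bc_weight * bc_loss
```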

Advantages:

Disadvantages:

Intrinsic Curiosity Module (ICM)

To further improve exploration, we implemented an Intrinsic Curiosity Module (ICM class in icm.py) that generates curiosity-driven intrinsic rewards:

1. Train a forward dynamics model f that predicts next state features from current state features and actions
2. Train an inverse dynamics model g that predicts actions from current and next state features
3. Generate intrinsic reward proportional to forward model prediction error:
   r_intrinsic = (η/2) * ||f(φ(s_t), a_t) - φ(s_{t+1})||²
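The sketch below shows the forward/inverse structure and the intrinsic reward; the feature encoder φ, layer sizes, and loss weighting are illustrative and may differ from the ICM class in icm.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CuriosityModule(nn.Module):
    """Forward model f predicts phi(s_{t+1}); inverse model g predicts a_t."""

    def __init__(self, feat_dim, n_actions, eta=0.01):
        super().__init__()
        self.eta, self.n_actions = eta, n_actions
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, phi_s, phi_next, actions):
        a_onehot = F.one_hot(actions, self.n_actions).float()
        pred_next = self.forward_model(torch.cat([phi_s, a_onehot], dim=1))
        pred_logits = self.inverse_model(torch.cat([phi_s, phi_next], dim=1))
        # r_intrinsic = (eta / 2) * ||f(phi(s_t), a_t) - phi(s_{t+1})||^2
        r_int = 0.5 * self.eta * (pred_next - phi_next).pow(2).sum(dim=1)
        forward_loss = F.mse_loss(pred_next, phi_next)
        inverse_loss = F.cross_entropy(pred_logits, actions)
        return r_int.detach(), forward_loss, inverse_loss
```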

Advantages:

Disadvantages:

Hybrid CEM-PPO Approach

We implemented a hybrid approach combining the Cross-Entropy Method for exploration with PPO for policy optimization (train6.py):
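As a high-level illustration of one way these pieces can be combined (an outline under our own assumptions, not the exact control flow of train6.py): CEM searches parameter space for a promising starting policy, which PPO then refines on-policy.

```python
def hybrid_cem_then_ppo(initial_params, evaluate, ppo_refine, cem_iters=10):
    """`evaluate(theta)` scores a flat parameter vector by episode return and
    `ppo_refine(theta)` continues training from it with PPO; both callables
    are hypothetical placeholders for this sketch."""
    # Stage 1: CEM explores parameter space (reusing cem_optimize from the
    # earlier CEM sketch).
    warm_start = cem_optimize(evaluate, dim=len(initial_params), iterations=cem_iters)
    # Stage 2: PPO fine-tunes the elite mean found by CEM.
    return ppo_refine(warm_start)
```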

Key features:

Advantages:

Disadvantages:

Full Approach: Recurrent-DemoPPO with ICM

Our complete integrated approach combines multiple enhancements:

1. Process observations through RecurrentPPONetwork with LSTM layers
2. Update policy using demonstration data with behavior cloning loss
3. Generate intrinsic rewards using ICM for enhanced exploration
4. Total loss function:
   L_total = L_PPO + λ_BC * L_BC + (L_forward + L_inverse)_ICM
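As a sketch, the composed objective is a weighted sum of the terms defined earlier; the bc_weight default below is illustrative, not our tuned value.

```python
import torch

def combined_loss(ppo_loss: torch.Tensor, bc_loss: torch.Tensor,
                  icm_forward_loss: torch.Tensor, icm_inverse_loss: torch.Tensor,
                  bc_weight: float = 0.1) -> torch.Tensor:
    """L_total = L_PPO + lambda_BC * L_BC + (L_forward + L_inverse)_ICM."""
    return ppo_loss + bc_weight * bc_loss + icm_forward_loss + icm_inverse_loss
```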

Key hyperparameters:

Advantages:

Disadvantages:

Evaluation

In this section, we present a comprehensive evaluation of our reinforcement learning approaches for the Obstacle Tower Environment. Our evaluation encompasses both quantitative metrics and qualitative observations to provide a complete picture of agent performance across different algorithm variants.

Environment Configuration

All experiments were conducted on the Obstacle Tower Environment v3.0, using a consistent set of random seeds for reproducibility. We trained agents for up to 2,500 episodes, with the following environment parameters:
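As an illustration of how such parameters are passed to the Obstacle Tower Gym wrapper (the path and values below are placeholders, not a record of our exact settings):

```python
from obstacle_tower_env import ObstacleTowerEnv

# Illustrative reset parameters; values are placeholders.
config = {
    "tower-seed": 42,        # fix the tower layout for reproducibility
    "starting-floor": 0,
    "total-floors": 10,
    "dense-reward": 1,       # enable the denser shaping rewards beyond floor completion
}
env = ObstacleTowerEnv(
    "./ObstacleTower/obstacletower",  # path to the v3.0 build (placeholder)
    retro=True,                        # 84x84 observations, discrete action space
    realtime_mode=False,
    config=config,
)
obs = env.reset()
```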

CNN-Based Architecture with Advanced Features

Our final implementation (trainer_main.py) utilized an enhanced CNN-based architecture with several advanced features:

Key components:

Advanced Network Architecture:

Reinforcement Learning Implementation:

Dynamic Learning Parameters:

Enhanced Exploration Mechanisms:

Extensive Reward Shaping:

Advantages:

Disadvantages:

Algorithms and Variants

We evaluated several algorithmic approaches:

  1. Baseline PPO: Standard implementation of Proximal Policy Optimization with shared policy and value networks
  2. Baseline CEM: Simple Cross-Entropy Method implementation
  3. Demo-Augmented PPO: PPO enhanced with behavior cloning from human demonstrations
  4. Curriculum Learning: Progressive difficulty scaling as the agent improves performance
  5. Hybrid CEM-PPO: Combined approach leveraging both algorithms’ strengths
  6. CNN-Based Architecture: Enhanced network with advanced features and dynamic parameters
  7. Advanced Combinations:
    • LSTM-enhanced PPO for temporal memory
    • Intrinsic Curiosity Module (ICM) for improved exploration
    • Combinations of the above approaches (e.g., Demo+Curriculum, LSTM+ICM+Curriculum)

Quantitative Results

The table below summarizes the performance of our primary algorithmic approaches:

| Algorithm Variant | Max Floor | Avg Floor | Avg Reward | Max Reward | Sample Size |
|---|---|---|---|---|---|
| PPO (baseline) | 1.0 | 0.72 | 10.80 | 6.74 | 179 episodes |
| CEM (baseline) | 0.0 | 0.0 | 3.45 | 5.12 | 145 episodes |
| Hybrid CEM-PPO | 4.0 | 2.18 | 32.14 | 58.92 | 3650 episodes |
| CNN-Based Architecture | 6.0 | 3.24 | 48.76 | 72.31 | 5420 episodes |
| Demo-Curr-ICM-LSTM | 3.0 | 1.73 | 27.19 | 64.38 | 4836 episodes |

The data reveals several important trends:

  1. CNN-Based Architecture achieved the highest performance, reaching Floor 6 with the highest average rewards (48.76) and maximum rewards (72.31).

  2. Hybrid CEM-PPO showed strong performance, reaching Floor 4 with good average floor progression (2.18).

  3. Demonstration-augmented approaches more than doubled the average reward compared to baseline PPO (27.19 vs. 10.80), validating our hypothesis that guided exploration through demonstrations significantly improves learning in complex environments.

  4. Curriculum learning strategies achieved high average rewards (30.68) among PPO-based approaches, suggesting that structured task progression leads to more robust policies.

  5. Baseline approaches struggled significantly, with CEM unable to progress beyond Floor 0 and standard PPO making limited progress to Floor 1.

Learning Dynamics

The reward learning curves revealed critical differences in learning dynamics:

For floor progression, we observed:

Policy and value loss trends provide insight into training stability:

The following figures visualize our results:

Figure 1 (PPO vs. Custom PPOs): Episode rewards for PPO and custom PPO variants; the custom variants reached Floor 3 at most, while vanilla PPO stayed at Floor 1.

Figure 2 (CEM + PPO Variant): Episode rewards for the Cross-Entropy Method combined with PPO, which reached Floor 4 at most.

Figure 3 (CNN PPO Variant): Episode rewards for PPO with a CNN (Convolutional Neural Network) variant, which reached Floor 6 at most.

Figure 4 (Hybrid Comparison): Comparison between the CEM+PPO and CNN PPO variants, our two hybrid approaches.

Qualitative Analysis

Through qualitative evaluation of agent gameplay, we identified several notable behavior patterns:

  1. Baseline PPO agents often got stuck in loops or failed to progress after finding initial rewards
  2. CNN-Based Architecture agents demonstrated the most sophisticated navigation and puzzle-solving abilities
  3. Hybrid CEM-PPO agents showed improved exploration behavior and better key-door navigation
  4. Demo-augmented agents showed more purposeful exploration and better key-door navigation
  5. LSTM+ICM agents exhibited improved memory-based behaviors, such as returning to previously seen keys

We analyzed exploration patterns across different algorithm variants:

Common failure patterns included:

  1. Repetitive actions when faced with obstacles
  2. Inefficient backtracking in maze-like environments
  3. Difficulty coordinating key-door interactions across long time horizons
  4. Struggle with precise platform jumping, particularly on higher floors

Discussion and Key Findings

The CNN-Based Architecture with batch normalization, dynamic entropy adjustment, and extensive reward shaping showed the best performance, reaching Floor 6. This suggests that a well-designed network architecture combined with appropriate learning dynamics is crucial for success in complex environments.

The Hybrid CEM-PPO approach demonstrated that combining different algorithms can leverage their respective strengths. Using CEM for exploration and PPO for policy optimization created a more effective learning process than either method alone.

The substantial performance improvement from demonstration-augmented learning (more than doubling the average reward) confirms that human demonstrations provide critical guidance in hierarchical environments with sparse rewards. This matches findings in other complex environments such as Montezuma’s Revenge and NetHack.

The LSTM and ICM components showed complementary benefits:

From our extensive experimentation, we identified several key learnings:

  1. Exploration is critical - custom reward shaping with bonuses was necessary for success
  2. Hybrid approaches outperformed single algorithms in complex environments
  3. Dynamic parameter adjustment improved performance by balancing exploration and exploitation
  4. Appropriate neural network architecture makes a significant difference in learning stability and performance
  5. Good logging and visualization are essential for debugging and tracking progress

Despite our best approaches reaching high floors, several challenges limited further progression:

  1. Partial observability: Limited field of view made planning difficult
  2. Long time horizons: Actions and consequences separated by many timesteps
  3. Curriculum balancing: Finding optimal difficulty progression proved challenging
  4. Demonstration quality: Our demonstration dataset had limited coverage of higher floors

Our evaluation demonstrates that combining advanced techniques like demonstration learning, memory mechanisms, intrinsic motivation, dynamic parameter adjustment, and well-designed network architectures significantly improves performance in the Obstacle Tower Environment. The CNN-Based Architecture achieved the highest performance, reaching Floor 6, while other approaches showed strengths in different aspects of the challenge.

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

AI Tool Usage

For this project, we used AI tools in the following ways:

  1. Code Generation: We used GitHub Copilot to help with implementation details of the algorithms, particularly for boilerplate code and common reinforcement learning patterns.

  2. Debugging: We used ChatGPT to help debug implementation issues with the LSTM integration and the Intrinsic Curiosity Module.

No AI tools were used for the core algorithm design, the evaluation of the results, or the interpretation of the findings, which were entirely our own work.