SimBa
Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Preprint

Hojoon Lee1,2*, Dongyoon Hwang1*, Donghu Kim1,
Hyunseung Kim1, Jun Jet Tai2,3, Kaushik Subramanian2, Peter R. Wurman2,
Jaegul Choo1, Peter Stone2,4, Takuma Seno2

1 KAIST  2 Sony AI  3 Coventry University  4 UT Austin 

*Equal contribution

TL;DR

Stop worrying about algorithms: just change the network architecture to SimBa.

Overview. SimBa infuses simplicity bias through architectural changes, without modifying the underlying deep RL algorithm.
(a) SimBa enhances sample efficiency: sample efficiency improves across various RL algorithms, including off-policy model-free (SAC), off-policy model-based (TD-MPC2), on-policy model-free (PPO), and unsupervised (METRA) RL methods. (b) Off-policy RL benchmark: when applied to SAC, SimBa matches or surpasses state-of-the-art off-policy RL methods with minimal computational overhead across 51 continuous control tasks, by only modifying the network architecture and scaling up the number of network parameters.

Abstract

We introduce SimBa, an architecture designed to inject simplicity bias for scaling up parameters in deep RL. SimBa consists of three components: (i) standardizing input observations with running statistics, (ii) incorporating residual feedforward blocks to provide a linear pathway from the input to the output, and (iii) applying layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms—including off-policy, on-policy, and unsupervised methods—is consistently improved. Moreover, when SimBa is integrated into SAC, it matches or surpasses state-of-the-art deep RL methods with high computational efficiency across 51 tasks from DMC, MyoSuite, and HumanoidBench, solely by modifying the network architecture. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.

SimBa Architecture

SimBa comprises three components: Running Statistics Normalization, Residual Feedforward Blocks, and Post-Layer Normalization. These components lower the network's functional complexity, enhancing generalization for highly overparameterized configurations.
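The sketch below shows one way these three components could be wired into an encoder for continuous-control observations. It is a minimal PyTorch illustration, not the authors' reference implementation: the module names (RunningNorm, SimbaBlock, SimbaEncoder) and hyperparameters such as hidden_dim, num_blocks, and the 4x expansion ratio are assumptions made for exposition.

# Minimal, illustrative sketch of a SimBa-style encoder (PyTorch).
# Names and hyperparameters are assumptions for exposition, not the
# official implementation.
import torch
import torch.nn as nn


class RunningNorm(nn.Module):
    """Standardize observations with running mean/variance statistics."""

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.register_buffer("mean", torch.zeros(dim))
        self.register_buffer("var", torch.ones(dim))
        self.register_buffer("count", torch.tensor(1e-4))

    @torch.no_grad()
    def update(self, x):
        # Combine batch statistics with the running statistics
        # (parallel mean/variance update), called on incoming observation batches.
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0, unbiased=False)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        new_var = (m_a + m_b + delta.pow(2) * self.count * batch_count / total) / total
        self.mean.copy_(new_mean)
        self.var.copy_(new_var)
        self.count.copy_(total)

    def forward(self, x):
        return (x - self.mean) / torch.sqrt(self.var + self.eps)


class SimbaBlock(nn.Module):
    """Residual feedforward block: the skip connection keeps a linear
    (identity) pathway from input to output."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.ReLU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))


class SimbaEncoder(nn.Module):
    """obs -> running-stats norm -> linear embed -> residual blocks -> post-LayerNorm."""

    def __init__(self, obs_dim, hidden_dim=128, num_blocks=2):
        super().__init__()
        self.obs_norm = RunningNorm(obs_dim)
        self.embed = nn.Linear(obs_dim, hidden_dim)
        self.blocks = nn.Sequential(*[SimbaBlock(hidden_dim) for _ in range(num_blocks)])
        self.post_norm = nn.LayerNorm(hidden_dim)

    def forward(self, obs):
        x = self.obs_norm(obs)
        x = self.embed(x)
        x = self.blocks(x)
        return self.post_norm(x)


# Example usage with hypothetical dimensions: the encoder output would feed
# the policy or value head of whatever RL algorithm is being used.
# encoder = SimbaEncoder(obs_dim=67, hidden_dim=128, num_blocks=2)
# features = encoder(torch.randn(32, 67))

Scaling up parameters then amounts to increasing hidden_dim or num_blocks, without touching the RL algorithm itself.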

SimBa with Off-Policy RL

We evaluate SAC + SimBa on 51 continuous control tasks across three domains: DMC, MyoSuite, and HumanoidBench.

SimBa with On-Policy RL

Comparison of PPO with and without SimBa on Craftax.

PPO

PPO + SimBa

SimBa with Unsupervised RL

Comparison of METRA with and without SimBa on the Humanoid task in DMC.

METRA

METRA + SimBa

Paper

SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
Hojoon Lee*, Dongyoon Hwang*, Donghu Kim,
Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman,
Jaegul Choo, Peter Stone, Takuma Seno


arXiv preprint

Citation

If you find our work useful, please consider citing the paper as follows:

@article{lee2024simba,
  title={SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning},
  author={Hojoon Lee and Dongyoon Hwang and Donghu Kim and Hyunseung Kim and Jun Jet Tai and Kaushik Subramanian and Peter R. Wurman and Jaegul Choo and Peter Stone and Takuma Seno},
  journal={arXiv preprint arXiv:2410.09754},
  year={2024}
}