SimbaV2
Hyperspherical Normalization for Scalable Deep Reinforcement Learning

Under Review

Hojoon Lee1$\dagger$, Youngdo Lee1$\dagger$, Takuma Seno2, Donghu Kim1
Peter Stone2,3, Jaegul Choo1

1 KAIST  2 Sony AI  3 UT Austin 

$\dagger$: Equal contribution

Overview. SimbaV2 outperforms other RL algorithms, and its performance scales as compute increases. The numbers below each dot indicate the update-to-data (UTD) ratio. SimbaV2, with UTD=1, achieves a performance of $0.848$, surpassing TD-MPC2 ($0.749$), the most computationally intensive version of Simba ($0.818$), and BRO ($0.807$). The results show normalized returns averaged over $57$ continuous control tasks from MuJoCo, DMControl, MyoSuite, and HumanoidBench, each trained on $1$ million samples.

TL;DR

Stop worrying about algorithms, just change the network architecture to SimbaV2

Abstract

Scaling up model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms with hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using soft actor-critic as the base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on $57$ continuous control tasks across $4$ domains.

SimbaV2 Architecture

We present SimbaV2, a novel RL architecture designed to stabilize non-stationary optimization by constraining weight, feature, and gradient norms. Building on the Simba architecture, which uses pre-layernorm residual blocks and weight decay to control the weight and feature norm growth, SimbaV2 introduces two modifications:
  • Hyperspherical Normalization. We replace all layer normalization with hyperspherical normalization (i.e., $\ell_2$-normalization) and project the weights onto the unit-norm hypersphere after each gradient update. These changes ensure consistent effective learning rates across layers and eliminate the need for tuning weight regularization (a minimal sketch of the projection step follows this list).

  • Distributional Value Estimation with Reward Scaling. To address unstable gradient norms from varying reward scales, we integrate a distributional critic and reward scaling to ensure the unit variance of target values for both actor and critic.
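The post-update weight projection from the first modification can be sketched in a few lines. The following is a minimal PyTorch illustration, not the released implementation: the choice to $\ell_2$-normalize each row of every linear weight matrix is an assumption, and `critic` / `optimizer` in the usage comment are placeholder names. The distributional critic and reward scaling from the second modification are not shown.

import torch

@torch.no_grad()
def project_weights_to_hypersphere(module: torch.nn.Module, eps: float = 1e-8) -> None:
    """Re-project every linear weight onto the unit hypersphere (row-wise l2-normalization)."""
    for m in module.modules():
        if isinstance(m, torch.nn.Linear):
            m.weight.div_(m.weight.norm(dim=1, keepdim=True).clamp_min(eps))

# Usage inside a training step (assumed structure):
#   loss.backward()
#   optimizer.step()
#   project_weights_to_hypersphere(critic)  # keep weights on the unit-norm hypersphere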

Concretely, SimbaV2 processes each input observation through the following components:

  • (a) RSNorm: Given an input observation $\boldsymbol{o}_t \in \mathbb{R}^{| \mathcal{O} |}$ and its running statistics $(\boldsymbol{\mu}_t, \boldsymbol{\sigma}_t^2)$, the observation is normalized as: $$ \boldsymbol{\bar{o}}_t = \text{RSNorm}(\boldsymbol{o}_t) = \frac{\boldsymbol{o}_t - \boldsymbol{\mu}_t}{\sqrt{\boldsymbol{\sigma}_t^2 + \epsilon}}. $$
  • (b) Shift & $\ell_2$-Norm: To retain magnitude information, we embed $\boldsymbol{\bar{o}}_t$ into an $(\vert \mathcal{O} \vert + 1)$-dimensional vector by concatenating a positive constant $c_\text{shift} > 0$, then apply $\ell_2$-normalization: $$ \widetilde{\boldsymbol{o}}_t =\ell_2\text{-Norm} (\bigl[\bar{\boldsymbol{o}}_t;\,c_{\text{shift}}\bigr]). $$
  • (c) Linear + Scaler: We then embed $\boldsymbol{\tilde{o}}_t$ using a linear layer $\boldsymbol{W}^0_h \in \mathbb{R}^{(|\mathcal{O}|+1) \times d_h}$ and a scaling vector $\boldsymbol{s}_h^0 \in \mathbb{R}^{d_h}$, and project back to the hypersphere: $$ \boldsymbol{h}_t^0 = \ell_2 \text{-Norm} (\boldsymbol{s}_h^0 \odot (\boldsymbol{W}_h^0 \; \mathrm{Norm} (\boldsymbol{\tilde{o}}_t))). $$
  • (d) MLP Block + $\ell_2$-Norm: Starting from the initial hyperspherical embedding $\boldsymbol{h}_t^0$, we apply $L$ consecutive blocks of non-linear transformations: $$ \boldsymbol{\tilde{h}}_t^l = \ell_2\text{-Norm} (\boldsymbol{W}_{h,2}^l \,\mathrm{ReLU}\bigl( (\boldsymbol{W}_{h,1}^l \,\boldsymbol{h}_t^l) \odot \boldsymbol{s}^l_h \bigr)). $$
  • (e) LERP + $\ell_2$-Norm: Each $l$-th block transforms $\boldsymbol{h}_t^l$ into $\boldsymbol{h}_t^{l+1}$ by linearly interpolating between the original input $\boldsymbol{h}_t^l$ and its non-linearly transformed output $\boldsymbol{\tilde{h}}_t^l$ with a learnable interpolation vector $\boldsymbol{\alpha}^l$: $$ \boldsymbol{h}_t^{l+1} = \ell_2\text{-Norm}((\boldsymbol{1} - \boldsymbol{\alpha}^l) \odot \boldsymbol{h}_t^l + \boldsymbol{\alpha}^l \odot \boldsymbol{\tilde{h}}_t^l). $$ A minimal end-to-end sketch of steps (a)-(e) is given below.
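The sketch below pieces together steps (a)-(e) in PyTorch. It is illustrative rather than the authors' code: the hidden width `d_h`, the number of blocks, the `4 * d_h` bottleneck width, the value of `c_shift`, and the streaming update of the RSNorm statistics are all assumptions chosen for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

def l2_norm(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Project vectors onto the unit hypersphere along the last dimension."""
    return x / x.norm(dim=-1, keepdim=True).clamp_min(eps)

class SimbaV2EncoderSketch(nn.Module):
    """Illustrative encoder following steps (a)-(e); widths and constants are placeholders."""

    def __init__(self, obs_dim: int, d_h: int = 256, num_blocks: int = 2,
                 c_shift: float = 3.0, eps: float = 1e-8):
        super().__init__()
        self.c_shift, self.eps = c_shift, eps
        # (a) RSNorm running statistics (simple streaming mean / variance estimate).
        self.register_buffer("mu", torch.zeros(obs_dim))
        self.register_buffer("var", torch.ones(obs_dim))
        self.register_buffer("count", torch.tensor(1e-4))
        # (c) Linear + Scaler.
        self.w0 = nn.Linear(obs_dim + 1, d_h, bias=False)
        self.s0 = nn.Parameter(torch.ones(d_h))
        # (d)-(e) MLP blocks with per-block scalers and interpolation vectors.
        self.w1 = nn.ModuleList(nn.Linear(d_h, 4 * d_h, bias=False) for _ in range(num_blocks))
        self.w2 = nn.ModuleList(nn.Linear(4 * d_h, d_h, bias=False) for _ in range(num_blocks))
        self.s = nn.ParameterList(nn.Parameter(torch.ones(4 * d_h)) for _ in range(num_blocks))
        self.alpha = nn.ParameterList(nn.Parameter(torch.full((d_h,), 0.5)) for _ in range(num_blocks))

    def _rsnorm(self, o: torch.Tensor) -> torch.Tensor:
        # (a) Normalize observations with running mean and variance.
        if self.training:
            with torch.no_grad():
                n = o.shape[0]
                batch_mu, batch_var = o.mean(0), o.var(0, unbiased=False)
                delta, tot = batch_mu - self.mu, self.count + n
                self.var = (self.count * self.var + n * batch_var
                            + delta ** 2 * self.count * n / tot) / tot
                self.mu = self.mu + delta * n / tot
                self.count = tot
        return (o - self.mu) / torch.sqrt(self.var + self.eps)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        o_bar = self._rsnorm(o)
        shift = torch.full_like(o_bar[..., :1], self.c_shift)
        o_tilde = l2_norm(torch.cat([o_bar, shift], dim=-1))     # (b) Shift & l2-Norm
        h = l2_norm(self.s0 * self.w0(o_tilde))                  # (c) Linear + Scaler
        for w1, w2, s, a in zip(self.w1, self.w2, self.s, self.alpha):
            h_tilde = l2_norm(w2(F.relu(w1(h) * s)))             # (d) MLP block + l2-Norm
            h = l2_norm((1 - a) * h + a * h_tilde)               # (e) LERP + l2-Norm
        return h

# Example: enc = SimbaV2EncoderSketch(obs_dim=17); z = enc(torch.randn(32, 17))
# Each row of z has unit l2-norm, i.e., the embedding lives on the hypersphere.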

Scaling Network Size & UTD Ratio

In this work, we compare the scalability of SimbaV2 with that of Simba. We scale the number of model parameters by increasing the width of the critic network, and scale compute by increasing the update-to-data (UTD) ratio with and without periodic resets. For the empirical analysis, we define two challenging benchmark subsets: DMC-Hard ($7$ tasks involving the $\texttt{dog}$ and $\texttt{humanoid}$ embodiments) and HBench-Hard ($5$ tasks: $\texttt{run}$, $\texttt{balance-simple}$, $\texttt{sit-hard}$, $\texttt{stair}$, $\texttt{walk}$). On both benchmarks, SimbaV2 benefits from both increased model size and a higher UTD ratio, while Simba plateaus beyond a certain point. Notably, SimbaV2 scales smoothly with the UTD ratio even without resets; using resets slightly degrades its performance.
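For concreteness, the sketch below shows where the UTD ratio and periodic resets enter a generic off-policy training loop. `env`, `agent`, and `buffer` are hypothetical objects with the interfaces used here; they are not part of any released code.

def train(env, agent, buffer, total_steps, utd_ratio=1, reset_interval=None):
    """Generic off-policy loop: `utd_ratio` gradient updates per environment step."""
    obs = env.reset()
    for step in range(total_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        # Scale compute: more gradient updates per collected transition.
        for _ in range(utd_ratio):
            agent.update(buffer.sample())

        # Optional periodic reset of network parameters (helpful for Simba at high UTD;
        # SimbaV2 is reported to scale without it).
        if reset_interval is not None and (step + 1) % reset_interval == 0:
            agent.reset_parameters()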

Empirical Analysis: Training Stability

We track the average return and $4$ additional metrics during training to understand the learning dynamics of SimbaV2 and Simba:

  • (a) Average normalized return across tasks

  • (b) Weighted sum of $\ell_2$-norms of all intermediate features in critics

  • (c) Weighted sum of $\ell_2$-norms of all critic parameters

  • (d) Weighted sum of $\ell_2$-norms of all gradients in critics

  • (e) Effective learning rate (ELR) of the critics

In both environments, SimbaV2 maintains stable norms and a stable ELR, while Simba exhibits divergent fluctuations. A sketch of how these norm diagnostics can be computed is given below.
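The sketch below illustrates one way to log metrics (c)-(e) after a backward pass. The parameter-count weighting and the effective-learning-rate definition (size of a plain-SGD step relative to the weight norm) are assumptions for illustration, and the feature norms in (b) would additionally require forward hooks on the critic's layers.

import torch

def critic_norm_diagnostics(critic: torch.nn.Module, lr: float) -> dict:
    """Parameter-count-weighted average l2-norms of a critic's weights and gradients."""
    param_norm = grad_norm = 0.0
    total = 0
    for p in critic.parameters():
        if p.grad is None:
            continue
        n = p.numel()
        param_norm += p.detach().norm().item() * n
        grad_norm += p.grad.norm().item() * n
        total += n
    param_norm /= max(total, 1)
    grad_norm /= max(total, 1)
    return {
        "param_norm": param_norm,
        "grad_norm": grad_norm,
        # Effective learning rate: how large a plain-SGD step would be
        # relative to the current weight norm.
        "effective_lr": lr * grad_norm / (param_norm + 1e-8),
    }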

Benchmark Summary

SimbaV2, with an update-to-data (UTD) ratio of $2$, outperforms state-of-the-art RL algorithms on diverse continuous control benchmarks while using a single fixed set of hyperparameters for all domains. SimbaV2 delivers competitive performance in both online and offline RL while requiring significantly less training computation and offering faster inference times.

SimbaV2 with Online RL

We evaluate SimbaV2 (UTD=2) on $57$ control tasks across $4$ task domains: MuJoCo, DMControl, MyoSuite, and HumanoidBench.



Paper

SimbaV2: Hyperspherical Normalization for Scalable Deep Reinforcement Learning
Hojoon Lee*, Youngdo Lee*, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo

arXiv preprint

View on arXiv

Citation

If you find our work useful, please consider citing the paper as follows:

@article{lee2025simbav2,
  title={Hyperspherical Normalization for Scalable Deep Reinforcement Learning},
  author={Hojoon Lee and Youngdo Lee and Takuma Seno and Donghu Kim and Peter Stone and Jaegul Choo},
  journal={arXiv preprint arXiv:2502.15280},
  year={2025},
}