Abstract
The state occupancy measure (SOM) and successor state measure (SSM) are important theoretical tools in reinforcement learning that represent the distribution over future states. However, while these tools see extensive use in theory and in theoretically motivated algorithms, they have seen little use in practical settings because existing algorithms for learning the SOM and SSM are high-variance or unstable in practice. To address this, we explore using diffusion models as a representation for the SSM. We find that enforcing the Bellman flow constraints on a diffusion model leads to a temporal-difference update on the predicted noise, analogous to the standard TD-learning update on the predicted reward. As a result, our method combines the expressive power of a diffusion model with a variance comparable to that of TD-learning. To demonstrate the method's practicality, we propose a simple reinforcement learning algorithm based on regularizing the learned SSM. We test the proposed method on an array of offline RL problems and find that it has the highest average performance of all methods in the literature, as well as achieving state-of-the-art performance on several environments.
| Original language | English |
|---|---|
| Publication status | Published - 17 Sept 2025 |
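The abstract's central idea, a TD-style bootstrap applied to a diffusion model's noise prediction rather than to a scalar reward, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the linear "noise predictor", the toy transition `s_next = 0.5 * s`, the `(1 - gamma)`/`gamma` mixing of the immediate noise target with a frozen target network's prediction at the next state, and all names are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, gamma, lr = 4, 0.9, 0.1
W = np.zeros((dim, 2 * dim))      # online noise-prediction weights (hypothetical linear model)
W_target = W.copy()               # frozen target network, as in deep TD-learning

def feats(s, x):
    # Joint features of the conditioning state s and the noised sample x.
    return np.concatenate([s, x])

def eps_pred(weights, s, x):
    # Predicted noise for (s, x) under the given weights.
    return weights @ feats(s, x)

for step in range(200):
    s = rng.normal(size=dim)      # current state
    s_next = 0.5 * s              # toy deterministic transition (illustrative)
    x = rng.normal(size=dim)      # noised sample
    noise = rng.normal(size=dim)  # true injected noise at the current state

    # TD-style target on the predicted noise: immediate noise target plus a
    # discounted bootstrap from the frozen network at the next state -- the
    # analogue of r + gamma * V(s') in standard TD-learning.
    target = (1 - gamma) * noise + gamma * eps_pred(W_target, s_next, x)

    # Low-variance squared-error gradient step, as in reward-based TD.
    err = eps_pred(W, s, x) - target
    W -= lr * np.outer(err, feats(s, x))

    if step % 20 == 0:
        W_target = W.copy()       # periodic target-network sync

print(np.linalg.norm(err))        # final per-sample TD error magnitude
```

The design mirrors deep TD-learning: a frozen target copy stabilizes the bootstrap, and the only change from reward TD is that the regression target is a noise vector instead of a scalar return.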