Abstract
We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
| Originalsprache | Englisch |
|---|---|
| Seiten (von - bis) | 115-128 |
| Seitenumfang | 14 |
| Fachzeitschrift | The journal of artificial intelligence research |
| Jahrgang | 67.2020 |
| Ausgabenummer | 1 |
| DOIs | |
| Publikationsstatus | Veröffentlicht - 23 Jan. 2020 |
Bibliographische Notiz
Funding Information:This work has been supported by the Austrian Science Fund (FWF): I 3437-N33 in the framework of the CHIST-ERA ERA-NET (DELTA project). The author would like to thank Ronan Fruit, Alessandro Lazaric, and Matteo Pirotta for discussion as well as Sadegh Talebi, Chen-Yu Wei, and three anonymous reviewers for pointing out errors in a previous version of the paper.
Publisher Copyright:
© 2020 AI Access Foundation. All rights reserved.
Publikationen
- 3 Zitationen
- 1 Poster
-
Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Ortner, R., 26 Aug. 2021.Publikation: Konferenzbeitrag › Poster › Forschung › Begutachtung
Aktivitäten
- 1 Mündliche Präsentation
-
Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Ortner, R. (Redner)
26 Aug. 2021Aktivität: Gespräch oder Vortrag › Mündliche Präsentation
Dieses zitieren
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver