Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Research output: Contribution to journal › Article › Research › Peer-reviewed

3 citations (Scopus)


We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing-time parameter.
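The paper's algorithm is not reproduced in this abstract; the following is only a minimal sketch of the general "optimism in the face of uncertainty" principle it relies on, illustrated by a UCB-style action-selection rule on a hypothetical 2-state, 2-action ergodic MDP (all transition probabilities, rewards, and the bonus term here are illustrative assumptions, not the paper's construction).

```python
import math
import random

def optimistic_steps(T, seed=0):
    """Run an optimistic (UCB-style) action rule for T steps on a toy
    2-state, 2-action uniformly ergodic MDP. Illustrative only; this is
    NOT the algorithm analyzed in the paper."""
    rng = random.Random(seed)
    S, A = 2, 2
    # Hypothetical MDP: P[s][a] = probability of moving to state 1;
    # rewards are Bernoulli with mean R[s][a].
    P = [[0.3, 0.7], [0.6, 0.4]]
    R = [[0.2, 0.8], [0.9, 0.1]]
    n = [[0] * A for _ in range(S)]       # visit counts N(s, a)
    rsum = [[0.0] * A for _ in range(S)]  # cumulative reward per (s, a)
    s, total = 0, 0.0
    for t in range(1, T + 1):
        # Optimism: empirical mean plus a confidence bonus shrinking
        # like sqrt(log t / N(s, a)); unvisited pairs get an infinite
        # bonus, so every action in the current state is tried.
        def ucb(a):
            if n[s][a] == 0:
                return float("inf")
            return rsum[s][a] / n[s][a] + math.sqrt(2 * math.log(t) / n[s][a])
        a = max(range(A), key=ucb)
        r = 1.0 if rng.random() < R[s][a] else 0.0
        n[s][a] += 1
        rsum[s][a] += r
        total += r
        s = 1 if rng.random() < P[s][a] else 0
    return total, n

reward, counts = optimistic_steps(10_000)
```

Under uniform ergodicity every state is visited regardless of the policy, so the confidence bonuses shrink for all state-action pairs and the rule concentrates on the better action in each state; the paper's contribution is bounding how much reward such optimistic exploration forgoes, via concentration results for Markov chains.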

Pages (from-to): 115-128
Journal: The Journal of Artificial Intelligence Research
Publication status: Published - 23 Jan 2020

Bibliographic note

Funding Information:
This work has been supported by the Austrian Science Fund (FWF): I 3437-N33 in the framework of the CHIST-ERA ERA-NET (DELTA project). The author would like to thank Ronan Fruit, Alessandro Lazaric, and Matteo Pirotta for discussion as well as Sadegh Talebi, Chen-Yu Wei, and three anonymous reviewers for pointing out errors in a previous version of the paper.

Publisher Copyright:
© 2020 AI Access Foundation. All rights reserved.
