TY - JOUR
T1 - Online Regret Bounds for Satisficing in Markov Decision Processes
AU - Hajiabolhassan, Hossein
AU - Ortner, Ronald
PY - 2025/5/9
Y1 - 2025/5/9
N2 - We consider general reinforcement learning under the average reward criterion in Markov decision processes (MDPs), when the learner’s goal is not to learn an optimal policy, but accepts any policy whose average reward is above a given satisfaction level
. We show that with this more modest objective, it is possible to give algorithms that only have constant regret with respect to the level
, provided that there is a policy above this level. This is a generalization of known results from the bandit setting to MDPs. Further, we present a more general algorithm that achieves the best of both worlds: If the optimal policy has average reward above
, this algorithm has bounded regret with respect to
. On the other hand, if all policies are below
, then the expected regret with respect to the optimal policy is bounded as for the UCRL2 algorithm.
AB - We consider general reinforcement learning under the average reward criterion in Markov decision processes (MDPs), when the learner’s goal is not to learn an optimal policy, but accepts any policy whose average reward is above a given satisfaction level
. We show that with this more modest objective, it is possible to give algorithms that only have constant regret with respect to the level
, provided that there is a policy above this level. This is a generalization of known results from the bandit setting to MDPs. Further, we present a more general algorithm that achieves the best of both worlds: If the optimal policy has average reward above
, this algorithm has bounded regret with respect to
. On the other hand, if all policies are below
, then the expected regret with respect to the optimal policy is bounded as for the UCRL2 algorithm.
UR - https://pureadmin.unileoben.ac.at/portal/en/publications/online-regret-bounds-for-satisficing-in-markov-decision-processes(209c4ac5-b61e-4531-95e6-e242465aa762).html
U2 - 10.1287/moor.2023.0275
DO - 10.1287/moor.2023.0275
M3 - Article
SN - 0364-765X
VL - ??? Stand: 30. September 2025
JO - Mathematics of Operations Research
JF - Mathematics of Operations Research
ER -