Average Reward Reinforcement Learning


Most real-world problems have infinite-horizon average-reward objectives, yet this setting is far less well understood than the discounted one. The key reason is that the contraction property, which underlies the main results in the discounted setup, no longer holds. Our work aims to build the foundations of average-reward reinforcement learning. Further, many real-world problems require optimizing an objective that is non-linear in the cumulative rewards, e.g., fairness objectives in scheduling. Finally, most engineering problems come with constraints, e.g., reaching a goal safely, or average power constraints. We therefore aim to develop efficient algorithms that account for non-linear objectives and constraints.

Reinforcement learning algorithms such as DQN owe their success to the Markov Decision Process framework and to the fact that maximizing a sum of rewards admits backward induction, reducing the problem to the Bellman optimality equation. These properties fail to hold in the presence of non-linear objectives and/or constraints, which makes the analysis challenging. We have proposed multiple algorithms with provable guarantees for these problems. As shown alongside, for a network traffic optimization problem, the proposed Actor-Critic (IFRL AC) and Policy Gradient (IFRL PG) approaches significantly outperform standard approaches that do not explicitly account for the non-linearity. In the table below, we categorize our results by algorithm type and by whether they handle concave utility functions, constraints, or both. The model-based approaches use posterior or optimistic sampling; the model-free approaches are based on policy gradients or primal-dual methods. An X marks the cases for which we have results (detailed papers below); the remaining cells represent scenarios where we are working towards efficient results.

 

|  | Linear Utility | Concave Utility | Constraints | Concave Utility and Constraints |
|---|---|---|---|---|
| Model-Based | Prior Works | X | X | X |
| Model-Free Parametrized Setup | X |  | X |  |

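To make concrete why the average-reward setting needs different machinery, the sketch below runs relative value iteration (a classical algorithm, not specific to our work) on a hypothetical 2-state MDP whose numbers are purely illustrative. The average-reward Bellman equation h(s) + rho = max_a [r(s,a) + sum_s' P(s'|s,a) h(s')] only pins down the bias h up to a constant, so the iterates are re-centered at a reference state instead of relying on a discount-induced contraction.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],                 # r[s, a]
              [2.0, 0.5]])

def relative_value_iteration(P, r, iters=2000):
    """Solve h(s) + rho = max_a [r(s,a) + sum_s' P(s'|s,a) h(s')]
    by re-centering at a reference state (state 0) each sweep."""
    n_states = r.shape[0]
    h = np.zeros(n_states)
    rho = 0.0
    for _ in range(iters):
        # Bellman backup: q[s, a] = r[s, a] + sum_s' P[a, s, s'] h[s']
        q = r + np.tensordot(P, h, axes=1).T
        v = q.max(axis=1)
        rho = v[0]        # gain estimate at the reference state (h[0] == 0)
        h = v - rho       # re-center so iterates stay bounded (no contraction)
    return rho, h, q.argmax(axis=1)

rho, h, policy = relative_value_iteration(P, r)
# For this toy MDP the optimal gain is 18/11 ~ 1.636, achieved by
# taking action 1 in state 0 and action 0 in state 1.
```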

Model-Based Approach:

We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. For this, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the concave objective and the convex constraints admit a solution in the interior of the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite imprecise model knowledge and model stochasticity. We use a Bellman-error-based analysis for tabular infinite-horizon setups, which allows analyzing stochastic policies. Combining the Bellman-error-based analysis with the tightened optimization problem, we obtain an order-optimal regret guarantee on the objective. The proposed method can be applied to optimistic algorithms to obtain high-probability regret bounds, and to posterior sampling algorithms to obtain looser Bayesian regret bounds but with a significant improvement in computational complexity.
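As a concrete illustration of the tightening idea, the sketch below solves an occupation-measure program for a hypothetical 2-state constrained MDP, shrinking the cost budget C by a margin eps. A linear reward is used as a special case of the concave objective so the program reduces to a plain LP; the MDP numbers, C, and eps are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action constrained MDP (illustrative numbers only).
n_s, n_a = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s']
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [2.0, 0.5]])     # reward r[s, a]
c = np.array([[0.0, 1.0], [1.0, 0.2]])     # cost   c[s, a]
C, eps = 0.8, 0.05                         # cost budget and tightening margin

# Variable: occupation measure q[s, a], flattened. linprog minimizes, so negate r.
obj = -r.flatten()

# Flow balance: sum_a q(s', a) = sum_{s, a} q(s, a) P(s'|s, a) for every s'.
A_eq = np.zeros((n_s + 1, n_s * n_a))
for sp in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[sp, s * n_a + a] -= P[a, s, sp]
            if s == sp:
                A_eq[sp, s * n_a + a] += 1.0
A_eq[n_s, :] = 1.0                         # q must be a probability distribution
b_eq = np.zeros(n_s + 1)
b_eq[n_s] = 1.0

# Tightened cost constraint: <= C - eps rather than <= C, leaving a safety
# margin against model-estimation error and stochasticity.
res = linprog(obj, A_ub=c.flatten()[None, :], b_ub=[C - eps],
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n_s * n_a))
q = res.x.reshape(n_s, n_a)
policy = q / np.maximum(q.sum(axis=1, keepdims=True), 1e-12)  # stochastic policy
```

Recovering the policy by normalizing the occupation measure per state is what makes stochastic policies (and hence the Bellman-error-based analysis for them) natural in this formulation.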


Model-Free Algorithms: Parametrized Setup with Constraints:

We propose the first policy-gradient-based algorithm with general parameterization in the average-reward setup and establish a sublinear regret bound for the proposed algorithm. The results are extended to obtain both objective regret and constraint-violation bounds in the presence of constraints.
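To illustrate the quantity being optimized, the sketch below runs exact gradient ascent on the gain rho using the average-reward policy gradient theorem with a softmax (tabular) parameterization on a toy 2-state MDP. This is a deliberate simplification of the paper's setting, which uses general parameterization and sampled gradients; all MDP numbers are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],                 # r[s, a]
              [2.0, 0.5]])
n_s, n_a = r.shape

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)        # pi[s, a]

def gain_and_gradient(theta):
    """Exact average reward (gain) rho and its policy gradient."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,ast->st', pi, P)          # Markov chain under pi
    r_pi = (pi * r).sum(axis=1)
    # Stationary distribution d: d P_pi = d, sum(d) = 1.
    A = np.vstack([P_pi.T - np.eye(n_s), np.ones(n_s)])
    d = np.linalg.lstsq(A, np.append(np.zeros(n_s), 1.0), rcond=None)[0]
    rho = d @ r_pi
    # Bias h: (I - P_pi) h = r_pi - rho, pinned down by h[0] = 0.
    B = np.vstack([np.eye(n_s) - P_pi, np.eye(n_s)[0]])
    h = np.linalg.lstsq(B, np.append(r_pi - rho, 0.0), rcond=None)[0]
    Q = r - rho + np.tensordot(P, h, axes=1).T     # Q[s, a]
    V = (pi * Q).sum(axis=1)
    # Average-reward policy gradient theorem, softmax case.
    return rho, d[:, None] * pi * (Q - V[:, None])

theta = np.zeros((n_s, n_a))                       # start from the uniform policy
for _ in range(30000):
    rho, grad = gain_and_gradient(theta)
    theta += 1.0 * grad                            # exact gradient ascent on rho
```

Replacing the exact gradient with a trajectory-based estimate (and adding a dual variable for each constraint) is, at a high level, what the sampled, parameterized versions of such algorithms do.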


Tabular Model-Free Algorithms with Constraints:

Modifying Q-learning-based algorithms to handle constraints is an important problem. Average constraints make the optimal policy non-deterministic, so Q-learning is not directly applicable, while peak constraints need to be learned efficiently. In the following works, convergence/regret guarantees are shown for Q-learning-based algorithms with peak/average constraints.
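For peak constraints, one simple mechanism is to restrict the argmax and the exploration to the per-state feasible action set, as in the sketch below. This uses a discounted Q-learning proxy on an illustrative 2-state MDP purely to keep the example short; it is not the algorithm of the papers, and average constraints would instead require a primal-dual treatment with stochastic policies, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP with an instantaneous (peak) cost c[s, a];
# all numbers are illustrative only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s']
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [2.0, 0.5]])
c = np.array([[0.0, 1.0], [1.0, 0.2]])
C_PEAK = 0.5                               # peak constraint: c(s, a) <= C_PEAK

feasible = c <= C_PEAK                     # actions allowed in each state
n_s, n_a = r.shape

# Q-learning restricted to feasible actions (discounted proxy for brevity).
gamma, alpha, eps = 0.95, 0.1, 0.1
Q = np.zeros((n_s, n_a))
s = 0
for _ in range(20000):
    allowed = np.flatnonzero(feasible[s])
    if rng.random() < eps:                 # epsilon-greedy over feasible actions
        a = rng.choice(allowed)
    else:
        a = allowed[np.argmax(Q[s, allowed])]
    s_next = rng.choice(n_s, p=P[a, s])
    best_next = np.max(Q[s_next, feasible[s_next]])
    Q[s, a] += alpha * (r[s, a] + gamma * best_next - Q[s, a])
    s = s_next

# Greedy feasible policy: never selects an action violating the peak constraint.
greedy = [int(np.flatnonzero(feasible[si])[np.argmax(Q[si, feasible[si]])])
          for si in range(n_s)]
```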
