Discounted Reward Reinforcement Learning
Sample complexity guarantees for discounted MDPs in the model-free setting remain far from the known lower bounds in many setups. In these works, we provide state-of-the-art sample complexity results for policy-gradient and actor-critic algorithms in the absence of constraints. Further, we propose novel approaches to account for constraints in both tabular and parametrized setups. Finally, we consider the case where the objective function is a concave utility. The plot alongside compares the joint objective of the proposed policy gradient algorithms against the REINFORCE algorithm for an increasing number of queues, M, and a varying number of samples, N, used to compute gradient estimates, on a queueing application with fairness objectives. REINFORCE learns policies that improve the objective value, but they are not as good as the policies our algorithm learns.
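For context, here is a hedged sketch of the REINFORCE-style baseline used in the comparison: an N-sample score-function gradient estimate for a softmax policy. The toy random MDP stands in for the queueing model, and S, A, H, N, and the step size are illustrative placeholders rather than the settings from the plot.

```python
# Minimal REINFORCE sketch: average N per-trajectory score-function gradient
# estimates, then take a plain gradient-ascent step. All sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, H, N = 4, 2, 0.9, 30, 16           # states, actions, horizon, samples
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel P[s, a] -> dist over s'
R = rng.random((S, A))                           # rewards in [0, 1]
theta = np.zeros((S, A))                         # softmax policy parameters

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_grad():
    """Average N per-trajectory estimates of return * grad log-likelihood."""
    g = np.zeros_like(theta)
    for _ in range(N):
        s, G, score = int(rng.integers(S)), 0.0, np.zeros_like(theta)
        for t in range(H):
            p = pi(s)
            a = int(rng.choice(A, p=p))
            score[s] += np.eye(A)[a] - p         # grad of log softmax at (s, a)
            G += gamma**t * R[s, a]
            s = int(rng.choice(S, p=P[s, a]))
        g += G * score
    return g / N

for _ in range(200):                             # gradient ascent on the return
    theta += 0.1 * reinforce_grad()
```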
Sample Complexity Analysis for Model-Free Algorithms
Despite the empirical effectiveness of model-free algorithms, their theoretical underpinnings remain relatively unexplored, especially with neural network parametrization. In this work, we aim to develop novel algorithms and establish state-of-the-art sample complexity guarantees for them; a minimal sketch of the natural policy gradient step at the heart of these analyses follows the references below.
- Washim Uddin Mondal and Vaneet Aggarwal, "Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes," Oct 2023.
- Mudit Gaur, Amrit Singh Bedi, Di Wang, and Vaneet Aggarwal, "On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization," Jun 2023.
- Mudit Gaur, Vaneet Aggarwal, and Mridul Agarwal, "On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization," in Proc. ICML, Jul 2023.
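The sketch below shows one natural policy gradient step under a general parametrization, using linear-in-features softmax for brevity. The feature dimension, the (phi, action, advantage) sample format, the regularizer, and the step size are assumptions for illustration, not the exact constructions from the papers above.

```python
# One NPG step: solve F x = g for the update direction, where F is the
# empirical Fisher matrix and g the advantage-weighted score-function gradient.
import numpy as np

rng = np.random.default_rng(1)
d, A = 8, 3                                      # feature dimension, action count
w = np.zeros((A, d))                             # policy parameters, one row per action

def policy(phi):
    logits = w @ phi
    z = np.exp(logits - logits.max())
    return z / z.sum()

def npg_step(samples, eta=0.1, reg=1e-3):
    """samples: iterable of (phi, action, advantage) triples from rollouts."""
    g = np.zeros(A * d)
    F = np.zeros((A * d, A * d))
    for phi, a, adv in samples:
        p = policy(phi)
        score = ((np.eye(A)[a] - p)[:, None] * phi[None, :]).ravel()
        g += adv * score
        F += np.outer(score, score)
    g /= len(samples)
    F = F / len(samples) + reg * np.eye(A * d)   # regularize so F is invertible
    return eta * np.linalg.solve(F, g).reshape(A, d)

# Illustrative usage with synthetic samples in place of real rollouts.
samples = [(rng.standard_normal(d), int(rng.integers(A)), float(rng.standard_normal()))
           for _ in range(256)]
w += npg_step(samples)
```

The regularizer keeps the empirical Fisher matrix well-conditioned when the sample count is small relative to the parameter dimension.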
Sample Complexity for Constrained Reinforcement Learning in Tabular Setup
We use a randomized primal-dual approach to solve reinforcement learning with constraints, achieving zero constraint violation with order-optimal objective regret.
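A minimal sketch of this primal-dual idea in the tabular setting, assuming a cost-constrained MDP: learn Q-estimates for reward and cost, act greedily on the Lagrangian Qr - lam * Qc, and run dual ascent against a budget tightened to b - eps, which is the conservative mechanism behind driving constraint violation to zero. The MDP, budget, step sizes, and clipping range below are all illustrative.

```python
# Tabular primal-dual sketch: Q-learning on reward and cost signals, with a
# dual variable lam adjusted toward a conservatively tightened budget b - eps.
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 2, 0.9
b, eps = 4.0, 0.1                                # cost budget and conservative margin
P = rng.dirichlet(np.ones(S), size=(S, A))
Rr, Rc = rng.random((S, A)), rng.random((S, A))  # reward and cost signals
Qr, Qc = np.zeros((S, A)), np.zeros((S, A))
lam, s = 1.0, 0

for t in range(1, 20001):
    if rng.random() < 0.1:                       # epsilon-greedy exploration
        a = int(rng.integers(A))
    else:
        a = int(np.argmax(Qr[s] - lam * Qc[s]))  # greedy on the Lagrangian
    s2 = int(rng.choice(S, p=P[s, a]))
    alpha = 1.0 / np.sqrt(t)
    a2 = int(np.argmax(Qr[s2] - lam * Qc[s2]))
    Qr[s, a] += alpha * (Rr[s, a] + gamma * Qr[s2, a2] - Qr[s, a])
    Qc[s, a] += alpha * (Rc[s, a] + gamma * Qc[s2, a2] - Qc[s, a])
    # Dual ascent: raise lam when estimated cost exceeds the tightened budget.
    lam = float(np.clip(lam + alpha * (Qc[s, a] - (b - eps)), 0.0, 20.0))
    s = s2
```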
Sample Complexity for Constrained Reinforcement Learning in Parametrized Setup
We propose a novel Conservative Natural Policy Gradient Primal-Dual algorithm (C-NPG-PD) that achieves zero constraint violation while attaining state-of-the-art convergence rates for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum up to an approximation error due to the restricted policy class.
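Below is a hedged sketch of the C-NPG-PD loop, specialized to a tabular softmax policy so the NPG step on the Lagrangian takes the known closed form theta += (eta / (1 - gamma)) * A_lag. Exact Bellman solves stand in for the algorithm's sample-based estimators, and all constants are illustrative.

```python
# C-NPG-PD-style loop: NPG primal step on the Lagrangian Qr - lam * Qc,
# projected dual step toward the tightened budget b - eps.
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 4, 3, 0.9
b, eps = 4.0, 0.05                               # cost budget, conservative margin
P = rng.dirichlet(np.ones(S), size=(S, A))
Rr, Rc = rng.random((S, A)), rng.random((S, A))  # reward and cost signals
theta, lam = np.zeros((S, A)), 0.0
rho = np.ones(S) / S                             # start-state distribution

def softmax_pi(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_values(pi, R):
    """Exact Q^pi from the Bellman linear system (stand-in for estimates)."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    return R + gamma * P @ V

for _ in range(500):
    pi = softmax_pi(theta)
    Qr, Qc = q_values(pi, Rr), q_values(pi, Rc)
    Q_lag = Qr - lam * Qc                        # Lagrangian Q-values
    A_lag = Q_lag - (pi * Q_lag).sum(axis=1, keepdims=True)
    theta += (0.5 / (1 - gamma)) * A_lag         # closed-form NPG (primal) step
    Vc = float(rho @ (pi * Qc).sum(axis=1))
    lam = float(np.clip(lam + 0.05 * (Vc - (b - eps)), 0.0, 50.0))  # dual step
```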
Impact of Concave Utility
We use a policy gradient approach with concave objective functions, which allows us to scale to large state spaces. We note that the gradient estimate is biased, which makes the problem challenging. For the tabular setup, a randomized primal-dual approach is used to handle both the concave utility and the constraints; a sketch of the biased gradient estimate follows the references below.
- Qinbo Bai, Amrit Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal, "Achieving Zero Constraint Violation for Constrained Concave Utility Reinforcement Learning via Primal-Dual Approach," submitted to JAIR, 2023.
- Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal, "Joint Optimization of Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm," Journal of Artificial Intelligence Research, vol. 74, pp. 1565-1597, Aug 2022.
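The sketch below illustrates why the gradient estimate is biased, assuming a fairness utility f(V_1, ..., V_M) = sum_i log V_i over M objective values: the chain rule places the estimated V_i inside f', so even unbiased REINFORCE estimates of each grad V_i yield a biased estimate of grad f. The toy MDP and all constants are illustrative.

```python
# Concave-utility policy gradient sketch for f = sum_i log V_i: estimate each
# V_i and grad V_i by Monte Carlo, then chain-rule through the estimated V_i.
import numpy as np

rng = np.random.default_rng(4)
S, A, M, gamma, H, N = 4, 2, 2, 0.9, 30, 32      # M = number of objectives
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.random((M, S, A))                        # one reward signal per objective
theta = np.zeros((S, A))

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def rollout_estimates():
    """Monte Carlo estimates of each V_i and its REINFORCE-style gradient."""
    V, G = np.zeros(M), np.zeros((M, S, A))
    for _ in range(N):
        s, ret, score = int(rng.integers(S)), np.zeros(M), np.zeros((S, A))
        for t in range(H):
            p = pi(s)
            a = int(rng.choice(A, p=p))
            score[s] += np.eye(A)[a] - p
            ret += gamma**t * R[:, s, a]
            s = int(rng.choice(S, p=P[s, a]))
        V += ret
        G += ret[:, None, None] * score[None]
    return V / N, G / N

for _ in range(100):
    V, G = rollout_estimates()
    # Chain rule for f = sum_i log V_i: grad f = sum_i (1 / V_i) * grad V_i.
    grad = np.einsum('m,msa->sa', 1.0 / np.maximum(V, 1e-3), G)
    theta += 0.1 * grad                          # biased: estimated V sits inside f'
```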