Discounted Reward Reinforcement Learning


Sample complexity guarantees for discounted MDPs in model-free setups are far from the known lower bounds in many settings. In these works, we provide state-of-the-art sample complexity results for policy-gradient and actor-critic algorithms in the absence of constraints. Further, novel approaches are proposed to account for constraints in both tabular and parametrized setups. Finally, we consider the case where the objective function is a concave utility. The plot alongside compares the joint objective of the proposed policy gradient algorithms and the REINFORCE algorithm for an increasing number of queues, M, and a varying number of samples, N, used to compute gradient estimates, for an application of queueing with fairness objectives. The REINFORCE algorithm learns policies that improve the objective value, but they are not as good as the policies learned by our algorithm.
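
For reference, below is a minimal sketch of a REINFORCE-style estimator that averages N trajectory-wise score-function gradients for a discounted MDP with a tabular softmax policy. The toy random MDP, horizon, and step size are illustrative assumptions, not the queueing environment or the exact estimator used in the experiments.

    # REINFORCE-style gradient estimate from N sampled trajectories (illustrative toy MDP).
    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma, H, N = 5, 3, 0.9, 40, 100        # states, actions, discount, horizon, samples

    P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
    r = rng.uniform(size=(S, A))                  # reward table
    theta = np.zeros((S, A))                      # softmax policy parameters

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    def reinforce_gradient():
        """Average of N trajectory-wise score-function gradient estimates."""
        grad = np.zeros_like(theta)
        for _ in range(N):
            s, ret, score = 0, 0.0, np.zeros_like(theta)
            for t in range(H):
                p = policy(s)
                a = rng.choice(A, p=p)
                score[s] -= p                     # grad of log pi(a|s) for softmax is e_a - p
                score[s, a] += 1.0
                ret += (gamma ** t) * r[s, a]
                s = rng.choice(S, p=P[s, a])
            grad += ret * score                   # return times accumulated score function
        return grad / N

    for it in range(200):                         # plain stochastic gradient ascent
        theta += 0.1 * reinforce_gradient()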


Sample Complexity Analysis for Model-Free Algorithms

Despite the empirical effectiveness of model-free algorithms, their theoretical underpinnings remain relatively unexplored, especially under neural network parametrization. In this work, we aim to develop novel algorithms and establish state-of-the-art sample complexity guarantees for them.


Sample Complexity for Constrained Reinforcement Learning in Tabular Setup

We use a randomized primal-dual approach to solve reinforcement learning with constraints, achieving zero constraint violation with order-optimal objective regret.
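
The sketch below illustrates the generic primal-dual idea in the tabular setting: the primal step runs Q-learning on the Lagrangian reward r + lam * g, and the dual step is a projected subgradient update on lam. It is a hedged illustration of the template, not the exact randomized algorithm analyzed in this work; the toy MDP, threshold b, and step sizes are assumptions.

    # Primal-dual template for tabular constrained RL (illustrative, not the paper's algorithm).
    import numpy as np

    rng = np.random.default_rng(1)
    S, A, gamma, b = 4, 2, 0.9, 1.5               # constraint: discounted utility V_g >= b
    P = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))                  # reward to maximize
    g = rng.uniform(size=(S, A))                  # constraint utility

    Q, lam = np.zeros((S, A)), 0.0

    def run_episode(alpha=0.1, eps=0.1, H=50):
        """One episode of Q-learning on the Lagrangian reward; returns the constraint return."""
        s, g_ret = 0, 0.0
        for t in range(H):
            a = rng.integers(A) if rng.random() < eps else int(Q[s].argmax())
            s2 = rng.choice(S, p=P[s, a])
            lagr = r[s, a] + lam * g[s, a]                         # Lagrangian reward
            Q[s, a] += alpha * (lagr + gamma * Q[s2].max() - Q[s, a])
            g_ret += (gamma ** t) * g[s, a]
            s = s2
        return g_ret

    for k in range(2000):
        g_ret = run_episode()
        lam = max(0.0, lam - 0.01 * (g_ret - b))                   # dual projected subgradient step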


Sample Complexity for Constrained Reinforcement Learning in Parametrized Setup

We propose a novel Conservative Natural Policy Gradient Primal-Dual (C-NPG-PD) algorithm that achieves zero constraint violation along with state-of-the-art convergence rates for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum up to an approximation error induced by the restricted policy class.
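
The following is a minimal sketch of one natural policy gradient primal-dual iteration with a tabular softmax parametrization, computed exactly on a toy MDP to keep it short. C-NPG-PD itself operates with sample-based estimates, general parametrization, and a conservative (tightened) constraint; the toy MDP, threshold b, and step sizes below are illustrative assumptions.

    # One NPG primal-dual iteration per loop pass (illustrative sketch, exact quantities).
    import numpy as np

    rng = np.random.default_rng(2)
    S, A, gamma, b = 4, 3, 0.9, 1.0
    P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s']
    r = rng.uniform(size=(S, A))                  # objective reward
    g = rng.uniform(size=(S, A))                  # constraint utility, want V_g >= b
    rho = np.full(S, 1.0 / S)                     # initial state distribution
    theta, lam = np.zeros((S, A)), 0.0

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def values(pi, reward):
        """Exact V, Q, and discounted state occupancy d_rho for policy pi."""
        P_pi = np.einsum('sa,sap->sp', pi, P)
        R_pi = (pi * reward).sum(axis=1)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        Q = reward + gamma * P @ V
        d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
        return V, Q, d

    for it in range(300):
        pi = softmax(theta)
        Vl, Ql, d = values(pi, r + lam * g)       # Lagrangian value function
        Adv = Ql - Vl[:, None]
        # Natural gradient: solve F(theta) w = grad, block-diagonal per state for softmax.
        w = np.zeros_like(theta)
        for s in range(S):
            F_s = d[s] * (np.diag(pi[s]) - np.outer(pi[s], pi[s]))
            g_s = d[s] * pi[s] * Adv[s] / (1 - gamma)
            w[s] = np.linalg.pinv(F_s) @ g_s
        theta += 0.5 * w                                           # primal NPG ascent step
        Vg, _, _ = values(pi, g)
        lam = max(0.0, lam - 0.05 * (rho @ Vg - b))                # dual projected gradient step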


Impact of Concave Utility

We use a policy gradient approach with concave objective functions, which allows us to scale to large state spaces. The gradient estimate is biased in this setting, which makes the problem challenging. For the tabular setup, a randomized primal-dual approach is used to handle both the concave utility and the constraints.
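
To see where the gradient comes from, note that for a concave utility F of the occupancy measure, the chain rule reduces its gradient to an ordinary policy gradient with the pseudo-reward dF/dlambda evaluated at the current occupancy; when the occupancy is estimated from samples, plugging the estimate into this nonlinear derivative is what biases the stochastic gradient. The sketch below uses a log (fairness-style) utility and exact occupancies on a toy MDP; both are illustrative assumptions, not the algorithm analyzed in this work.

    # Policy gradient for a concave utility of the occupancy measure (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(3)
    S, A, gamma = 4, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))
    rho = np.full(S, 1.0 / S)
    theta = np.zeros((S, A))

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def occupancy(pi):
        """Exact discounted state-action occupancy lambda_pi(s, a)."""
        P_pi = np.einsum('sa,sap->sp', pi, P)
        d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
        return d[:, None] * pi

    def policy_gradient(pi, reward):
        """Exact policy gradient for a linear reward under softmax parametrization."""
        P_pi = np.einsum('sa,sap->sp', pi, P)
        R_pi = (pi * reward).sum(axis=1)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        Q = reward + gamma * P @ V
        d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
        return d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)

    for it in range(300):
        pi = softmax(theta)
        lam = occupancy(pi)
        pseudo_reward = 1.0 / (lam + 1e-3)        # dF/dlambda for F(lam) = sum log(lam + 1e-3)
        theta += 0.5 * policy_gradient(pi, pseudo_reward)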

