Reinforcement Learning: Concave Utilities and Constraints:


Many real-world problems require optimization of an objective that is non-linear in cumulative rewards, e.g., fairness objectives in scheduling. Further, most engineering problems have constraints, e.g., reaching safely, average power constraints. In this work, we aim to innovate on efficient algorithms taking into account non-linear objectives and constraints. Reinforcement Learning algorithms such as DQN owe their success to Markov Decision Processes, and the fact that maximizing the sum of rewards allows using backward induction and reduce to the Bellman optimality equation. These fail to hold in the presence of non-linear objectives and/or constraints making the analysis challenging. We have proposed multiple forms of algorithms with provable guarantees for these problems. As shown alongside, for a network traffic optimization problem, the proposed Actor-Critic (IFRL AC) and Policy Gradient (IFRL PG) approaches significantly outperform the standard approaches that do not explicitly account for non-linearity. In the Table below, we consider two forms of algorithms with a mention of the results with either concave utility functions or constraints or both. The model-based approaches use posterior or optimistic sampling approaches, the model-free approaches are based on policy gradients or primal-dual approaches. X mentions that we have the results for those cases, and the others represent the scenarios where we are working towards efficient results.


Concave Utility


Concave Utility and Constraints







Model-Free Finite State Space




Model-Free Infinite State Space




Model-Based Approach:

In this approach, we use posterior sampling with Dirichlet distribution, and propose an approach that gives stochastic policy with order-optimal regret guarantees. Further, we use optimism to estimate the MDP, and use efficient approaches to find the regret guarantees when the objective is non-linear and there are constraints. We achieve zero constraint violations with order-optimal objective regret.

Model-Free Finite State Space:

We use randomized primal-dual approach to solve reinforcement learning with concave objective functions and constraints. We achieve zero constraint violations with order-optimal objective regret.

Model-Free Infinite State Space:

We use policy gradient approach with concave objective functions, that allows us to go to large state space. We note that the gradient estimate is biased making the problem challenging.

Q-learning based algorithms with Constraints:

Modification of Q-learning based algorithms with constraints is an important problem. The average constraints make optimal policy non-deterministic making Q-learning directly not applicable while the peak constraint need to be learnt efficiently. In the following works, convergence/regret guarantees are shown for Q-learning based algorithms with peak/average constraints.