q-Learning in Continuous Time
Session 5: Risk Modelling and Machine Learning
Abstract: We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes. We then apply the theory to devise various actor–critic algorithms for solving the underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly.
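As a rough illustration of the first-order approximation mentioned in the abstract (a sketch only; the symbols $J$, $q$, the temperature $\gamma$, and the time step $\Delta t$ are notational assumptions not spelled out above), the little q-function can be read as the first-order coefficient in an expansion of the conventional Q-function over a short acting period:
\[
Q_{\Delta t}(t,x,a) \;\approx\; J(t,x) \;+\; q(t,x,a)\,\Delta t \;+\; o(\Delta t),
\]
so that, as $\Delta t \to 0$, the big Q-function collapses to the value function $J$ while $q$ survives as its instantaneous rate. Under this reading, the Gibbs measures referred to at the end of the abstract would take the softmax-like form $\pi(a \mid t,x) \propto \exp\{q(t,x,a)/\gamma\}$, with $\gamma > 0$ the entropy-regularization temperature.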