Let $Y$ be a random variable with parametric probability density function $p(y \mid \theta)$. Assume that $p(y \mid \theta)$ is concave and differentiable in $\theta \in \Theta \subset \mathbb{R}$, and that its partial derivative with respect to $\theta$ is Lipschitz continuous with constant $L > 0$, i.e.,

$$\left| \frac{\partial p(y \mid \theta_1)}{\partial \theta} - \frac{\partial p(y \mid \theta_2)}{\partial \theta} \right| \le L \, |\theta_1 - \theta_2|$$

for all $\theta_1, \theta_2 \in \Theta$.

Suppose we are also given a sequence of observations $\{Y_t\}_{t \in \mathbb{Z}}$ drawn from the distribution of $Y$.

Then, letting

$$\theta_{t+1} = \theta_t + \alpha \, \nabla p(Y_t \mid \theta_t), \quad t \in \mathbb{Z}$$

where $0 < \alpha \le 1/L$ and an initial value $\theta_0$ is chosen to start the iteration, the claim is that

$$\mathbb{E}_{Y_t}\!\left(\, I\big(p(\cdot \mid \theta), p(\cdot \mid \theta_t)\big) - I\big(p(\cdot \mid \theta), p(\cdot \mid \theta_{t+1})\big) \,\middle|\, Y_{t-1} \right) > 0 \qquad \forall\, t \in \mathbb{Z}$$

where $I\big(p(\cdot \mid \theta), p(\cdot \mid \theta_t)\big)$ is the Kullback–Leibler divergence, defined as usual by

$$I\big(p(\cdot \mid \theta), p(\cdot \mid \theta_t)\big) := \int_{\mathbb{R}} p(y \mid \theta) \ln\!\left(\frac{p(y \mid \theta)}{p(y \mid \theta_t)}\right) dy$$

(Also suppose that $\theta_t \ne \theta$.)
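As a concrete instance of this definition (a worked example, with a Gaussian model I am assuming for illustration), take unit-variance normal densities $p(y \mid \theta) = \mathcal{N}(y; \theta, 1)$. Then $\ln\big(p(y \mid \theta)/p(y \mid \theta_t)\big) = -\tfrac{1}{2}(y-\theta)^2 + \tfrac{1}{2}(y-\theta_t)^2$, and since $\mathbb{E}\big[(Y-\theta_t)^2\big] = 1 + (\theta - \theta_t)^2$ under $p(\cdot \mid \theta)$, the divergence has the closed form

$$I\big(p(\cdot \mid \theta), p(\cdot \mid \theta_t)\big) = \frac{(\theta - \theta_t)^2}{2},$$

so in this model the divergence decreases exactly when $\theta_t$ moves closer to $\theta$.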

The motivation for this claim is that, since we are performing a stochastic gradient ascent step in the parameter, on average we move toward the maximizer of the likelihood; it should then follow (from the principle of maximum likelihood) that at each step $p(\cdot \mid \theta_t)$ gets closer, in Kullback–Leibler divergence, to $p(\cdot \mid \theta)$, the true probability density. Is this true?
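For anyone who wants to probe the claim numerically, here is a small simulation sketch. Everything in it is my own assumption, not part of the statement above: the model is a unit-variance Gaussian with unknown mean, and the true parameter, initial value, and step size are arbitrary choices. Note that the update ascends the density $p(Y_t \mid \theta_t)$ itself, not the log-density, to match the iteration in the question.

```python
import math
import random

random.seed(0)

THETA_TRUE = 2.0   # true parameter theta (assumed for this sketch)
THETA_INIT = -1.0  # starting point theta_0 (assumed)
ALPHA = 0.2        # step size, assumed small enough for this model
STEPS = 5000

def density(y, theta):
    """Unit-variance Gaussian density p(y | theta) = N(y; theta, 1)."""
    return math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2.0 * math.pi)

def grad_density(y, theta):
    """Partial derivative of p(y | theta) with respect to theta."""
    return density(y, theta) * (y - theta)

def kl(theta_a, theta_b):
    """KL divergence between N(theta_a, 1) and N(theta_b, 1), closed form."""
    return 0.5 * (theta_a - theta_b) ** 2

theta = THETA_INIT
kl_start = kl(THETA_TRUE, theta)
for _ in range(STEPS):
    y_t = random.gauss(THETA_TRUE, 1.0)        # observation Y_t from the true density
    theta += ALPHA * grad_density(y_t, theta)  # stochastic ascent on p(Y_t | theta_t)
kl_end = kl(THETA_TRUE, theta)

print(f"theta after {STEPS} steps: {theta:.3f}")
print(f"KL at start: {kl_start:.4f}, KL at end: {kl_end:.4f}")
```

On typical runs the divergence shrinks overall, but with a fixed step size $\theta_t$ keeps fluctuating around $\theta$, so any decrease can only hold on average — which is presumably why the claim is stated as a conditional expectation rather than a pathwise inequality.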