I’m reading the algorithm on page 12 of An Introduction to Counterfactual Regret Minimization. On lines 25 and 26, we accumulate new values into $r_I$ and $s_I$:

- $25.\quad r_I(a) \leftarrow r_I(a) + \pi_{-i} \cdot (v_{\sigma_{I \rightarrow a}}(a) - v_{\sigma}(a))$
- $26.\quad s_I(a) \leftarrow s_I(a) + \pi_{i} \cdot \sigma^t(I, a)$

$r_I(a)$ is the accumulated regret for information set $I$ and action $a$. $s_I(a)$ is the accumulated strategy for information set $I$ and action $a$.

$\pi_{i}$ is the probability of reaching this game state for the learning player (for whom we’re updating strategy and regret values in the current CFR iteration). $\pi_{-i}$ is the probability of reaching this game state for the other player.

Why do we multiply by $\pi_{-i}$ and $\pi_{i}$ when accumulating the regret and strategy on lines 25 and 26? Couldn’t we just do this:

- $25.\quad r_I(a) \leftarrow r_I(a) + (v_{\sigma_{I \rightarrow a}}(a) - v_{\sigma}(a))$
- $26.\quad s_I(a) \leftarrow s_I(a) + \sigma^t(I, a)$
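To make the two variants concrete, here is a minimal Python sketch of the accumulation step for a single information set. The function name `accumulate`, its parameters, the `weighted` flag, and the example numbers are all my own illustration and not taken from the paper; `reach_self` stands for $\pi_{i}$ and `reach_opp` for $\pi_{-i}$.

```python
def accumulate(regret_sum, strategy_sum, strategy, action_values, node_value,
               reach_self, reach_opp, weighted=True):
    """Update cumulative regret and strategy for one information set, in place.

    weighted=True  mirrors lines 25 and 26 as written in the paper.
    weighted=False is the variant I'm asking about (no reach probabilities).
    """
    regret_weight = reach_opp if weighted else 1.0     # pi_{-i} or 1
    strategy_weight = reach_self if weighted else 1.0  # pi_{i} or 1
    for a in range(len(strategy)):
        regret_sum[a] += regret_weight * (action_values[a] - node_value)
        strategy_sum[a] += strategy_weight * strategy[a]


# Hypothetical numbers for one visit to an information set with two actions.
strategy = [0.5, 0.5]
action_values = [1.0, -1.0]
node_value = sum(p * v for p, v in zip(strategy, action_values))

# Variant from the paper: scaled by the reach probabilities.
r_w, s_w = [0.0, 0.0], [0.0, 0.0]
accumulate(r_w, s_w, strategy, action_values, node_value,
           reach_self=0.5, reach_opp=0.25)

# Proposed variant: same visit, no scaling.
r_u, s_u = [0.0, 0.0], [0.0, 0.0]
accumulate(r_u, s_u, strategy, action_values, node_value,
           reach_self=0.5, reach_opp=0.25, weighted=False)
```

In this sketch the only difference between the two variants is the constant each visit's contribution is scaled by, which is exactly the part I don't understand the need for.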

It seems to me that it doesn’t matter exactly how much we adjust the strategy and regrets in any single CFR iteration: so long as we do enough CFR iterations, won’t we end up with good values for $r_I$ and $s_I$ in the end?