The Mathematics of Hidden Markov Models

This note covers the mathematics of Hidden Markov Models: filtering, smoothing, and parameter estimation via Expectation-Maximization.

Model Definition

A Hidden Markov Model (HMM) consists of a sequence of hidden states $z_{1:T}$, with $z_t \in \{1,\dots,K\}$, evolving over time, where each state $z_t$ generates an observation $x_t$. The joint distribution factorizes as

p(z_{1:T}, x_{1:T}) = p(z_{1}) \prod_{t=2}^{T} p(z_{t} \mid z_{t-1}) \prod_{t=1}^{T} p(x_{t} \mid z_{t}) \tag{1}

The factorization encodes two key conditional independence assumptions: (1) the latent states form a first-order Markov chain, so $z_t$ depends on the past only through $z_{t-1}$, and (2) each observation depends only on the contemporaneous latent state.
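As a quick check of factorization (1), the sketch below evaluates the joint probability of one state sequence and one observation sequence. It assumes categorical emissions and 0-based state labels; the names `pi`, `A`, `E`, and `log_joint`, as well as the numbers, are illustrative choices rather than anything fixed by the text.

```python
import numpy as np

def log_joint(z, x, pi, A, E):
    """log p(z_{1:T}, x_{1:T}) via factorization (1), categorical emissions.

    pi[k]   = p(z_1 = k)                  (assumed name)
    A[j, k] = p(z_t = k | z_{t-1} = j)    (assumed name)
    E[k, m] = p(x_t = m | z_t = k)        (assumed name)
    """
    lp = np.log(pi[z[0]])                      # initial state term
    for t in range(1, len(z)):
        lp += np.log(A[z[t - 1], z[t]])        # transition terms
    for t in range(len(z)):
        lp += np.log(E[z[t], x[t]])            # emission terms
    return lp

# Toy two-state model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.9, 0.1], [0.3, 0.7]])

z = [0, 0, 1]
x = [0, 0, 1]
p = np.exp(log_joint(z, x, pi, A, E))
```

Summing `np.exp(log_joint(...))` over every possible pair of state and observation sequences of a fixed length should give 1, which is a useful sanity check on the parameter tables.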


Figure 1: Graphical structure of a Hidden Markov Model. Latent states $z_{1:T}$ form a Markov chain, and each observation $x_t$ depends only on $z_t$ .

Inference

Since only $x_{1:T}$ is observed, there are two major inference tasks: (1) filtering, which computes $p(z_t=k \mid x_{1:t})$, and (2) smoothing, which computes $p(z_{t}=k \mid x_{1:T})$. Smoothing also uses the observations after time $t$ to compute the posterior. Given these results, other tasks such as predicting future latent states or observations become straightforward. We first look at filtering.

Filtering: Inferring Current Latent State

Filtering is the task of computing the posterior $p(z_t=k \mid x_{1:t})$ over latent states given observations up to time $t$ . Writing

p(z_{t}=k \mid x_{1:t}) = \frac{p(x_{1:t}, z_{t}=k)}{p(x_{1:t})} \tag{2}

it suffices to track the numerator $p(x_{1:t}, z_{t}=k)$ . Using the conditional independence implied by (1), $x_{t} \perp (z_{t-1}, x_{1:t-1}) \mid z_{t}$ and $z_{t} \perp x_{1:t-1} \mid z_{t-1}$ , we have

\begin{align*} p(x_{1:t}, z_{t}=k) &= \sum_{j=1}^{K} p(z_{t}=k, z_{t-1}=j, x_{1:t}) \\ &= \sum_{j=1}^{K} p(x_{t} \mid z_{t}=k)\, p(z_{t}=k \mid z_{t-1}=j)\, p(z_{t-1}=j, x_{1:t-1}) \\ &= p(x_{t} \mid z_{t}=k) \sum_{j=1}^{K} p(z_{t}=k \mid z_{t-1}=j)\, p(z_{t-1}=j, x_{1:t-1}) \end{align*}

Definition (Forward variable). Define $\alpha_t(k) \equiv p(x_{1:t}, z_t=k)$ for $t \geq 1$ .

Proposition (Forward recursion). The forward variables satisfy, for $t \geq 2$ :

\alpha_t(k) = p(x_t \mid z_t = k) \sum_{j=1}^{K} p(z_t = k \mid z_{t-1} = j)\, \alpha_{t-1}(j)

and, for $t = 1$:

\alpha_1(k) = p(x_1 \mid z_1 = k)\, p(z_1 = k)

The marginal likelihood (probability of the data) $p(x_{1:t})$ can be recovered by summing the forward variables $\alpha_t$ over the latent states:

p(x_{1:t}) = \sum_{i=1}^{K} p(x_{1:t}, z_{t}=i) = \sum_{i=1}^{K} \alpha_{t}(i) \tag{3}

Substituting the forward recursion and (3) into (2) gives the filtering distribution:

p(z_{t}=k \mid x_{1:t}) = \frac{p(x_{t} \mid z_{t}=k) \displaystyle\sum_{j=1}^{K} p(z_{t}=k \mid z_{t-1}=j)\, \alpha_{t-1}(j)}{\displaystyle\sum_{i=1}^{K} \alpha_{t}(i)} \tag{4}


Figure 2: Forward filtering message. Filtering alternates between prediction through the transition model and correction by the current observation.
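The forward recursion and the filtering distribution (4) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: categorical emissions, 0-based indexing, and the hypothetical names `pi`, `A`, `E`, `lik`, and `forward`. The array `lik[t, k]` holds $p(x_t \mid z_t = k)$.

```python
import numpy as np

def forward(pi, A, lik):
    """Unnormalized forward variables: alpha[t, k] = p(x_{1:t}, z_t = k).

    Time is 0-based here, so row 0 corresponds to t = 1 in the text.
    """
    T, K = lik.shape
    alpha = np.zeros((T, K))
    alpha[0] = lik[0] * pi                         # base case
    for t in range(1, T):
        # predict through the transition model, then correct by the observation
        alpha[t] = lik[t] * (A.T @ alpha[t - 1])
    return alpha

# Toy two-state model with binary observations (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])     # A[j, k] = p(z_t = k | z_{t-1} = j)
E = np.array([[0.9, 0.1], [0.3, 0.7]])     # E[k, m] = p(x_t = m | z_t = k)
x = np.array([0, 1, 1, 0])

lik = E[:, x].T                            # lik[t, k] = p(x_t | z_t = k)
alpha = forward(pi, A, lik)
filtered = alpha / alpha.sum(axis=1, keepdims=True)   # equation (4), one row per t
evidence = alpha[-1].sum()                             # equation (3) at t = T
```

For long sequences the unnormalized `alpha` values shrink geometrically and can underflow; a common remedy, not shown here, is to renormalize each row and accumulate the logarithms of the normalizers.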

Smoothing: Inferring an Intermediate Latent State

In the context of HMMs, smoothing refers to finding $p(z_{t} \mid x_{1:T})$, making use of the full observation sequence to infer an intermediate latent state. Intuitively, observations after time $t$ still carry information about $z_t$ through the Markov chain. We start with the object of interest:

p(z_{t} = i \mid x_{1:T}) = \frac{p(x_{1:T}, z_{t}=i)}{p(x_{1:T})} \tag{5}

Focusing on the numerator:

\begin{align*} p(x_{1:T}, z_{t}=i) &= p(x_{1:t}, z_{t}=i, x_{t+1:T}) \\ &= p(x_{1:t}, z_{t}=i)\, p(x_{t+1:T} \mid z_{t}=i) \end{align*}

where the second line uses the conditional independence $x_{1:t} \perp x_{t+1:T} \mid z_{t}$ implied by the HMM graph. Recognizing the first factor as the forward variable:

p(x_{1:T}, z_{t}=i) = \alpha_{t}(i)\, p(x_{t+1:T} \mid z_{t}=i)

Definition (Backward variable). Define $\beta_t(i) \equiv p(x_{t+1:T} \mid z_t = i)$ .

Proposition (Backward recursion). For $1 \leq t \leq T-1$ :

\beta_t(i) = \sum_{j=1}^{K} p(x_{t+1} \mid z_{t+1}=j)\, A_{ij}\, \beta_{t+1}(j)

with terminal condition $\beta_T(i) = 1$, where $A_{ij} \equiv p(z_{t+1}=j \mid z_{t}=i)$ denotes the transition probability.

Proof.

\begin{align*} \beta_t(i) &= p(x_{t+1:T} \mid z_{t}=i) = \sum_{j=1}^{K} p(x_{t+1:T}, z_{t+1}=j \mid z_{t}=i) \\ &= \sum_{j=1}^{K} p(x_{t+1}, x_{t+2:T}, z_{t+1}=j \mid z_{t}=i) \\ &= \sum_{j=1}^{K} p(x_{t+1}, x_{t+2:T} \mid z_{t+1}=j, z_{t}=i)\, p(z_{t+1}=j \mid z_{t}=i) \end{align*}

and now because $x_{t+2:T} \perp (x_{t+1}, z_{t}) \mid z_{t+1}$ and $x_{t+1} \perp z_t \mid z_{t+1}$ :

\begin{align*} &= \sum_{j=1}^{K} p(x_{t+1} \mid z_{t+1}=j)\, \underbrace{p(x_{t+2:T} \mid z_{t+1}=j)}_{\beta_{t+1}(j)}\, \underbrace{p(z_{t+1}=j \mid z_{t}=i)}_{A_{ij}} \\ &= \sum_{j=1}^{K} p(x_{t+1} \mid z_{t+1}=j)\, A_{ij}\, \beta_{t+1}(j) \end{align*}

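For $t=T$ there are no future observations, so we define $\beta_{T}(j) = 1$ for $j \in \{1, \ldots, K\}$. With this choice $\alpha_T(j)\,\beta_T(j) = p(x_{1:T}, z_T=j)$, consistent with the factorization of the numerator of (5). $\square$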

Definition (Smoother). Define $\gamma_{t}(i) \equiv p(z_{t}=i \mid x_{1:T})$ .

Proposition (Smoothing distribution). Combining (5) with the definitions of $\alpha_t$ and $\beta_t$, for any $t$,

\gamma_t(i) = p(z_t=i \mid x_{1:T}) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{K} \alpha_t(j)\,\beta_t(j)}
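The backward recursion and the smoothing distribution can be sketched the same way as the forward pass. The block below is a minimal illustration assuming categorical emissions and 0-based indexing; `forward`, `backward`, `lik`, and the toy parameters are hypothetical names and values.

```python
import numpy as np

def forward(pi, A, lik):
    """alpha[t, k] = p(x_{1:t}, z_t = k), unnormalized (0-based t)."""
    T, K = lik.shape
    alpha = np.zeros((T, K))
    alpha[0] = lik[0] * pi
    for t in range(1, T):
        alpha[t] = lik[t] * (A.T @ alpha[t - 1])
    return alpha

def backward(A, lik):
    """beta[t, i] = p(x_{t+1:T} | z_t = i), with beta[T-1] = 1."""
    T, K = lik.shape
    beta = np.ones((T, K))                       # terminal condition
    for t in range(T - 2, -1, -1):
        # sum_j A[i, j] * p(x_{t+1} | z_{t+1} = j) * beta[t+1, j]
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    return beta

# Toy two-state model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 1, 1, 0])
lik = E[:, x].T

alpha, beta = forward(pi, A, lik), backward(A, lik)
joint = alpha * beta                             # p(x_{1:T}, z_t = i) for each t
gamma = joint / joint.sum(axis=1, keepdims=True) # smoothing distribution
```

Each row of `joint` sums to the same number, $p(x_{1:T})$, regardless of $t$, and the last row of `gamma` coincides with the filtering distribution at time $T$. Both facts make convenient unit tests for an implementation.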

Parameter Fitting

The model parameters consist of the initial state distribution, the transition distribution between latent states, and the emission distribution associated with each latent state.

In symbols:

\pi_k = p(z_1 = k), \qquad A_{jk} = p(z_t = k \mid z_{t-1} = j), \qquad B_k = \text{emission parameters for state } k

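Following the method of maximum likelihood estimation, we would like to maximize the probability of the data given the parameters. For $N$ independent observed sequences $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, each with its own latent sequence $\mathbf{z}_n$: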

\begin{align*} \hat{\boldsymbol{\theta}} &= \operatorname*{argmax}_{\theta \in \Theta}\ \log p(\mathbf{X} \mid \boldsymbol{\theta}) \\ &= \operatorname*{argmax}_{\theta \in \Theta}\ \log \left( \prod_{n=1}^{N} p(\mathbf{x}_{n} \mid \boldsymbol{\theta}) \right) \\ &= \operatorname*{argmax}_{\theta \in \Theta}\ \log \left( \prod_{n=1}^{N} \sum_{\mathbf{z}_n} p(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}) \right) \\ &= \operatorname*{argmax}_{\theta \in \Theta} \sum_{n=1}^{N} \log \sum_{\mathbf{z}_n} p(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}) \end{align*}

The sum over latent sequences appears inside the logarithm, which couples the parameters across all possible $\mathbf{z}_n$. This prevents the objective from decomposing into separable terms, so in general no closed-form maximizer exists.

Expectation-Maximization - E-step

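Under EM, we maximize the expected complete-data log-likelihood, where the expectation is taken under the posterior of the latent states at the current parameters $\theta^{\text{old}}$. For notational simplicity we consider a single sequence $x_{1:T}$; with several independent sequences, the corresponding terms simply add.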

\begin{align*} Q(\theta, \theta^{\text{old}}) &= \mathbb{E}_{p(z_{1:T} \mid x_{1:T}, \theta^{\text{old}})} \left[ \log p(x_{1:T}, z_{1:T} \mid \theta) \right] \\ &= \mathbb{E}_{p(z_{1:T} \mid x_{1:T}, \theta^{\text{old}})} \Biggl[ \sum_{k=1}^{K} \mathbb{I}(z_{1}=k) \log \pi_{k} \\ &\qquad\qquad + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \mathbb{I}(z_{t-1}=j, z_{t}=k) \log A_{jk} \\ &\qquad\qquad + \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}(z_{t}=k) \log p(x_{t} \mid z_{t}=k; B_{k}) \Biggr] \end{align*}

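Moving the expectation inside the sums, and using the fact that the expectation of an indicator is the corresponding posterior probability (all posteriors below are evaluated under $\theta^{\text{old}}$, which we suppress in the notation):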

\begin{align*} &= \sum_{k=1}^{K} p(z_{1}=k \mid x_{1:T}) \log \pi_{k} \\ &\qquad + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} p(z_{t-1}=j, z_{t}=k \mid x_{1:T}) \log A_{jk} \\ &\qquad + \sum_{t=1}^{T} \sum_{k=1}^{K} p(z_{t}=k \mid x_{1:T}) \log p(x_{t} \mid z_{t}=k; B_{k}) \\[6pt] &= \sum_{k=1}^{K} \gamma_{1}(k) \log \pi_{k} \\ &\qquad + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \underbrace{\xi_{t}(j,k)}_{p(z_{t-1}=j,\, z_t=k \mid x_{1:T})} \log A_{jk} \\ &\qquad + \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma_{t}(k) \log p(x_{t} \mid z_{t}=k; B_{k}) \end{align*}

Definition (Pairwise Smoother). Define $\xi_{t}(j,k) \equiv p(z_{t-1}=j,\, z_t=k \mid x_{1:T})$ .

Writing $\xi_{t}(j,k)$ in terms of objects we already have:

\begin{align*} \xi_{t}(j,k) &= p(z_{t-1}=j, z_{t}=k \mid x_{1:T}) \\ &= \frac{p(z_{t-1}=j, z_{t}=k, x_{1:T})}{p(x_{1:T})} \\ &= \frac{p(z_{t-1}=j, z_{t}=k, x_{1:t-1}, x_{t}, x_{t+1:T})}{p(x_{1:T})} \\ &= \frac{p(x_{1:t-1}, z_{t-1}=j)\, p(z_{t}=k \mid z_{t-1}=j)\, p(x_{t} \mid z_{t}=k)\, p(x_{t+1:T} \mid z_{t}=k)}{p(x_{1:T})} \\ &= \frac{\alpha_{t-1}(j)\, A_{jk}\, p(x_{t} \mid z_{t}=k)\, \beta_{t}(k)}{\sum_{i=1}^{K} \alpha_{t}(i)\, \beta_{t}(i)} \end{align*}
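The final expression for $\xi_t(j,k)$ translates directly into array operations. The sketch below assumes categorical emissions and 0-based indexing, so `xi[s]` holds the pair $(z_{s+1}, z_{s+2})$ in the text's 1-based notation; `forward_backward`, `pairwise_smoother`, and the toy numbers are hypothetical.

```python
import numpy as np

def forward_backward(pi, A, lik):
    """Unnormalized forward and backward variables (0-based time)."""
    T, K = lik.shape
    alpha, beta = np.zeros((T, K)), np.ones((T, K))
    alpha[0] = lik[0] * pi
    for t in range(1, T):
        alpha[t] = lik[t] * (A.T @ alpha[t - 1])
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    return alpha, beta

def pairwise_smoother(A, lik, alpha, beta):
    """xi[s, j, k] = alpha[s, j] * A[j, k] * lik[s+1, k] * beta[s+1, k] / p(x)."""
    evidence = alpha[-1].sum()
    left = alpha[:-1, :, None]                     # shape (T-1, K, 1)
    right = (lik[1:] * beta[1:])[:, None, :]       # shape (T-1, 1, K)
    return left * A[None] * right / evidence       # shape (T-1, K, K)

# Toy two-state model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 1, 1, 0])
lik = E[:, x].T

alpha, beta = forward_backward(pi, A, lik)
gamma = alpha * beta / alpha[-1].sum()
xi = pairwise_smoother(A, lik, alpha, beta)
```

Marginalizing `xi[s]` over its second index recovers `gamma[s]`, and marginalizing over its first index recovers `gamma[s + 1]`, which is a handy consistency check.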

To summarize:

Q(\theta, \theta^{\text{old}}) = \underbrace{\sum_{k} \gamma_1(k) \log \pi_k}_{Q_{\pi}} + \underbrace{\sum_{t=2}^{T} \sum_{j,k} \xi_t(j,k) \log A_{jk}}_{Q_{A}} + \underbrace{\sum_{t=1}^{T} \sum_{k} \gamma_t(k) \log p(x_t \mid z_t=k; B_k)}_{Q_{B}}

We have now expressed the expected complete-data log-likelihood in terms of the smoothed posteriors $\gamma_t$ and $\xi_t$, which completes the E-step.

Expectation-Maximization - M-step

Now we update the model parameters $\theta = (\pi, A, B)$. Since $Q$ is a sum of three terms, each depending on a distinct block of parameters, we can maximize each term independently. We start with the first term, $Q_{\pi}$.

Initial distribution.

Q_{\pi} = \sum_{k} \gamma_1(k) \log \pi_k \quad \text{subject to } \sum_{k} \pi_{k} = 1

Setting up the Lagrangian:

\begin{align*} \mathcal{L}(\pi, \lambda) &= \sum_{k=1}^{K} \gamma_{1}(k) \log \pi_{k} + \lambda \left( \sum_{k=1}^{K} \pi_{k} - 1 \right) \\ \frac{\partial \mathcal{L}}{\partial \pi_{k}} &= \frac{\gamma_{1}(k)}{\pi_{k}} + \lambda = 0 \\ \pi_{k} &= \frac{-\gamma_{1}(k)}{\lambda} \end{align*}

Now using the constraint:

\sum_{k=1}^{K} \pi_{k} = 1 \implies \sum_{k} \frac{-\gamma_{1}(k)}{\lambda} = 1

Because $\gamma_{1}(k) \equiv p(z_{1}=k \mid x_{1:T})$ is a probability distribution over $k$, we have $\sum_{k} \gamma_{1}(k) = 1$, and hence

\lambda = -1 \implies \boxed{\pi_{k}^{\text{new}} = \gamma_{1}(k)}

Now we take a brief detour to prove a classic result which can be reused.

Lemma. For weights $w_k \geq 0$ with $\sum_k w_k > 0$, the solution to $\max_p \sum_k w_k \log p_k$ subject to $\sum_k p_k = 1$, $p_k \geq 0$, is

p_k = \frac{w_k}{\sum_j w_j}

Proof. From the Lagrangian $\mathcal{L}(p, \lambda) = \sum_{k} w_{k} \log p_{k} + \lambda (\sum_{k} p_{k} - 1)$, take the derivative with respect to $p_{k}$ and set it to zero: $\frac{\partial \mathcal{L}}{\partial p_{k}} = \frac{w_{k}}{p_{k}} + \lambda = 0$, so $p_{k} = \frac{w_{k}}{-\lambda}$. Substituting into the constraint $\sum_k p_k = 1$ gives $-\lambda = \sum_{j} w_{j}$, and therefore $p_{k} = \frac{w_{k}}{\sum_{j} w_{j}}$. Since the objective is concave in $p$ and $w_k \geq 0$ guarantees $p_k \geq 0$, this stationary point is the constrained maximizer. $\square$
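A small numerical check can make the Lemma concrete. The sketch below compares the closed-form solution against random points on the probability simplex; the weights, the sample size, and the names `objective`, `p_star`, and `candidates` are arbitrary illustrative choices.

```python
import numpy as np

def objective(w, p):
    """Weighted log objective sum_k w_k log p_k."""
    return float(np.sum(w * np.log(p)))

w = np.array([2.0, 1.0, 1.0])          # hypothetical nonnegative weights
p_star = w / w.sum()                   # closed-form maximizer from the Lemma

# Random feasible points: Dirichlet samples lie on the probability simplex.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
best_candidate = max(objective(w, q) for q in candidates)
gap = objective(w, p_star) - best_candidate   # should be nonnegative
```

No random candidate should beat `p_star`, which is what the Lemma predicts. This does not replace the proof, but it is a cheap guard against sign or normalization mistakes.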

Transition matrix. The Lemma applies to $Q_{A}$, one row of $A$ at a time:

\begin{align*} Q_{A} &= \sum_{t=2}^{T} \sum_{j,k} \xi_t(j,k) \log A_{jk} \\ &= \sum_{j} \sum_{k} \sum_{t=2}^{T} \xi_t(j,k) \log A_{jk} \end{align*}

Considering a particular $j$ we'd like to maximize

\sum_{k} \sum_{t=2}^{T} \xi_t(j,k) \log A_{jk} \quad \text{s.t. } \sum_{k=1}^{K} A_{jk} = 1

Making use of the Lemma, we see this is maximized by

A_{jk} = \frac{\sum_{t=2}^{T} \xi_{t}(j,k)}{\sum_{i=1}^{K}\sum_{t=2}^{T} \xi_{t}(j,i)}

The denominator simplifies by marginalizing over the destination state:

\begin{align*} \sum_{i=1}^{K}\sum_{t=2}^{T} \xi_{t}(j,i) &= \sum_{t=2}^{T} \sum_{i=1}^{K} \xi_{t}(j,i) \\ &= \sum_{t=2}^{T} \sum_{i=1}^{K} p(z_{t}=i, z_{t-1}=j \mid x_{1:T}) \\ &= \sum_{t=2}^{T} p(z_{t-1}=j \mid x_{1:T}) \\ &= \sum_{t=2}^{T} \gamma_{t-1}(j) \end{align*}

Therefore, substituting this result into the denominator:

\boxed{A_{jk}^{\text{new}} = \frac{\sum_{t=2}^{T} \xi_{t}(j,k)}{\sum_{t=2}^{T} \gamma_{t-1}(j)}} \tag{6}

Remark. The updated transition probabilities are normalized expected ("soft") counts of transitions $j \to k$, with the expectations computed in the E-step.
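Update (6) amounts to summing the pairwise posteriors over time and normalizing each row. The sketch below uses a hand-made array `xi` of pairwise posteriors for $T = 4$ and $K = 2$; the numbers are hypothetical, chosen only so that consecutive marginals agree, and in practice they would come from the E-step.

```python
import numpy as np

# Hypothetical pairwise posteriors: xi[s, j, k] = p(z_{s+1} = j, z_{s+2} = k | x),
# one K x K table per transition, each summing to one.
xi = np.array([
    [[0.5, 0.1], [0.1, 0.3]],
    [[0.4, 0.2], [0.1, 0.3]],
    [[0.3, 0.2], [0.2, 0.3]],
])

N = xi.sum(axis=0)                          # expected transition counts N[j, k]
A_new = N / N.sum(axis=1, keepdims=True)    # row-normalize, equation (6)

# The row sums of N equal the summed source-state marginals gamma_{t-1}(j).
gamma_prev = xi.sum(axis=2)                 # gamma_{t-1}(j) for t = 2..T
denominator = gamma_prev.sum(axis=0)
```

Here `N.sum(axis=1)` and `denominator` agree, mirroring the simplification of the denominator in the derivation above.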

The final term to deal with is $Q_B$, which involves the emission parameters. The optimization depends on whether we choose the emission distribution to be categorical, Gaussian, or something else. For categorical emissions, we can follow the same approach as for $Q_{A}$ and $Q_{\pi}$ and apply the Lemma.

Let us also show the derivation for Gaussian emissions.

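Emission parameters (Gaussian). Take $B_k = (\mu_k, \sigma_k^2)$ with $p(x_t \mid z_t=k; B_k) = \mathcal{N}(x_t \mid \mu_k, \sigma_k^2)$. In the expressions below, the $\pi$ inside $\sqrt{2\pi}$ is the mathematical constant, not the initial distribution.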

\begin{align*} Q_{B} &= \sum_{t=1}^{T} \sum_{k} \gamma_{t}(k) \log p(x_{t} \mid z_{t}=k; B_{k}) \\ &= \sum_{t=1}^{T} \sum_{k} \gamma_{t}(k) \left[ -\frac{(x_t - \mu_k)^2}{2\sigma_k^2} - \log\left(\sigma_k \sqrt{2\pi}\right) \right] \\ &= \sum_{k} \sum_{t=1}^{T} \gamma_{t}(k) \left[ -\frac{(x_t - \mu_k)^2}{2\sigma_k^2} - \log\left(\sigma_k \sqrt{2\pi}\right) \right] \end{align*}

Let's start by finding $\mu_{k}$ :

\begin{align*} \frac{\partial Q_{B}}{\partial \mu_{k}} &= \sum_{t=1}^{T} \gamma_{t}(k) \left[ \frac{x_{t} - \mu_{k}}{\sigma_{k}^{2}} \right] = 0 \\ &\implies \sum_{t=1}^{T} \gamma_{t}(k) \left( x_{t} - \mu_{k}\right) = 0 \\ &\implies \boxed{\mu_{k}^{\text{new}} = \frac{\sum_{t=1}^{T} \gamma_{t}(k) x_{t}}{\sum_{t=1}^{T} \gamma_{t}(k)}} \tag{7} \end{align*}

Now, solving for the variance term:

\begin{align*} \frac{\partial Q_{B}}{\partial \sigma_{k}} &= \sum_{t=1}^{T} \gamma_{t}(k) \left[ \frac{(x_{t} - \mu_{k})^{2}}{\sigma_{k}^{3}} - \frac{1}{\sigma_{k}}\right] = 0 \\ &\implies \sigma_{k}^{2} \sum_{t=1}^{T} \gamma_{t}(k) = \sum_{t=1}^{T} \gamma_{t}(k) (x_{t} - \mu_{k})^{2} \\ &\implies \boxed{(\sigma_{k}^{2})^{\text{new}} = \frac{\sum_{t=1}^{T} \gamma_{t}(k) (x_{t} - \mu_{k}^{\text{new}})^{2}}{\sum_{t=1}^{T} \gamma_{t}(k)}} \tag{8} \end{align*}

The updates (7) and (8) are the standard Gaussian maximum likelihood estimates, except that each observation is weighted by the posterior probability that it was generated by state $k$.
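The weighted form of (7) and (8) is easy to verify numerically. In the sketch below, `gaussian_m_step` is a hypothetical helper and both responsibility matrices are made up. With hard (one-hot) responsibilities the updates reduce to the per-group sample mean and the biased sample variance.

```python
import numpy as np

def gaussian_m_step(x, gamma):
    """Weighted Gaussian updates (7) and (8).

    x:     observations, shape (T,)
    gamma: smoothing probabilities gamma[t, k], shape (T, K)
    """
    w = gamma.sum(axis=0)                               # soft counts per state
    mu = gamma.T @ x / w                                # equation (7)
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / w   # equation (8)
    return mu, var

x = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

# Hard assignments: first three points to state 0, last two to state 1.
hard = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
mu_hard, var_hard = gaussian_m_step(x, hard)

# Soft assignments (hypothetical posteriors): every point contributes to both states.
soft = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
mu_soft, var_soft = gaussian_m_step(x, soft)
```

With soft responsibilities, each mean is pulled toward the points of the other group in proportion to their posterior weight.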

EM Algorithm Summary (Vectorized)

Given current parameters $\theta^{\text{old}} = (\pi, A, B)$ , define the emission likelihood vector

\mathbf{b}_t = \begin{bmatrix} p(x_t \mid z_t=1; B_1) \\ \vdots \\ p(x_t \mid z_t=K; B_K) \end{bmatrix}

Let $\odot$ denote elementwise multiplication.

E-step. Compute the forward vectors $\bm{\alpha}_t$ and backward vectors $\bm{\beta}_t$ using

\bm{\alpha}_1 = \mathbf{b}_1 \odot \bm{\pi}, \qquad \bm{\alpha}_t = \mathbf{b}_t \odot \left(\mathbf{A}^{\top} \bm{\alpha}_{t-1}\right), \quad t=2,\dots,T \tag{9}

and

\bm{\beta}_T = \mathbf{1}, \qquad \bm{\beta}_t = \mathbf{A}\left(\mathbf{b}_{t+1} \odot \bm{\beta}_{t+1}\right), \quad t=T-1,\dots,1

Then compute the smoothing probabilities

\bm{\gamma}_t = \frac{\bm{\alpha}_t \odot \bm{\beta}_t}{\bm{\alpha}_t^{\top} \bm{\beta}_t}

For $t=2,\dots,T$ , compute the pairwise smoothing probabilities in matrix form

\bm{\Xi}_t = \frac{ \operatorname{diag}(\bm{\alpha}_{t-1})\, \mathbf{A}\, \operatorname{diag}(\mathbf{b}_t \odot \bm{\beta}_t) } { \bm{\alpha}_t^{\top}\bm{\beta}_t } \in \mathbb{R}^{K \times K}

where $(\bm{\Xi}_t)_{jk} = \xi_t(j,k)$ .

M-step. Update the parameters:

\bm{\pi}^{\text{new}} = \bm{\gamma}_1

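Then define the expected transition count matrix $\mathbf{N} = \sum_{t=2}^{T} \bm{\Xi}_t$, whose entry $N_{jk}$ is the expected number of transitions from state $j$ to state $k$, so that row $j$ collects the expected transitions out of $j$. The transition update normalizes each row of $\mathbf{N}$ to sum to one: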

\mathbf{A}^{\text{new}} = \text{row-normalize}\!\left(\sum_{t=2}^{T} \bm{\Xi}_t\right)

Then for the Gaussian parameters, define $\mathbf{x} \in \mathbb{R}^{T}$ as the vector of observations and $\Gamma \in \mathbb{R}^{T \times K}$ as the matrix whose $t$-th row is $\bm{\gamma}_t^{\top}$. With elementwise division:

\bm{\mu}^{\text{new}} = \frac{\Gamma^\top \mathbf{x}}{\Gamma^\top \mathbf{1}}

Define a residual matrix $\mathbf{R} \in \mathbb{R}^{T \times K}$ by

R_{tk} = x_t - \mu_k^{\text{new}}

Then the variance update is

(\bm{\sigma}^{2})^{\text{new}} = \frac{(\Gamma \odot \mathbf{R}^{\odot 2})^\top \mathbf{1}}{\Gamma^\top \mathbf{1}}

where $\mathbf{R}^{\odot 2}$ denotes elementwise squaring and division is elementwise.
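The vectorized summary can be assembled into a single EM iteration. The sketch below is a direct, unscaled transcription for scalar Gaussian emissions and one short sequence; `gaussian_lik`, `em_step`, the synthetic data, and the initial values are all illustrative assumptions. Because it works with unnormalized $\bm{\alpha}_t$ and $\bm{\beta}_t$, it is only suitable for short sequences where underflow is not a concern.

```python
import numpy as np

def gaussian_lik(x, mu, var):
    """lik[t, k] = N(x_t | mu_k, var_k); row t is the vector b_t."""
    return np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(x, pi, A, mu, var):
    """One unscaled EM iteration; also returns log p(x | old parameters)."""
    T, K = len(x), len(pi)
    b = gaussian_lik(x, mu, var)
    # E-step: forward and backward passes.
    alpha, beta = np.zeros((T, K)), np.ones((T, K))
    alpha[0] = b[0] * pi
    for t in range(1, T):
        alpha[t] = b[t] * (A.T @ alpha[t - 1])
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    evidence = alpha[-1].sum()                 # equals alpha_t^T beta_t for every t
    gamma = alpha * beta / evidence            # rows are gamma_t
    # Expected transition counts: N = sum_t Xi_t, computed without storing each Xi_t.
    N = A * (alpha[:-1].T @ (b[1:] * beta[1:])) / evidence
    # M-step.
    pi_new = gamma[0]
    A_new = N / N.sum(axis=1, keepdims=True)   # row-normalize
    w = gamma.sum(axis=0)                      # Gamma^T 1
    mu_new = gamma.T @ x / w
    R = x[:, None] - mu_new                    # residual matrix
    var_new = (gamma * R ** 2).sum(axis=0) / w
    return pi_new, A_new, mu_new, var_new, np.log(evidence)

# Synthetic data: 30 draws around 0 followed by 30 draws around 4 (assumed setup).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(4.0, 1.0, 30)])

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
mu, var = np.array([1.0, 3.0]), np.array([1.0, 1.0])

log_liks = []
for _ in range(10):
    pi, A, mu, var, ll = em_step(x, pi, A, mu, var)
    log_liks.append(ll)
```

The recorded `log_liks` should be non-decreasing up to floating-point error, which is the standard monotonicity property of EM and a practical check that the E-step and M-step agree with each other.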