Motif detection using a Hidden Markov Model

Introduction

These notes started as an adaptation of Rabiner’s tutorial on HMMs to profile HMMs.

Problem description

We have a sequence of length \(n\) over some alphabet \(\pmb{A}\). This sequence is generated by a random process, so the sequence is a random variable \(\pmb{X}\):

\[\pmb{X} = X_1X_2...X_n\]

As said before, the range of each \(X_i\) is the alphabet:

\[Val(X_i) = \pmb{A}\]

We will call each element of \(\pmb{A}\) a symbol. The size of the alphabet is \(|\pmb{A}|\).

We assume that each \(X_i\) is either part of some repeating motif of length \(W\) or just background noise. When considering repeating motifs we allow gaps and deletions in the motif. For example, consider the following sequences made of the letters A,B,C,D,E (spaces added to aid the reader):

  • Repeating motif

    ABCDE ABCDE ABCDE ABCDE ABCDE

  • With some background noise between motif repetitions

    ABCDE AB ABCDE BB ABCDE C ABCDE CCCC ABCDE

  • With deletions

    ABDE AB ABCDE BB BCDE C ABCDE CCCC ABDE

  • With gaps

    ABAACDE ABCDE A ACCCBCDE C ABCDE ABCE
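As a concrete illustration, here is a minimal sketch that produces strings in the spirit of the examples above. The motif, the alphabet and the probabilities of dropping a symbol or inserting noise are arbitrary choices for this illustration, not parameters of the model defined below:

import random

def noisy_repeats(motif="ABCDE", alphabet="ABCDE", n_repeats=5,
                  p_delete=0.1, p_noise=0.4, max_noise=4, seed=0):
    """Generate repeated motifs with random deletions and background noise."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_repeats):
        # Copy the motif, dropping each symbol with probability p_delete.
        out.extend(c for c in motif if rng.random() > p_delete)
        # Occasionally insert a short run of background symbols between repetitions.
        if rng.random() < p_noise:
            out.extend(rng.choice(alphabet) for _ in range(rng.randint(1, max_noise)))
    return "".join(out)

print(noisy_repeats())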

Model

To account for background noise, deletions and gaps we model the generation of the sequence with a Profile Hidden Markov Model (PHMM). In a PHMM the hidden process generating the sequence changes between states according to the following state diagram, where each node represents a state and the process jumps randomly from one state to another following the arrows:

There are three types of states:

  • Background states

    \[\pmb{S}^B=\left\{b_1,...,b_W\right\}\]
  • Motif states

    \[\pmb{S}^M=\left\{m_1,...,m_W\right\}\]
  • Delete states

    \[\pmb{S}^D=\left\{d_1,...,d_W\right\}\]

When the process jumps to a background state it emits a letter of the sequence according to the background distribution. We assume that all background states follow the same categorical distribution where each symbol \(a \in \pmb{A}\) is emitted with probability \(f^B_a\).

When the process jumps to a motif state it emits a letter of the sequence according to the motif distribution. We assume that the emission probability depends on which motif state we are in, and so if we are in state \(m_j\) each symbol \(a\) is emitted with probability \(f^M_{ja}\).

When the process jumps to a deletion state it emits nothing. These are silent states introduced only for the purpose of allowing jumps between motifs \(m_j\) and \(m_k\) for \(k>j+1\).

Notice that the state diagram has loops. This is because it models the motif, which is allowed to repeat.

The above diagram is incomplete if we don’t specify the probabilities of state transitions. The following table gives them for direct transitions between states:

                        To state
                \(b_j\)      \(b_{j+1}\)   \(m_{j+1}\)     \(d_{j+1}\)
From state
  \(b_j\)       \(b(j)\)     0             \(1 - b(j)\)    0
  \(m_j\)       0            \(m_b\)       \(m\)           \(m_d\)
  \(d_j\)       0            0             \(1 - d\)       \(d\)

To avoid introducing too many parameters we have assumed that the transition probabilities are stationary, the only exception being:

\[b(j) = b_{out}[j=W] + b_{in}[j\neq W]\]

Now let’s define the random variable \(Z_i\) as the state of the process when emitting symbol \(X_i\). Then the range of \(Z_i\) is:

\[\pmb{S} = Val(Z_i) = \pmb{S}^B \cup \pmb{S}^M\]

With these variables we have the familiar HMM. Notice that although similar to the graph of the PHMM, this one has a different meaning: the former was a state diagram and this one is a Bayesian network. Each node is now a random variable instead of a concrete state:

We can compute the transition probabilities between the emitting states \(Z_i\). For this we must consider every possible path that connects the two states, compute the probability of each path and sum all the probabilities. To make this more concrete consider the case of motif width \(W=6\), starting from state \(m_2\):

We can see the pattern in the above formulas. If we consider the path going from \(m_j\) to \(m_k\) the number of delete states we must cross is:

\[L_{jk} = (k - j - 2) \mod W\]

Then the probability of the path is:

\[m_dd^{L_{jk}}(1-d) + [k=j+1]m\]

Where \([k=j+1]\) is the Iverson bracket, which takes the value 1 only when the condition inside the brackets is true, and zero otherwise.

Notice that given the value of \(j\) and \(L_{jk}\) we can recover \(k\) as:

\[k = 1 + (j + L_{jk} + 1) \mod W\]

To compute the final probability we need to take into account the cases where we make several loops over all the delete states. Each loop has probability \(d^W\), so \(l\) loops have probability \((d^W)^l\). The following figure shows the possible paths between two states depending on the number of loops:

The final probability is then:

\[P\left(Z_{i+1}=m_k|Z_i=m_j\right) = m_dd^{L_{jk}}\left(\sum_{l=0}^\infty(d^W)^l\right)(1-d) + [k=j+1]m\]

Summing the geometric series we get the final transition probabilities when starting from state \(m_j\):

\[\begin{split}P\left(Z_{i+1}=b_{j+1}|Z_i=m_j\right) &= m_b \\ P\left(Z_{i+1}=m_k|Z_i=m_j\right) &= m_dF(L_{jk}, d) + [k=j+1]m \\ F(l, d) &= d^l\frac{1 - d}{1 - d^W}\end{split}\]
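As a sanity check, here is a small sketch of these formulas: it builds the row of motif-to-motif transition probabilities for a given \(j\) and verifies that, together with \(m_b\), the row sums to one. The parameter values are arbitrary placeholders:

import numpy as np

def motif_transition_row(j, W, m_b, m, m_d, d):
    """P(Z_{i+1} = m_k | Z_i = m_j) for k = 1..W, plus P(Z_{i+1} = b_{j+1} | Z_i = m_j)."""
    F = lambda l: d**l * (1 - d) / (1 - d**W)
    row = np.zeros(W)
    for k in range(1, W + 1):
        L = (k - j - 2) % W                        # number of delete states crossed
        # [k = j+1] term, wrapping to m_1 when j = W
        row[k - 1] = m_d * F(L) + (m if k == j % W + 1 else 0.0)
    return row, m_b

W, m_b, m, m_d, d = 6, 0.2, 0.5, 0.3, 0.4          # placeholders with m_b + m + m_d = 1
row, to_background = motif_transition_row(2, W, m_b, m, m_d, d)
print(row.sum() + to_background)                   # -> 1.0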

The behavior of \(F(l, d)\) near the boundaries is:

\[\begin{split}\underset{d \to 0}{\lim} F &= [l=0] \\ \underset{d \to 1}{\lim} F &= 1/W\end{split}\]

The following figure shows a plot for \(W=6\):

We can see in the above graph that lower values of \(d\) concentrate the probabilities on the state two steps ahead, and higher values spread the probabilities more evenly on all the states.

The transitions starting from a background state are trivial:

\[\begin{split}P\left(Z_{i+1}=b_j|Z_i=b_j\right) &= b(j) \\ P\left(Z_{i+1}=m_j|Z_i=b_j\right) &= 1 - b(j)\end{split}\]

The emission probabilities are:

\[\begin{split}P\left(X_i=a|Z_i=b_j\right) &= f^B_a \\ P\left(X_i=a|Z_i=m_j\right) &= f^M_{ja}\end{split}\]
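Putting the transition and emission probabilities together, the following sketch builds the full transition matrix over the \(2W\) emitting states and the corresponding emission matrix. The state ordering \(b_1,\dots,b_W,m_1,\dots,m_W\), the wrap-around of indices at \(j=W\) and the parameter values are choices of this sketch:

import numpy as np

def build_transition_matrix(W, b_in, b_out, m_b, m, m_d, d):
    """Transition matrix over the 2W emitting states, ordered b_1..b_W, m_1..m_W."""
    pT = np.zeros((2 * W, 2 * W))
    F = lambda l: d**l * (1 - d) / (1 - d**W)
    for j in range(1, W + 1):
        bj = b_out if j == W else b_in             # b(j) as defined above
        pT[j - 1, j - 1] = bj                      # b_j -> b_j
        pT[j - 1, W + j - 1] = 1 - bj              # b_j -> m_j
        pT[W + j - 1, j % W] = m_b                 # m_j -> b_{j+1}, wrapping to b_1
        for k in range(1, W + 1):
            L = (k - j - 2) % W
            pT[W + j - 1, W + k - 1] = m_d * F(L) + (m if k == j % W + 1 else 0.0)
    return pT

def build_emission_matrix(fB, fM):
    """Emission probabilities: rows are states (b_1..b_W, m_1..m_W), columns symbols."""
    W = fM.shape[0]
    return np.vstack([np.tile(fB, (W, 1)), fM])

W, A = 6, 5
rng = np.random.default_rng(0)
fB = np.full(A, 1.0 / A)                           # uniform background, placeholder
fM = rng.dirichlet(np.ones(A), size=W)             # random motif emissions, placeholder
pT = build_transition_matrix(W, b_in=0.7, b_out=0.9, m_b=0.2, m=0.5, m_d=0.3, d=0.4)
pE = build_emission_matrix(fB, fM)
assert np.allclose(pT.sum(axis=1), 1) and np.allclose(pE.sum(axis=1), 1)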

Manual annotations

We extend the model with a random variable \(U_i\) for each \(Z_i\):

These random variables are direct observations on the state variables that flag whether or not a hidden state is part of a motif:

\[U_i = [Z_i \in \pmb{S}^M]\]

We can use this to aid in detecting motifs. For example, a user can manually annotate where motifs are by specifying the value of \(U_i\).

Parameter estimation

Expectation-Maximization (EM)

For convenience let’s aggregate all the model parameters into a single vector:

\[\begin{split}\pmb{\theta} &= \left(\pmb{t}, \pmb{f}^B, \pmb{f}^M\right) \\ \pmb{t} &= \left(\pmb{b}, \pmb{t}^M\right) \\ \pmb{b} &= (b_{in}, b_{out}) \\ \pmb{t}^M &= \left(d, m_d, m, m_b\right)\end{split}\]

Using EM, each step takes the following form, where \(\pmb{\theta}^0\) and \(\pmb{\theta}^1\) are the current and next estimates of the parameters respectively:

\[\begin{split}\pmb{\theta}^1 &= \underset{\pmb{\theta}}{\arg\max} Q(\pmb{\theta}|\pmb{\theta}^0) \\ Q(\pmb{\theta}|\pmb{\theta}^0) &= \log P(\pmb{\theta}) + E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right]\end{split}\]

\(\pmb{x}^D\) is the training data, a particular realization of \(\pmb{X}\).

\(P(\pmb{\theta})\) are the prior probabilities on the parameters. We are going to consider priors only on \(\pmb{f}^M\).

EM on a HMM

Taking into account that in a HMM the joint probability distribution factors as:

\[P(\pmb{X}, \pmb{Z}|\pmb{\theta}) = \prod_{i=1}^nP(Z_i|Z_{i-1}, \pmb{\theta})P(X_i|Z_i, \pmb{\theta})\]

We expand the expectation in the EM step as:

\[E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = \sum_{\pmb{z}}P(\pmb{z}|\pmb{x}^D,\pmb{\theta}^0)\sum_{i=1}^n \left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right]\]

Interchanging the summations:

\[E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = \sum_{i=1}^n\sum_{\pmb{z}}P(\pmb{z}|\pmb{x}^D,\pmb{\theta}^0) \left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right]\]

Defining the set \(\pmb{C}_i=\left\{Z_{i-1}, Z_i\right\}\) we can always do:

\[P(\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0) = P(\pmb{Z} - \pmb{C}_i|\pmb{C}_i, \pmb{x}^D, \pmb{\theta}^0)P(\pmb{C}_i|\pmb{x}^D,\pmb{\theta}^0)\]

Now the summation over \(\pmb{Z}\) can be decomposed as:

\[E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = \sum_{i=1}^n \sum_{\pmb{c}_i}P(\pmb{c}_i|\pmb{x}^D, \pmb{\theta}^0)\left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right] \left(\sum_{\pmb{z} - \pmb{c}_i}P(\pmb{z} - \pmb{c}_i|\pmb{c}_i, \pmb{x}^D,\pmb{\theta}^0)\right)\]

Of course:

\[\sum_{\pmb{z} - \pmb{c}_i}P(\pmb{z} - \pmb{c}_i|\pmb{c}_i, \pmb{x}^D,\pmb{\theta}^0) = 1\]

And so we finally get:

\[E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = \sum_{i=1}^n \sum_{z_{i-1},z_i}P(z_{i-1},z_i|\pmb{x}^D, \pmb{\theta}^0)\left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right]\]

Computing \(P(Z_{i-1}, Z_i|\pmb{X},\pmb{\theta})\)

To unclutter a little let’s call:

\[\begin{split}\xi_i(s_1, s_2) &= P(Z_{i-1}=s_1, Z_i=s_2|\pmb{x}^D, \pmb{\theta}^0) \\ \gamma_i(s) &= P(Z_i=s|\pmb{x}^D, \pmb{\theta}^0)\end{split}\]

Notice that:

\[\gamma_i(s) = \sum_{s_1} \xi_i(s_1, s)\]

And so we just need to compute \(\xi_i\). To compute it we first apply Bayes’ theorem:

\[P(Z_{i-1}, Z_i|\pmb{X},\pmb{\theta}^0) = \frac{P(\pmb{X}|Z_{i-1},Z_i,\pmb{\theta}^0)P(Z_{i-1}, Z_i|\pmb{\theta}^0)}{P(\pmb{X}|\pmb{\theta}^0)}\]

Now, thanks to the Markov structure of the probabilities we can factor things:

\[\begin{split}P(\pmb{X}|Z_{i-1},Z_i,\pmb{\theta}^0) &= P(X_1...X_{i-1}|Z_{i-1},\pmb{\theta}^0)P(X_i...X_n|Z_i,\pmb{\theta}^0) \\ P(Z_{i-1},Z_i|\pmb{\theta}^0) &= P(Z_{i-1}|\pmb{\theta}^0)P(Z_i|Z_{i-1},\pmb{\theta}^0)\end{split}\]

Renaming things a bit again:

\[\begin{split}\alpha_i(s) &= P(X_1...X_i,Z_i=s|\pmb{\theta}^0) \\ \beta_i(s) &= P(X_i...X_n|Z_i=s,\pmb{\theta}^0)\end{split}\]

We get that:

\[\begin{split}\tilde{\xi}_i(s_1, s_2) &= \alpha_{i-1}(s_1)\beta_i(s_2)P(Z_i=s_2|Z_{i-1}=s_1,\pmb{\theta}^0) \\ \xi_i(s_1, s_2) &= \frac{\tilde{\xi}_i(s_1, s_2)}{\sum_{s_1,s_2}\tilde{\xi}_i(s_1, s_2)}\end{split}\]

We can compute the first values of \(\alpha\) to see the pattern; in general:

\[\begin{split}\alpha_1(s) &= P(Z_1=s)P(X_1|Z_1=s) \\ \alpha_i(s_2) &= \left[\sum_{s_1}\alpha_{i-1}(s_1)P(Z_i=s_2|Z_{i-1}=s_1)\right]P(X_i|Z_i=s_2)\end{split}\]

Following the same process but starting from the end of the HMM we get:

\[\begin{split}\beta_n(s) &= P(X_n|Z_n=s) \\ \beta_{i-1}(s_1) &= \left[\sum_{s_2}\beta_i(s_2)P(Z_i=s_2|Z_{i-1}=s_1)\right]P(X_{i-1}|Z_{i-1}=s_1)\end{split}\]
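A sketch of these recursions, rescaling each column to avoid numerical underflow (the rescaling cancels once \(\gamma\) and \(\xi\) are normalized). The arrays pT and pE follow the earlier sketch, x is the sequence encoded as symbol indices, and the uniform initial distribution \(P(Z_1)\) is an assumption of this sketch:

import numpy as np

def forward_backward(x, pT, pE, p0=None):
    """alpha[s, i] and beta[s, i] as defined above, rescaled at every position."""
    S, n = pT.shape[0], len(x)
    p0 = np.full(S, 1.0 / S) if p0 is None else p0    # P(Z_1), assumed uniform here
    alpha, beta = np.zeros((S, n)), np.zeros((S, n))
    alpha[:, 0] = p0 * pE[:, x[0]]
    alpha[:, 0] /= alpha[:, 0].sum()
    for i in range(1, n):
        alpha[:, i] = (pT.T @ alpha[:, i - 1]) * pE[:, x[i]]
        alpha[:, i] /= alpha[:, i].sum()
    beta[:, n - 1] = pE[:, x[n - 1]]
    beta[:, n - 1] /= beta[:, n - 1].sum()
    for i in range(n - 1, 0, -1):
        beta[:, i - 1] = (pT @ beta[:, i]) * pE[:, x[i - 1]]
        beta[:, i - 1] /= beta[:, i - 1].sum()
    return alpha, beta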

We can take into account at this step any information the user has provided about the states. If the value of \(U_i\) is known, we set:

\[\begin{split}U_i = 0 &\implies \alpha_i(s) = 0 \quad \forall s \in \pmb{S}^M \\ U_i = 1 &\implies \alpha_i(s) = 0 \quad \forall s \in \pmb{S}^B\end{split}\]

And of course renormalize:

\[\alpha_i(s) = \frac{\alpha_i(s)}{\sum_s \alpha_i(s)}\]

We do the same with \(\beta\).
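For completeness, a sketch of how an annotation could be applied to one column of \(\alpha\) (and likewise of \(\beta\)). The encoding of the annotation and the state ordering are assumptions of this sketch:

import numpy as np

def apply_annotation(column, u_i, W):
    """Zero out the states incompatible with the annotation U_i and renormalize."""
    col = column.copy()
    if u_i == 0:            # annotated as background: motif states are impossible
        col[W:] = 0.0
    elif u_i == 1:          # annotated as motif: background states are impossible
        col[:W] = 0.0
    return col / col.sum()  # unannotated positions (any other value) are left untouched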

Re-estimation equations

We now separate the expectation into two independent parts:

\[\begin{split}E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] &= E^1(\pmb{f}^M, \pmb{f}^B) + E^2(\pmb{t}) \\ E^1(\pmb{f}^M, \pmb{f}^B) &= \sum_{i=1}^n \sum_{s \in \pmb{S}}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^M, \pmb{f}^B) \\ E^2(\pmb{t}) &= \sum_{i=1}^n \sum_{s_1,s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t})\end{split}\]

We can continue splitting into independent parts:

\[\begin{split}E^1(\pmb{f}^M, \pmb{f}^B) &= E^{1B}(\pmb{f}^B) + E^{1M}(\pmb{f}^M) \\ E^{1B}(\pmb{f}^B) &= \sum_{i=1}^n \sum_{s \in \pmb{S}^B}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^B) \\ E^{1M}(\pmb{f}^M) &= \sum_{i=1}^n \sum_{s \in \pmb{S}^M}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^M)\end{split}\]

And also:

\[\begin{split}E^2(\pmb{t}) &= E^{2B}(\pmb{b}) + E^{2M}(\pmb{t}^M) \\ E^{2B}(\pmb{b}) &= \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^B}\sum_{s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{b}) \\ E^{2M}(\pmb{t}^M) &= \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^M}\sum_{s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t}^M)\end{split}\]

We now have four independent maximization problems:

\[\begin{split}\left(\pmb{f}^B\right)^1 &= \underset{\pmb{f}^B}{\arg \max}E^{1B}\\ \left(\pmb{f}^M\right)^1& = \underset{\pmb{f}^M}{\arg \max}\left\{\log P(\pmb{f}^M) + E^{1M}\right\} \\ \pmb{b}^1 &= \underset{\pmb{b}}{\arg \max} E^{2B} \\ \left(\pmb{t}^M\right)^1 &= \underset{\pmb{t}^M}{\arg \max} E^{2M}\end{split}\]

Estimation of \(\pmb{f}^B\)

We need to solve:

\[\left(\pmb{f}^B\right)^1 = \underset{\pmb{f}^B}{\arg \max}E^{1B}\]

With the constraint:

\[\begin{split}g_B &= \sum_{a \in \pmb{A}} f^B_a - 1 = 0\end{split}\]

We enforce the constraint using a Lagrange multiplier and take derivatives:

\[\frac{\partial}{\partial \pmb{f}^B,\lambda_B}\left\{E^{1B} - \lambda_Bg_B\right\} = 0\]

From this we get the closed form solution:

\[\begin{split}\tilde{f}^B_a &= \sum_{i=1}^n[x_i^D=a]\sum_{j=1}^W \gamma_i(b_j) \\ f^B_a &= \frac{\tilde{f}^B_a}{\sum_{a \in \pmb{A}}\tilde{f}^B_a}\end{split}\]
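A sketch of this update, assuming gamma[s, i] holds \(\gamma_i(s)\) with the state ordering used in the earlier sketches and x is the sequence encoded as symbol indices:

import numpy as np

def update_background_emissions(x, gamma, W, A):
    """Closed-form re-estimation of f^B from the state posteriors."""
    x = np.asarray(x)
    background_mass = gamma[:W, :].sum(axis=0)    # sum_j gamma_i(b_j) for every position i
    fB = np.array([background_mass[x == a].sum() for a in range(A)])
    return fB / fB.sum()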

Estimation of \(\pmb{f}^M\)

We need to solve:

\[\begin{split}\left(\pmb{f}^M\right)^1& = \underset{\pmb{f}^M}{\arg \max}\left\{\log P(\pmb{f}^M) + E^{1M}\right\}\end{split}\]

With the constraint:

\[\begin{split}g_j &= \sum_{a \in \pmb{A}} f^M_{ja} - 1 = 0\end{split}\]

We enforce the constraints using Lagrange multipliers and take derivatives:

\[\frac{\partial}{\partial \pmb{f}^M,\lambda_j}\left\{\log P(\pmb{f}^M) + E^{1M} - \lambda_jg_j\right\} = 0\]

We use a Dirichlet distribution over the symbols of each \(\pmb{f}^M_j\). The log-probability of the prior is then:

\[\log P(\pmb{f}^M_j|\pmb{\varepsilon}) = -\log B(\pmb{\varepsilon}) + \sum_{a \in \pmb{A}}(\varepsilon_a - 1)\log f^M_{ja}\]

And the derivative:

\[\frac{\partial}{\partial f^M_{ja}}\log P(\pmb{f}^M_j|\pmb{\varepsilon}) = \frac{\varepsilon_a - 1}{f^M_{ja}}\]

From this we get the closed form solution:

\[\begin{split}\tilde{f}^M_{ja} &= \sum_{i=1}^n[x_i^D=a]\gamma_i(m_j) \\ f^M_{ja} &= \frac{ \varepsilon_a - 1 + \tilde{f}^M_{ja}}{\varepsilon_0 - |\pmb{A}| + \sum_{a \in \pmb{A}}\tilde{f}^M_{ja}}\end{split}\]

Where \(\varepsilon_0\) is the concentration parameter:

\[\varepsilon_0 = \sum_{a \in \pmb{A}}\varepsilon_a\]
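A sketch of the corresponding update, assuming eps holds the vector of \(\varepsilon_a\) and gamma and x are as in the previous sketch:

import numpy as np

def update_motif_emissions(x, gamma, W, A, eps):
    """MAP re-estimation of f^M under a Dirichlet(eps) prior on each row."""
    x = np.asarray(x)
    counts = np.zeros((W, A))
    for a in range(A):
        counts[:, a] = gamma[W:, x == a].sum(axis=1)   # sum_i [x_i = a] gamma_i(m_j)
    # Denominator makes each row sum to one: eps_0 - |A| + total expected counts.
    return (eps - 1.0 + counts) / (eps.sum() - A + counts.sum(axis=1, keepdims=True))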

Estimation of \(\pmb{b}\)

We need to solve:

\[\pmb{b}^1 = \underset{\pmb{b}}{\arg \max} E^{2B}\]

Taking derivatives we get:

\[\begin{split}b_{in}^1 &= \frac{\sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, b_j)}{ \sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, b_j) + \sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, m_j)} \\ b_{out}^1 &= \frac{\sum_{i=1}^n\xi_i(b_1, b_1)}{ \sum_{i=1}^n \xi_i(b_1, b_1) + \sum_{i=1}^n \xi_i(b_1, m_1)}\end{split}\]
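A sketch of these updates, assuming xi[s, t, i] holds \(\xi_i(s, t)\) with the state ordering used in the earlier sketches:

import numpy as np

def update_background_transitions(xi, W):
    """Re-estimation of b_in and b_out from the pairwise posteriors."""
    idx = np.arange(1, W)                                          # background states b_2..b_W
    stay_in, leave_in = xi[idx, idx, :].sum(), xi[idx, W + idx, :].sum()
    stay_out, leave_out = xi[0, 0, :].sum(), xi[0, W, :].sum()     # b_1 -> b_1 and b_1 -> m_1
    return stay_in / (stay_in + leave_in), stay_out / (stay_out + leave_out)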

Estimation of \(\pmb{t}^M\)

We need to solve:

\[\begin{split}\left(\pmb{t}^M\right)^1 &= \underset{\pmb{t}^M}{\arg \max} E^{2M}\end{split}\]

Subject to the constraint:

\[g_m = m_b + m + m_d - 1 = 0\]

Trying to solve the above problem using Lagrange multipliers results in a set of highly non-linear equations in \(d\), and since the rest of the parameters are coupled to \(d\) either directly or through the constraint, it is not possible to find a closed form solution. Because of this we drop the Lagrange multipliers and solve the maximization problem numerically.

Let’s recap the shape of our problem:

\[\begin{split}E^{2M}(\pmb{t}^M) &= \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^M}\sum_{s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t}^M) \\\end{split}\]
\[\begin{split}P\left(Z_i=b_{j+1}|Z_{i-1}=m_j\right) &= m_b \\ P\left(Z_i=m_k|Z_{i-1}=m_j\right) &= m_dF(L_{jk}, d) + [k=j+1]m \\ F(l, d) &= d^l\frac{1 - d}{1 - d^W} \\ L_{jk} &= (k - j - 2) \mod W \\ K_{jl} &= 1 + \left[(j + l + 1) \mod W\right]\end{split}\]

We rewrite the summation as:

\[\begin{split}E^{2M}(\pmb{t}^M) &= \sum_{i=1}^n\sum_{j=1}^W \left\{ \xi_i(m_j, b_{j+1})\log P(b_{j+1}|m_j, \pmb{t}^M) + \sum_{l=0}^{W-1}\xi_i(m_j,m_{K_{jl}})\log P(m_{K_{jl}}|m_j, \pmb{t}^M) \right\} \\ &= \sum_{i=1}^n\sum_{j=1}^W \left\{ \xi_i(m_j, b_{j+1})\log m_b + \sum_{l=0}^{W-1}\xi_i(m_j,m_{K_{jl}}) \log \left[m_dF(l, d) + [l=W-1]m \right] \right\} \\ &= \left(\sum_{i=1}^n\sum_{j=1}^W \xi_i(m_j, b_{j+1})\right)\log m_b + \\ & \quad \sum_{l=0}^{W-1}\left(\sum_{i=1}^n\sum_{j=1}^W\xi_i(m_j, m_{K_{jl}})\right)\log \left[m_d F(l,d) + [l=W-1]m\right]\end{split}\]
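One way to carry out the numerical maximization is to remove the constraints by reparameterizing \((m_b, m, m_d)\) with a softmax and \(d\) with a sigmoid, and hand the objective to a generic optimizer. The use of scipy.optimize.minimize and the aggregation of the \(\xi\) values into the counts c_b and c_l are choices of this sketch:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, softmax

def estimate_motif_transitions(c_b, c_l, W, x0=None):
    """Numerically maximize E^2M.

    c_b is the total xi mass on motif -> background transitions and
    c_l[l] the total xi mass on motif -> motif transitions crossing l delete states.
    """
    def neg_E2M(params):
        m_b, m, m_d = softmax(params[:3])                # enforces m_b + m + m_d = 1
        d = np.clip(expit(params[3]), 1e-6, 1 - 1e-6)    # keeps d inside (0, 1)
        F = d ** np.arange(W) * (1 - d) / (1 - d ** W)
        log_motif = np.log(m_d * F + m * (np.arange(W) == W - 1))
        return -(c_b * np.log(m_b) + np.dot(c_l, log_motif))
    res = minimize(neg_E2M, np.zeros(4) if x0 is None else x0, method="Nelder-Mead")
    m_b, m, m_d = softmax(res.x[:3])
    return m_b, m, m_d, expit(res.x[3])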

Since we are going to solve the problem numerically it is possible that we only reach a local maximum. It is enough, however, to improve on the current estimate at each step.

Time Complexity

The time complexity of the parameter estimation problem is dominated by the computation of \(\alpha\), \(\beta\) and \(\xi\). We have a triple loop of this form:

# xi is only defined for i >= 1 (it needs Z_{i-1}), hence the loop start
for i in range(1, n):
    for s in range(2*W):
        for t in range(2*W):
            xi[s, t, i] = alpha[s, i - 1]*beta[t, i]*pT[s, t]

In the above code the math symbols have been rewritten as numpy arrays:

Math Python
\(\xi_i(s, t)\) xi[s, t, i]
\(\alpha_i(s)\) alpha[s, i]
\(\beta_i(s)\) beta[s, i]
\(P(Z_{i+1}=t|Z_i=s)\) pT[s, t]

Time complexity is therefore \(O(nW^2)\). We can speed up the above triple loop by noticing that a lot of entries in the pT matrix are zeros:

Taking only the non-zero entries into account we could move complexity from \(4nW^2\) to \(nW(W+3)\), which would divide time almost by a factor of 4.
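A sketch of how the \(\xi\) update could exploit those zeros: each background state has 2 possible successors and each motif state \(W+1\), which is where the \(nW(W+3)\) count comes from. The successor lists below follow the transition probabilities derived earlier, with the same state ordering as the previous sketches:

import numpy as np

def xi_sparse(alpha, beta, pT, W, n):
    """Accumulate the unnormalized xi only over the non-zero transitions."""
    xi = np.zeros((2 * W, 2 * W, n))
    successors = {s: ([s, W + s] if s < W else                      # b_j -> b_j, m_j
                      [(s - W + 1) % W] + list(range(W, 2 * W)))    # m_j -> b_{j+1}, all m_k
                  for s in range(2 * W)}
    for i in range(1, n):
        for s, targets in successors.items():
            for t in targets:
                xi[s, t, i] = alpha[s, i - 1] * beta[t, i] * pT[s, t]
        xi[:, :, i] /= xi[:, :, i].sum()
    return xi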

The reason the lower right part of the matrix (the transition probabilities between motif states) is full is that the possibility of deleting motif states means we can jump from any motif state to any other motif state. If, on the other hand, we didn’t allow deletions, the state transition diagram would look as follows:

Transition probabilities from the background states would remain unchanged and transition probabilities from the motif states would simply be:

\[\begin{split}P\left(Z_{i+1}=b_{j+1}|Z_i=m_j\right) &= m_b \\ P\left(Z_{i+1}=m_{j+1}|Z_i=m_j\right) &= m\end{split}\]

Now the matrix of transition probabilities has the following non-zero entries:

And time complexity is just \(4nW\). This simplified model is actually a very good approximation of the more complex one. The following plot shows how the values of the full model are distributed inside the transition matrix for typical parameters:

To see how fast probabilities decay away from the diagonal consider the following plot showing the values around state 60 (motif state 20):

We can compromise between the fully diagonal model and the full model by retaining only the elements up to \(w\) steps away from the diagonal. The transition matrix would look like:

Time complexity would then be \(nW(3 + w)\). Let’s call \(\tilde{P}(\pmb{X}, \pmb{Z})\) the probability distribution of the sequence under this new approximate transition matrix. We can get a modified EM algorithm:

\[\begin{split}\log \left[\sum_{\pmb{Z}} P(\pmb{X}, \pmb{Z})\right] &= \log \left[\sum_{\pmb{Z}} \tilde{P}(\pmb{Z}|\pmb{X}) \frac{P(\pmb{X}, \pmb{Z})}{\tilde{P}(\pmb{Z}|\pmb{X})}\right] \\ & \geq \sum_{\pmb{Z}}\tilde{P}(\pmb{Z}|\pmb{X})\log\frac{P(\pmb{X}, \pmb{Z})}{\tilde{P}(\pmb{Z}|\pmb{X})}\end{split}\]

The EM algorithm proceeds as usual but with a modified function to maximize:

\[Q(\pmb{\theta}|\pmb{\theta}^0) = \log P(\pmb{\theta}) + E_{\tilde{P}\left(\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0\right)}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right]\]

The only modification to the forward-backward algorithm is that there are now more zeros in the transition matrix. This takes us halfway to a variational approach like the one used in Factorial hidden Markov models.