In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,
Read more →