Research Article  Open Access
Francesco Cartella, Jan Lemeire, Luca Dimiccoli, Hichem Sahli, "Hidden SemiMarkov Models for Predictive Maintenance", Mathematical Problems in Engineering, vol. 2015, Article ID 278120, 23 pages, 2015. https://doi.org/10.1155/2015/278120
Hidden SemiMarkov Models for Predictive Maintenance
Abstract
Realistic predictive maintenance approaches are essential for condition monitoring and predictive maintenance of industrial machines. In this work, we propose Hidden SemiMarkov Models (HSMMs) with (i) no constraints on the state duration density function and (ii) being applied to continuous or discrete observation. To deal with such a type of HSMM, we also propose modifications to the learning, inference, and prediction algorithms. Finally, automatic model selection has been made possible using the Akaike Information Criterion. This paper describes the theoretical formalization of the model as well as several experiments performed on simulated and real data with the aim of methodology validation. In all performed experiments, the model is able to correctly estimate the current state and to effectively predict the time to a predefined event with a low overall average absolute error. As a consequence, its applicability to real world settings can be beneficial, especially where in real time the Remaining Useful Lifetime (RUL) of the machine is calculated.
1. Introduction
Predictive models that are able to estimate the current condition and the Remaining Useful Lifetime of an industrial equipment are of high interest, especially for manufacturing companies, which can optimize their maintenance strategies. If we consider that the costs derived from maintenance are one of the largest parts of the operational costs [1] and that often the maintenance and operations departments comprise about 30% of the manpower [2, 3], it is not difficult to estimate the economic advantages that such innovative techniques can bring to industry. Moreover, predictive maintenance, where in real time the Remaining Useful Lifetime (RUL) of the machine is calculated, has been proven to significantly outperforms other maintenance strategies, such as corrective maintenance [4]. In this work, RUL is defined as the time, from the current moment, that the systems will fail [5]. Failure, in this context, is defined as a deviation of the delivered output of a machine from the specified service requirements [6] that necessitate maintenance.
Models like Support Vector Machines [7], Dynamic Bayesian Networks [8], clustering techniques [9], and data mining approaches [10] have been successfully applied to condition monitoring, RUL estimation, and predictive maintenance problems [11, 12]. State space models, like Hidden Markov Models (HMMs) [13], are particularly suitable to be used in industrial applications, due to their ability to model the latent state which represents the health condition of the machine.
Classical HMMs have been applied to condition assessment [14, 15]; however, their usage in predictive maintenance has not been effective due to their intrinsic modeling of the state duration as a geometric distribution.
To overcome this drawback, a modified version of HMM, which takes into account an estimate of the duration in each state, has been proposed in the works of TobonMejia et al. [16–19]. Thanks to the explicit state sojourn time modeling, it has been shown that it is possible to effectively estimate the RUL for industrial equipment. However, the drawback of their proposed HMM model is that the state duration is always assumed as Gaussian distributed and the duration parameters are estimated empirically from the Viterbi path of the HMM.
A complete specification of a duration model together with a set of learning and inference algorithms has been given firstly by Ferguson [20]. In his work, Ferguson allowed the underlying stochastic process of the state to be a semiMarkov chain, instead of a simple Markov chain of a HMM. Such model is referred to as Hidden SemiMarkov Model (HSMM) [21]. HSMMs and explicit duration modeles have been proven beneficial for many applications [22–25]. A complete overview of different duration model classes has been made by Yu [26]. Most state duration models, used in the literature, are nonparametric discrete distributions [27–29]. As a consequence, the number of parameters that describe the model and that have to be estimated is high, and consequently the learning procedure can be computationally expensive for real complex applications. Moreover, it is necessary to specify a priori the maximum duration allowed in each state.
To alleviate the high dimensionality of the parameter space, parametric duration models have been proposed. For example, Salfner [6] proposed a generic parametric continuous distribution to model the state sojourn time. However, in their model, the observation has been assumed to be discrete and applied to recognize failureprone observation sequence. Using continuous observation, Azimi et al. [30–32] specified an HSMM with parametric duration distribution belonging to the Gamma family and modeled the observation process by a Gaussian.
Inspired by the latter two approaches, in this work we propose a generic specification of a parametric HSMM, in which no constraints are made on the model of the state duration and on the observation processes. In our approach, the state duration is modeled as a generic parametric density function. On the other hand, the observations can be modeled either as a discrete stochastic process or as continuous mixture of Gaussians. The latter has been shown to approximate, arbitrarily closely, any finite, continuous density function [33]. The proposed model can be generally used in a wide range of applications and types of data. Moreover, in this paper we introduce a new and more effective estimator of the time spent by the system in a determinate state prior to the current time. To the best of our knowledge, a part from the above referred works, the literature on HSMMs applied to prognosis and predictive maintenance for industrial machines is limited [34]. Hence, the present work aims to show the effectiveness of the proposed duration model in solving condition monitoring and RUL estimation problems.
Dealing with state space models, and in particular of HSMMs, one should define the number of states and correct family of duration density, and in case of continuous observations, the adequate number of Gaussian mixtures. Such parameters play a prominent role, since the right model configuration is essential to enable an accurate modeling of the dynamic pattern and the covariance structure of the observed time series. The estimation of a satisfactory model configuration is referred to as model selection in literature.
While several stateoftheart approaches use expert knowledge to get insight on the model structure [15, 35, 36], an automated methodology for model selection is often required. In the literature, model selection has been deeply studied for a wide range of models. Among the existing methodologies, information based techniques have been extensively analyzed in literature with satisfactory results. Although Bayesian Information Criterion (BIC) is particularly appropriate to be used in finite mixture models [37, 38], Akaike Information Criterion (AIC) has been demonstrated to outperform BIC when applied to more complex models and when the sample size is limited [39, 40], which is the case of the target application of this paper.
In this work AIC is used to estimate the correct model configuration, with the final goal of an automated HSMMs model selection, which exploits only the information available in the input data. While model selection techniques have been extensively used in the framework of Hidden Markov Models [41–43], to the best of our knowledge, the present work is the first that proposes their appliance to duration models and in particular to HSMMs.
In summary, the present work contributes to condition monitoring, predictive maintenance, and RUL estimation problems by(i)proposing a general Hidden SemiMarkov Model applicable for continuous or discrete observations and with no constraints on the density function used to model the state duration;(ii)proposing a more effective estimator of the state duration variable , that is, the time spent by the system in the th state, prior to current time ;(iii)adapting the learning, inference and prediction algorithms considering the defined HSMM parameters and the proposed estimator;(iv)using the Akaike Information Criterion for automatic model selection.
The rest of the paper is organized as follows: in Section 2 we introduce the theory of the proposed HSMM together with its learning, inference, and prediction algorithms. Section 3 gives a short theoretical overview of the Akaike Information Criterion. Section 4 presents the methodology used to estimate the Remaining Useful Lifetime using the proposed HSMM. In Section 5 experimental results are discussed. The conclusion and future research directions are given in Section 6.
2. Hidden SemiMarkov Models
Hidden SemiMarkov Models (HSMMs) introduce the concept of variable duration, which results in a more accurate modeling power if the system being modeled shows a dependence on time.
In this section we give the specification of the proposed HSMM, for which we model the state duration with a parametric statedependent distribution. Compared to nonparametric modeling, this approach has two main advantages:(i)the model is specified by a limited number of parameters; as a consequence, the learning procedure is computationally less expensive;(ii)the model does not require the a priori knowledge of the maximum sojourn time allowed in each state, being inherently learnt through the duration distribution parameters.
2.1. Model Specification
A Hidden SemiMarkov Model is a doubly embedded stochastic model with an underlying stochastic process that is not observable (hidden) but can only be observed through another set of stochastic processes that produce the sequence of observations. HSMM allows the underlying process to be a semiMarkov chain with a variable duration or sojourn time for each state. The key concept of HSMMs is that the semiMarkov property holds for this model: while in HMMs the Markov property implies that the value of the hidden state at time depends exclusively on its value of time , in HSMMs the probability of transition from state to state at time depends on the duration spent in state prior to time .
In the following we denote the number of states in the model as , the individual states as , and the state at time as . The semiMarkov property can be written as where the duration variable is defined as the time spent in state prior to time .
Although the state duration is inherently discrete, in many studies [44, 45] it has been modeled with a continuous parametric density function. Similar to the work of Azimi et al. [30–32], in this paper, we use the discrete counterpart of the chosen parametric probability density function (pdf). With this approximation, if we denote the pdf of the sojourn time in state as , where represents the set of parameters of the pdf relative to the th state, the probability that the system stays in state for exactly time steps can be calculated as . Considering the HSMM formulation, we can generally denote the state dependent duration distributions by the set of their parameters relative to each state as .
Many related works on HSMMs [31, 32, 44, 45] consider within the exponential family. In particular, Gamma distributions are often used in speech processing applications. In this work, we do not impose a type of distribution function to model the duration. The only requirement is that the duration should be modeled as a positive function, being negative durations physically meaningless.
HSMMs require also the definition of a “dynamic” transition matrix, as a consequence of the semiMarkov property. Differently from the HMMs in which a constant transition probability leads to a geometric distributed state sojourn time, HSMMs explicitly define a transition matrix which, depending on the duration variable, has increasing probabilities of changing state as the time goes on. For convenience, we specify the state duration variable in a form of a vector with dimensions as The quantity can be easily calculated by induction from as where is if , otherwise.
If we assume that at time the system is in state , we can formally define the durationdependent transition matrix as with
The specification of the model can be further simplified by observing that, at each time , the matrix can be decomposed in two terms: the recurrent and the nonrecurrent state transition probabilities.
The recurrent transition probabilities , which depend only on the duration vector and the parameters , take into account the dynamics of the selftransition probabilities. It is defined as the probability of remaining in the current state at the next time step, given the duration spent in the current state prior to time : The denominator in (5) can be expressed as , which is the probability that the system, at time , has been staying in state for at least time units. The above expression is equivalent to , where is the duration cumulative distribution function relative to the the state , that is, . As a consequence, from (5) we can define the recurrent transition probabilities as a diagonal matrix with dimensions , as The usage of the cumulative functions in (6), which tend to 1 as the duration tends to infinity, suggests that the probability of selftransition tends to decrease as the sojourn time increases, leading the model to always leave the current state if time approaches infinity.
The nonrecurrent state transition probabilities, , rule the transitions between two different states. It is represented by a matrix with the diagonal elements equal to zero, defined as must be specified as a stochastic matrix; that is, its elements have to satisfy the constraint for all .
As a consequence of the above decomposition, the dynamic of the underlying semiMarkov chain can be defined by specifying only the statedependent duration parameters and the nonrecurrent matrix , since the model transition matrix can be calculated, at each time , using (6) and (7): where is the identity matrix. If we denote the elements of the dynamic transition matrix as , the stochastic constraint for all and is guaranteed from the fact that is a diagonal matrix and is a stochastic matrix.
For several applications it is necessary to model the absorbing state which, in the case of industrial equipment, corresponds to the “broken” or “failure” state. If we denote the absorbing state as with , we must fix the th row of the nonrecurrent matrix to be and for all with . By substituting such matrix in (8), it is easy to show that the element and remains constant for all , while the duration probability parameters are not influent for the absorbing state . An example of absorbing state specification will be given in Section 5.
With respect to the input observation signals, in this work we consider both continuous and discrete data, by adapting the suitable observation model depending on the observation nature. In particular, for the continuous case, we model the observations with a multivariate mixture of Gaussians distributions. This choice presents two main advantages: (i) a multivariate model allows to deal with multiple observations at the same time; this is often the case of industrial equipments modeling since, at each time, multiple sensors’ measurements are available, and (ii) mixture of Gaussians has been proved to closely approximate any finite and continuous density function [33]. Formally, if we denote by the observation vector at time and the generic observation vector being modeled as , the observation density for the th state is represented by a finite mixture of gaussians where is the mixture coefficient for the th mixture in state , which satisfies the stochastic constraint for and for and , while is the Gaussian density, with mean vector and covariance matrix for the th mixture component in state .
In case of discrete data, we model the observations within each state with a nonparametric discrete probability distribution. In particular, if is the number of distinct observation symbols per state and if we denote the symbols as and the observation at time as , the observation symbol probability distribution can be defined as a matrix of dimensions where Since the system in each state at each time step can emit one of the possible symbols, the matrix is stochastic; that is, it is constrained to for all .
Finally, as in the case of HMMs, we specify the initial state distribution which defines the probability of the starting state as
From the above considerations, two different HSMM models can be considered. In the case of continuous observation, , and in the case of discrete observation the HSMM is characterized by . An example of continuous HSMM with 3 states is shown in Figure 1.
2.2. Learning and Inference Algorithms
Let us denote the generic sequence of observations, being indiscriminately continuous vectors or discrete symbols, as ; in order to use the defined HSMM model in practice, similarly to the HMM, we need to solve three basic problems.(1)Given the observation and a model , calculate the probability that the sequence has been generated by the model , that is, .(2)Given the observation and a model , calculate the state sequence which have most probably generated the sequence .(3)Given the observation find the parameters of the model which maximize .
As in case of HMM, solving the above problems requires using the forwardbackward [13], decoding (Viterbi [46] and Forney [47]) and Expectation Maximization [48] algorithms, which will be adapted to the HSMM introduced in Section 2.1. In the following, we also propose a more effective estimator of the state duration variable defined in (2).
2.2.1. The ForwardBackward Algorithm
Given a generic sequence of observations , the goal is to calculate the model likelihood, that is, . This quantity is useful for the training procedure, where the parameters that locally maximize the model likelihood are chosen, as well as for classification problems. The latter is the case in which the observation sequence has to be mapped to one of a finite set of classes, represented by a set of HSMM parameters . The class of is chosen such that .
To calculate the model likelihood, we first define the forward variable at each time as
Contrarily to HMMs, for HSMMs the state duration must be taken into account in the the forward variable calculation. Consequently, Yu [26] proposed the following inductive formula: that is, the sum of the probabilities of being in the current state (at time ) for the past time units, (with and the maximum allowed duration for each state), coming from all the possible previous states , , and .
The disadvantage of the above formulation is that, as discussed in Introduction, the specification of the maximum duration represents a limitation to the modeling generalization. Moreover, from (13), it is clear that the computation and memory complexities drastically increase with , which can be very large in many applications, in particular for online failure prediction.
To alleviate this problem, Azimi et al. [30–32] introduced a new forward algorithm for HSMMs that, by keeping track of the estimated average state duration at each iteration, has a computational complexity comparable to the forward algorithm for HMMs [13]. However, the average state duration represents an approximation. Consequently, the forward algorithm of Azimi, compared with (13), pays the price of a lower precision in favor of a (indispensable) better computational efficiency.
To calculate the forward variable using Azimi’s approach, the durationdependent transition matrix, defined in (8), is taken in consideration in the induction formula of (13), which becomes [30] To calculate the above formula, the average state duration of (2) must be estimated, for each time , by means of the variable , defined as where denotes the expected value. To calculate the above quantity, Azimi et al. [30–32] use the following formula: where represents the element by element product between two matrices/vectors and the vector (the probability of being in state at time given the observation sequence and the model parameters) with dimensions is calculated in terms of as
Equation (16) is based on the following induction formula [30–32] that rules the dynamics of the duration vector when the system’s state is known: where, for each , is if , otherwise.
A simple example shows that (18) is incorrect: assuming an HSMM with three states and considering the state sequence the correct sequence of the duration vector is , , and , where the superscript denotes vector transpose. If we apply (18), we obtain , , and , which is in contradiction with the definition of the state duration vector given in (2).
To calculate the average state duration variable we propose a new induction formula that estimates, for each time , the time spent in the th state prior to as The derivation of (20) is given in Appendix. The intuition behind (19) is that the current average duration is the previous average duration plus one, weighted with the “amount” of the current state that was already in state in the previous step.
Using the proposed (20), the forward algorithm can be specified as follows:(1)initialization, with : where is estimated using (6);(2)induction, with and : where are the coefficients of the matrix ;(3)termination:
Similar considerations as the forward procedure can be made for the backward algorithm, which is implemented by defining the variable as
Having estimated the dynamic transition matrix for each using (24), the backward variable can be calculated inductively as follows.(1)Initialization: (2)Induction:
Although the variable is not necessary for the calculation of the model likelihood, it will be useful in the parameter reestimation procedure, as it will be explained in Section 2.2.3.
2.2.2. The Viterbi Algorithm
The Viterbi algorithm [46, 47] (also known as decoding) allows determining the best state sequence corresponding to a given observation sequence.
Formally, given a sequence of observation , the best state sequence corresponding to is calculated by defining the variable as
The procedure to recursively calculate the variable and to retrieve the target state sequence (i.e., the arguments which maximize the ’s) for the proposed HSMM is a straightforward extension of the Viterbi algorithm for HMMs [13]. The only change is the usage, in the recursive calculation of , of the dynamic transition matrix , calculated through (24). The Viterbi algorithm for the introduced parametric HSMMs can be summarized as follows:(1)initialization, with : (2)recursion, with and : (3)termination: where we keep track of the argument maximizing (31) using the vector , which, tracked back, gives the desired best state sequence:
2.2.3. The Training Algorithm
The training algorithm consists of estimating the model parameters from the observation data. As discussed in Section 2.1, a parametric HSMM is defined by if the observations are continuous, or if the observations are discrete. Given a generic observation sequence , referred to as training set in the following, the training procedure consists of finding the model parameter set which locally maximizes the model likelihood .
We use the modified BaumWelch algorithm of Azimi et al. [30–32]. However, in our implementation we do not make assumption on the density function used to model the state duration, and we consider both continuous and discrete observations.
Being a variant of the more general ExpectationMaximization (EM) algorithm, BaumWelch is an iterative procedure which consists of two steps: (i) the expectation step, in which the forward and backward variables are calculated and the model likelihood is estimated and (ii) the maximization step, in which the model parameters are updated and used in the next iteration. This process usually starts from a random guess of the model parameters and it is iterated until the likelihood function does not improve between two consecutive iterations.
Similarly to HMMs, the reestimation formulas are derived by firstly introducing the variable , which represents the probability of being in state at time , and in state at time , given the model and the observation sequence, as However, in the HSMM case, the variable considers the duration estimation performed in the forward algorithm (see Equation (24)). Formulated in terms of the forward and backward variables, it is given by
From we can derive the quantity (already defined in (17)) representing the probability of being in state at time given the observation sequence and the model parameters:
Finally, the the reestimation formulas for the parameters and are given by where is a square matrix of dimensions where for and for , represents the element by element product between two matrices, is the expected number of transitions from state , and is the expected number of transitions from state to state .
Equation (39) represents the expected number of times that the model starts in state , while (40) represents the expected number of transitions from state to state with over the total expected number of transitions from state to any other state different from .
For the matrix , being normalized, the stochastic constraints are satisfied at each iteration, that is, for each , while the estimation of the prior probability inherently sums up to at each iteration, since it represents the expected frequency in state at time for each .
With respect to the reestimation of the state duration parameters , firstly, we estimate the mean and the variance of the th state duration for each , from the forward and backward variables and the estimation of the state duration variable where (41) can be interpreted as the probability of transition from state to with at time weighted by the duration of state at , giving the desired expected value, while in (42) the same quantity is weighted by the squared distance of the duration at time from its mean, giving the estimation of the variance.
Then, the parameters of the desired duration distribution can be estimated from and . For example, if a Gamma distribution with shape parameter and scale parameter is chosen to model the state duration, the parameters and for each can be calculated as and .
Concerning the observation parameters, once the modified forward and backward variables, accounting for the state duration, are defined as in (22) and (28), the reestimation formulas are the same as for Hidden Markov Models [13].
In particular, for continuous observations, the parameters of the Gaussians’ mixture defined in (9) are reestimated by firstly defining the probability of being in state at time with the probability of the observation vector evaluated by the th mixture component, as By using the former quantity, the parameters , , and are reestimated through the following formulas: where superscript denotes vector transpose.
For discrete observations, the reestimation formula for the observation matrix is where the quantity , which takes into account the duration dependent forward variable , is calculated through (17).
The reader is referred to Rabiner’s work [13] for the interpretation on the observation parameters reestimation formulas.
3. AICBased Model Selection
In the framework of the proposed parametric HSMMs, the model selection procedure aims to select the optimal number of hidden states , the right duration distribution family, and, in the case of mixture observation modeling, the number of Gaussian mixtures to be used. In this work, we make use of the Akaike Information Criterion (AIC). Indeed, it has been seen that in case of complex models and in presence of a limited number of training observations, AIC represents a satisfactory methodology for model selection, outperforming other approaches like Bayesian Information Criterion.
In general, information criteria are represented as a twoterm structure. They account for a compromise between a measure of model fitness, which is based on the likelihood of the model, and a penalty term which takes into account the model complexity. Usually, the model complexity is measured in terms of the number of parameters that have to be estimated and in terms of the number of observations.
The Akaike Information Criterion is an estimate of the asymptotic value of the expected distance between the unknown true likelihood function of the data and the fitted likelihood function of the model. In particular, the AIC can be expressed as where is the likelihood of the model with the estimated parameters , as defined in (25), is the number of model parameters, and is the length of the observed sequence. The best model is the one minimizing equation (46).
Concerning , the number of parameters to be estimated for a parametric HSMM with states is , where are the parameters of the hidden states layer, while are those of the observation layer.
In particular where(i) accounts for the prior probabilities ;(ii) accounts for the nonrecurrent transition matrix ;(iii) accounts for the duration probability, being the number of parameters of the duration distribution.
Concerning , a distinction must be made between discrete and continuous observations:(i)in the case of discrete observations with possible observable values, , which accounts for the elements of the observation matrix ;(ii)if the observations are continuous and a multivariate mixture of Gaussians with variates is used as observation model, where each term accounts, respectively, for the mean vector , the covariance matrix , and the mixture coefficients .
4. Remaining Useful Lifetime Estimation
One of the most important advantages of the time modeling of HSMMs is the possibility to effectively face the prediction problem. The knowledge of the state duration distributions allows the estimation of the remaining time in a certain state and, in general, the prediction of the expected time before entering in a determinate state.
As already mentioned, an interesting application of the prediction problem is the Remaining Useful Lifetime (RUL) estimation of industrial equipments. Indeed, if each state of an HSMM is mapped to a different condition of an industrial machine and if the state that represents the failure condition is identified, at each moment, the RUL can be defined as the expected time to reach the failure state . If we assume that the time to failure is a random variable following a determinate probability density, we define the RUL at the current time as where denotes the expected value.
Having fixed the failure state, the estimation of the RUL is performed, in two steps, every time a new observation is acquired (online):(1)estimation of the current state;(2)projection of the future state transitions until the failure state is reached and estimation of the expected sojourn time.
The estimation of the current state is performed via the Viterbi path, that is, the variable defined in (29). To correctly model the uncertainty of the current state estimation, we use the normalized variable obtained as that is, an estimate of the probability of being in state at time .
Together with the normalized variable , the maximum a posteriori estimate of the current state is taken into account according to (34). If coincides with the failure state, the desired event is detected by the model and the time to this event is obviously zero. Otherwise, an estimation of the average remaining time in the current state is calculated as where with we denote the expected value of the duration variable in state according to the duration distribution specified by the parameters . Equation (49) thus estimates the remaining time in the current state by subtracting the estimated states duration, , at time from the expected sojourn time of state , and weighting the result using the uncertainty about the current state, , and, finally, by summing up all the contributions from each state.
In addition to the average remaining time, a lower and an upper bound value can be calculated based on the standard deviation, , of the duration distribution for state :
Once the remaining time in the current state is estimated, the probability of the next state is calculated by multiplying the transpose of the nonrecurrent transition matrix by the current state probability estimation as follows: while the maximum a posteriori estimate of the next state is calculated as
Again, if coincides with the failure state, the failure will happen after the remaining time at the current state is over and the average estimation of the failure time is calculated at the previous step, with the bound values and . Otherwise the estimation of the sojourn time of the next state is calculated as follows:
This procedure is repeated until the failure state is encountered in the prediction of the next state. The calculation of the RUL is then simply obtained by summing all the estimated remaining time in each intermediate state before encountering the failure state:
Finally, Algorithm 1 details the above described RUL estimation procedure.

5. Experimental Results
To demonstrate the effectiveness of the proposed HSMM models, we make use of a series of experiments, performed both on simulated and real data.
The simulated data were generated by considering a leftright HSMM and adapting the parameters of the artificial example reported in the work of Lee et al. [15]. The real case data are monitoring data related to the entire operational life of bearings, made available for the IEEE PHM 2012 data challenge (http://www.femtost.fr/en/Researchdepartments/AS2M/Researchgroups/PHM/IEEEPHM2012Datachallenge.php).
5.1. Simulated Experiment
Data have been generated with the idea of simulating the behavior of an industrial machine that, during its functioning, experiences several degradation modalities until a failure state is reached at the end of its lifetime. The generated data are used to test the performances of our methodology for (i) automatic model selection, (ii) online condition monitoring, and (iii) Remaining Useful Lifetime estimation, considering both continuous and discrete observations.
5.1.1. Data Generation
The industrial machine subject of these experiments has been modeled as a leftright parametric HSMM, with states, having state as absorbing (failure) state. The choice of a leftright setting has been made for simplicity reasons, since the primary goal of this work is to demonstrate that the proposed model specification coupled with the Akaike Information Criterion is effective to solve automatic model selection, online condition monitoring, and prediction problems. At this purpose, we divided the experiments in two cases according to the nature of the observation, being continuous or discrete.
For each of the continuous and the discrete cases, three data sets have been generated by considering the following duration models: Gaussian, Gamma, and Weibull densities. For each of the three data sets, 30 series of data are used as training set and 10 series as testing set. Each time series contains observations. The parameters used to generate the data are taken from the work of Lee et al. [15] and are adapted to obtain an equivalent leftright parametric HSMM as follows: where , , and are the different distribution parameters used for the Gaussian, Gamma, and Weibull duration models, respectively. More precisely, they represent the values of the mean and the the variance of the Gaussian distribution, the shape and the scale of the Gamma distribution, and the scale and the shape of the Weibull distribution. It must be noticed that, as explained in Section 2.1, for state , being the absorbing state, the duration parameters have no influence on the data, since, once the state is reached, the system will remain there forever.
Concerning the continuous observation modeling, a bivariate Gaussian distribution has been used with the following parameters [15]: while for the discrete case, distinct observation symbols have been taken into consideration with the following observation probability distribution An example of simulated data both for the continuous and the discrete cases is shown in Figure 2, where a Gaussian duration model has been used.
(a) Example of simulated data for the continuous case
(b) Example of simulated data for the discrete case
5.1.2. Training and Model Selection
The goal of this experimental phase is to test the effectiveness of the AIC in solving the automatic model selection. For this purpose, the training sets of the 6 data sets (continuous/discrte observation with Gaussian, Gamma, and Weibull duration models) have been taken individually and, for each one of them, a series of learning procedure has been run, each one with a variable HSMM structure. In particular we took into account all the combinations of the duration distribution families (Gaussian, Gamma, and Weibull), an increasing number of states, , from 2 to 8 and, for the continuous observation cases, an increasing number of Gaussian mixtures, , in the observation distribution from 1 to 4.
As accurate parameter initialization is crucial for obtaining a good model fitting [14], a series of 40 learning procedures corresponding to 40 random initializations of the initial parameters have been executed for each of the considered HSMM structures. For each model structure, the AIC value, as defined in (46), has been evaluated. The final trained set of parameters corresponding to the minimum AIC value has been retained, resulting in 7 HSMMs with a number of states from 2 to 8.
The obtained results are shown in Figure 3, for both, the continuous and discrete observation data. As it can be noticed, for all the 6 test cases of Figure 3 the AIC values do not improve much for a number of states higher than 5, meaning that adding more states does not add considerable information to the HSMM modeling power. Hence 5 states can be considered as an optimal number of states. Moreover, as shown in the zoomed sections of Figure 3, for the HSMMs with 5 states, the minimum AIC values are obtained for the duration distributions corresponding to the ones used to generate the data. As a consequence AIC can be considered as an effective approach to perform model selection for HSMM, as well as selecting the appropriate parametric distribution family for the state duration modeling.
(a) AIC values for continuous data and Gaussian duration distribution
(b) AIC values for continuous data and Gamma duration distribution
(c) AIC values for continuous data and Weibull duration distribution
(d) AIC values for discrete data and Gaussian duration distribution
(e) AIC values for discrete data and Gamma duration distribution
(f) AIC values for discrete data and Weibull duration distribution
5.1.3. Condition Monitoring
The optimal parameters obtained in the previous phase have been used to perform the online condition monitoring experiment on the 10 testing cases for all the 6 considered HSMM configurations. In this experiment, we simulate online monitoring by considering all the testing observations up to the current time, that is, . Each time a new data point is acquired, the Viterbi algorithm is used to estimate the current state as specified in (34).
An example of execution of the condition monitoring experiment is shown in Figure 4, for both, continuous and discrete observations, respectively. In Figure 4(a) the state duration has been modeled with a Gamma distribution, while in Figure 4(b) with a Gaussian distribution. In Figures 4(a) and 4(b), the first display represents the true state of the HSMM and the second display represents the estimated state from the Viterbi algorithm, while the third display represents the observed time series.
(a) State estimation with Viterbi path for continuous data and Gamma duration distribution
(b) State estimation with Viterbi path for discrete data and Gaussian duration distribution
Knowing the true state sequence we calculated the accuracy, defined as the percentage of correctly estimated states over the total length of the state sequence, for each of the testing cases. The results are summarized in Table 1(a) for the continuous observations and in Table 1(b) for the discrete observations. The high percentage of correct classified states shows that HSMMs can be effectively used to solve condition monitoring problems for applications in which the system shows a strong time and state duration dependency.
(a) Continuous observations  
 
(b) Discrete observations  

5.1.4. Remaining Useful Lifetime Estimation
In this experimental phase we considered the state as the failure state and the trained parameters of Section 5.1.2 for each HSMM configuration. As the online RUL estimation procedure is intended to be used in real time, we simulated condition monitoring experiment where we progressively consider the observations up to time . When a new observation is acquired, after the current state probability is estimated (Equation (48)), the calculation of the average, upper, and lower RUL ((57), (58), and (59)) is performed.
Examples of RUL estimation are illustrated in Figure 5. In particular Figure 5(a) represents the case of continuous observations and duration modeled by a Weibull distribution, while Figure 5(b) shows the case of discrete observations and duration modeled by a Gamma distribution. From the figures one can notice that the average, as well as the lower and the upper bound estimations, converges to the real RUL as the failure time approaches. Moreover, as expected, the uncertainty about the estimation decreases with the time, since predictions performed at an early stage are more imprecise. As a consequence, the upper and the lower bound become more narrow as the failure state approaches, and the estimation becomes more precise until it converges to the actual RUL value, with the prediction error tending to zero at the end of the evaluation.
(a) Remaining Useful Lifetime estimation for continuous data and Weibull duration distribution
(b) Remaining Useful Lifetime estimation for discrete data and Gamma duration distribution
To quantitatively estimate the performance of our methodology for the RUL estimation, we considered at each time the absolute prediction error (APE) between the real RUL and the predicted value defined as where is the (known) value of the RUL at time , while is RUL predicted by the model. To evaluate the overall performance of our methodology, we considered the average absolute prediction error of the RUL estimation, defined as where is the length of the testing signal. being a prediction error average, values of (64) close to zero correspond to good predictive performances.
The results for each of the 10 testing cases and the different HSMM configurations are reported in Tables 2(a) and 2(b), for the continuous and the discrete observation cases, respectively. As it can be noticed, the online Remaining Useful Lifetime estimation and in general the online prediction of the time to a certain event can be effectively faced with HSMMs, which achieve a reliable estimation power with a small prediction error.
(a) of the RUL estimation for the continuous observation test cases  
 
(b) of the RUL estimation for the discrete observation test cases  

Finally, we tested our RUL estimation methodology using the state duration estimation of (16) introduced by Azimi et al. [30–32]. The results are shown in Tables 3(a) and 3(b), in which, respectively, the prediction errors obtained for continuous and discrete observations are reported.
(a) of the RUL estimation for the continuous observation test cases  
 
(b) of the RUL estimation for the discrete observation test cases  

Comparing Table 2 and Table 3, one can notice that the proposed RUL method outperforms the one of Azimi. This is mainly due to the proposed average state duration of (20), compared to the one of Azimi given by (16).
5.2. Real Data
In this section we apply the proposed HSMMbased approach for RUL estimation to a real case study using bearing monitoring data recorded during experiments carried out on the Pronostia experimental platform and made available for the IEEE Prognostics and Health Management (PHM) 2012 challenge [49]. The data correspond to normally degraded bearings, leading to cases which closely correspond to the industrial reality.
The choice of testing the proposed methodology on bearings derives from two facts: (i) bearings are the most critical components related to failures of rotating machines [50] and (ii) their monotonically increasing degradation pattern justifies the usage of leftright HSMM models.
5.2.1. Data Description
Pronostia is an experimental platform designed and realized at the Automatic Control and MicroMechatronic Systems (AS2M) Department of FrancheComté Electronics, Mechanics, Thermal Processing, OpticsScience and Technology (FEMTOST) Institute (http://www.femtost.fr/) (Besançon, France), with the aim of collecting real data related to accelerated degradation of bearings. Such data are used to validate methods for bearing condition assessment, diagnostic and prognostic [19, 51–59].
The Pronostia platform allows to perform runtofailure tests under constant or variable operating conditions. The operating conditions are determined by two factors that can be controlled online: (i) the rotation speed and (ii) the radial force load. During each experiment, temperature and vibration monitoring measurements are gathered online, through two type of sensors placed in the bearing housing: a temperature probe and two accelerometers (one on the vertical and one on the horizontal axis).
The platform is composed of three main parts: a rotating part, a load profile generation part, and a measurement part, as illustrated in Figure 6.
The rotating part is composed of an asynchronous motor which develops a power equal to 250 W, two shafts, and a gearbox, which allows the motor to reach its rated speed of 2830 rpm. The motor’s rotation speed and the direction are set through a human machine interface.
The load profiles part issues a radial force on the external ring of the test bearing through a pneumatic jack connected to a lever arm, which indirectly transmits the load through a clamping ring. The goal of the applied radial force is to accelerate the bearing’s degradation process.
The measurement part consists of a data acquisition card connected to the monitoring sensors, which provides the user with the measured temperature and vibration data. The vibration measurements are provided in snapshots of 0.1 s collected each 10 seconds at a sampling frequency of 25.6 kHz (2560 samples per each snapshot), while the temperature has been continuously recorded at a sampling frequency of 10 Hz (600 samples collected each minute).
Further details on the Pronostia test rig can be found on the data presentation paper [49] and on the web page of the data challenge (http://www.femtost.fr/en/Researchdepartments/AS2M/Researchgroups/PHM/IEEEPHM2012Datachallenge.php).
Regarding the data provided for the PHM 2012 challenge, 3 different operating conditions were considered:(i)first operating conditions: speed of 1800 rpm and load of 4000 Newton;(ii)second operating conditions: speed of 1650 rpm and load of 4200 Newton;(iii)third operating conditions: speed of 1500 rpm and load of 5000 Newton.
Under the above operating conditions, a total of 17 accelerated life tests were realized on bearings of type NSK 6804 DD, which can operate at a maximum speed of 13000 rpm and a load limit of 4000 N. The tests were stopped when the amplitude of the vibration signal was higher than 20 g; thus this moment was defined as the bearing failure time. An example of bearing before and after the experiment is shown in Figure 7 together with the corresponding vibration signal collected during the whole test.
Table 4 reports how the 17 tested bearings were separated into the three operating conditions. Moreover, the duration of each experiment, being the RUL to be predicted for each bearing, is also given. We performed two sets of experiments by considering, respectively, the bearings relative to the first and the second operating condition (i.e., Bearing1_1, Bearing1_2, …, Bearing1_7 and Bearing2_1, Bearing2_2, …, Bearing2_7).

As already mentioned, the available data correspond to normally degraded bearings, meaning that the defects were not initially induced and that each degraded bearing contains almost all the types of defects (balls, rings, and cage), resembling faithfully a common real industrial situation. Moreover, no assumption about the type of failure to be occurred is provided with the data and, since the variability in experiment durations is high (from 1 h to 7 h), performing good estimates of the RUL is a difficult task [49].
In our experiments we considered, as input to our model, the horizontal channel of the accelerometer. We preprocessed the raw signals by extracting two timedomain features, that is, root mean square (RMS) and kurtosis, within windows of the same length as the given snapshots (). Let be the raw signal of the th window; for each we estimate RMS as and kurtosis as , where is the mean of . An example of feature extraction for Bearing1_1 is shown in Figure 8.
To assess the performance of the proposed HSMM, after the model selection procedure, we implemented a leaveoneout cross validation scheme: by considering separately conditions 1 and 2, we performed the online RUL estimation for each of the 7 bearings, using an HSMM trained with the remaining 6 bearing histories. Similarly to the simulated case, we considered the average absolute prediction error, defined in (64), to quantitatively evaluate our method.
5.2.2. Bearings RUL Estimation
We performed our experiments in two steps: firstly we applied model selection in order to determine an optimal model structure, and secondly, we estimated the RUL of the bearings. The full procedure is detailed in the following.
(A) HSMM Structure. To determine an appropriate HSMM structure for effectively modeling the considered data, we considered several HSMM structures, characterized by (i) the duration distribution family (being Gaussian, Gamma, or Weibull), (ii) an increasing number of states, , from 2 to 6, and (iii) an increasing number of Gaussian mixtures, , in the observation density from 1 to 4. For each model structure, obtained by systematically considering all the combinations of (i) to (iii), we run 120 parameter learnings, corresponding to 120 random initializations, , on the data sets (Bearing1_1, Bearing1_2, …, Bearing1_7 and Bearing2_1, Bearing2_2, …, Bearing2_7).
Similar to Section 5.1.2, at the end of each learning we evaluated the AIC value (Equation (46)) as reported in Figures 9(a) and 9(b) for conditions 1 and 2, respectively. In both cases, the global minimum AIC value corresponds to an HSMM with states, a Weibull duration model, and a Gaussians mixture for the observation density.
(a) AIC values for Condition 1
(b) AIC values for Condition 2
(B) RUL Estimation. Using the above obtained optimal HSMM structure, we trained it via a leaveoneout cross validation scheme by using for condition 1, at each iteration, Bearing1_i, , as the testing bearing, while the remaining six bearings were used for training. Once the trained parameters were estimated for the th testing bearing, we progressively collected the observations of the tested Bearing1_i to calculate, at each time , the average, lower, and upper RUL, as specified in (57), (58), and (59), respectively, considering the state as the failure state. The same procedure has been performed for the bearings in condition 2.
Examples of RUL estimation for Bearing1_7 and Bearing2_6 are shown in Figures 10(a) and 10(b), respectively, where the black solid line represents the real RUL which goes to zero as the time goes on. As it can be seen, the average as well as the lower and the upper bound estimations converge to the real RUL as the real failure time approaches and the uncertainty about the estimation decreases with time.
(a) RUL estimation for Bearing1_7
(b) RUL estimation for Bearing2_6
Concerning the quantitative estimation of the predictive performances, we report in Table 5 the average absolute prediction error of the RUL estimation (see Equation (64)), expressed in seconds. As it can be noticed, the average absolute prediction error of the average RUL is, respectively, 1 hour and 15 minutes for condition 1, and 1 hour and 5 minutes for condition 2, which are good values, considering the high variability in the training set durations and the fact that the performance metric takes into account also the less accurate predictions performed at an early stage of the bearings life. Moreover, for both conditions, the average prediction errors of 5 tests out of 7 are below the average, while the best average error of the mean RUL is only 23 minutes for condition 1 while it further decreases to 14 minutes for condition 2.
(a) Condition 1  
 
(b) Condition 2  
