Current Chemometrics Research Projects

Projects

Improving the Robustness and Transfer of Multivariate Calibrations

Data and Model Fusion for Multivariate Classification and Calibration

Novel Methods for Improved Classification of High-Dimensional Data Sets




Novel Methods for Improved Classification of High-Dimensional Data Sets

Computer-aided classification of gas chromatographic signatures is being used to analyze complex chromatographic data from biological systems.  Modern-day gas chromatograms often yield information that can be exceedingly complex.  It is not unusual to see dozens and, in some cases, even hundreds of components resolved by highly efficient fused silica capillary columns.  Important information contained in such complex chromatograms is not readily interpretable without extensive use of advanced computational techniques. 

Standard statistical techniques, such as linear and quadratic discriminant analysis (LDA and QDA), have a sound foundation in statistics and can be interpreted relatively easily. These methods are commonly used in the computer-aided classification of signatures from chemical instrumentation. These discriminant methods make assumptions about the underlying class structures of the data [1,2], however. Conventional data analysis, whether by ad hoc methods or by more conventional statistical methods such as PCA compression followed by discriminant analysis, depends critically on careful preprocessing of the data to warp the decision space so that, in the best cases, it reflects differences in data origins. Because of this limitation, detecting outliers and determining the uncertainty in a classification have generally been neglected. Pursuing optimal classification has usually led to a focus on the preprocessing step as a way to reduce the space of the data without losing too much of the information carried in the chromatograms or introducing too much bias into the data analysis through assumptions about the underlying form of the classes believed present in the dataset. A problem with this approach is that the emphasis is usually placed on separation of classes and not on an assessment of the uncertainty in any sample’s class assignment or on the reliable detection of outliers. There is also no easy way to relate the measured data to the class assignment and its uncertainty, or to incorporate expert, external information into the classification.

Other classification techniques, including the Bayes’ classifier, Soft Independent Modeling of Class Analogy (SIMCA), and neural networks, make fewer assumptions about the intrinsic nature of the classes in the data, but can require larger amounts of data to build adequately predictive models [1,3]. Tree-based methods such as C4.5 or Classification and Regression Trees (CART) make no assumptions about class structure within the data and do not require large amounts of data, but traditionally have been restricted to making binary partitions of the data space in order to separate the classes [1,4].

While the methods of Bayesian statistics have long been a mainstay of data analysis in the areas of economics, social sciences and artificial intelligence, they have only recently been applied to chemical and biological problems [5-7]. The methods are not at all new. Inference and learning in Bayesian systems are well-documented areas with many excellent sources [8-11]. Buntine [9] presents an excellent literature review of learning, while Pearl [11] provides much of the basis for computation in these systems. Here, it will suffice to explain what is meant by these topics. Inference is the process of turning local conditional probabilities into globally consistent tables and calculating posterior probabilities for nodes in the network. Learning is the determination of the parameters of the network from data: means and covariances in the continuous case and probabilities in the discrete case. A Bayes classifier, shown in Figure 1, is a simple one-layer representation of a probabilistic discriminant. This provides a probabilistic model with limited capacity, as it can only represent linear relationships.
 
The directions of the arcs represent the causation of the value of X by the values of the evidence nodes Pa(X). This figure describes a standard pattern-recognition/classification problem. Given a dataset of variables and their classification, the Bayesian classifier can be used to model the inherent relationships in the data and, given a test case, predict the classification of that sample with a level of certainty represented by the probability that X takes on class value i. A Bayesian classifier has several interesting features that we are interested in exploiting: 1) A probability of class assignment is returned instead of just a class assignment; 2) Missing data are handled through an inference procedure where the missing value is filled with E(Xk|e), where e refers to the current set of evidence; 3) Given any subset of evidence and a consistent network, the probability of all other values can be calculated.
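The first two features can be illustrated with a small numerical sketch. It assumes a simplified discrete classifier in which binary evidence variables are conditionally independent given the class; this is an assumption made for illustration and is not necessarily the structure shown in Figure 1. The class names, peak names, and probabilities are hypothetical.

```python
def class_posterior(priors, likelihoods, evidence):
    """Return P(class | evidence) for a simple discrete Bayes classifier.

    priors:      {class: P(class)}
    likelihoods: {class: {variable: P(variable = 1 | class)}}
    evidence:    {variable: 0, 1, or None}; None marks a missing value
    """
    scores = {}
    for c, prior in priors.items():
        p = prior
        for var, value in evidence.items():
            if value is None:
                continue  # a missing value is marginalized out (factor of 1)
            p1 = likelihoods[c][var]
            p *= p1 if value == 1 else 1.0 - p1
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def expected_missing(priors, likelihoods, evidence, var):
    """E(X_var | e): expected value of a missing binary variable."""
    post = class_posterior(priors, likelihoods, evidence)
    return sum(post[c] * likelihoods[c][var] for c in post)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
priors = {"normal": 0.9, "defect": 0.1}
likelihoods = {
    "normal": {"peak_a": 0.1, "peak_b": 0.2},
    "defect": {"peak_a": 0.8, "peak_b": 0.7},
}
evidence = {"peak_a": 1, "peak_b": None}  # peak_b was not measured

posterior = class_posterior(priors, likelihoods, evidence)
filled = expected_missing(priors, likelihoods, evidence, "peak_b")
print(posterior, filled)
```

The result is a probability for every class rather than a single label, and the missing peak is replaced by its expected value under the current evidence.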

The single-layer structure of the Bayesian classifier limits its usefulness and flexibility to linear discrimination. Bayesian networks, on the other hand, allow for complex multi-layer interactions, which are better able to model non-linear and concave discriminants. A second important feature of a Bayesian network is that it contains fewer parameters than a Bayesian classifier because of its more sparsely connected structure. The classifier in Figure 1 requires 22 parameters (if all nodes are binary) to be specified. Figure 2 demonstrates a possible alternative Bayesian network structure for the same problem, but this structure requires only 14 parameters to represent the same number of binary nodes. This difference becomes more dramatic as the number of evidence nodes increases, the situation present in a complex data set such as that seen in chromatography. There are many different ways of learning this structure. The most common structure learning algorithm, termed Structural EM, is based on the EM parameter learning algorithm [10]. However, learning the optimal structure of a Bayesian network is not a trivial problem and is computationally intensive. Once the structure is learned, the network classifies data very rapidly.
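The effect of sparse connections on the parameter count can be checked with a short calculation. The sketch below assumes every node is binary, so a node with k parents needs one probability per parent configuration, 2**k in all. The two structures are hypothetical and do not reproduce Figure 1 or Figure 2.

```python
def parameter_count(parents):
    """Count free parameters in a network of binary nodes.

    parents maps each node name to the list of its parent nodes. A binary
    node with k binary parents needs one probability for each of the 2**k
    parent configurations.
    """
    return sum(2 ** len(p) for p in parents.values())


# Hypothetical structures over the same five nodes.
# Dense: the class node X depends directly on all four evidence nodes.
dense = {"e1": [], "e2": [], "e3": [], "e4": [],
         "X": ["e1", "e2", "e3", "e4"]}

# Sparse: evidence nodes are chained in pairs and X has only two parents.
sparse = {"e1": [], "e2": ["e1"], "e3": [], "e4": ["e3"],
          "X": ["e2", "e4"]}

print(parameter_count(dense), parameter_count(sparse))
```

Because the count grows exponentially with the number of parents of a node, the gap between the dense and sparse structures widens quickly as evidence nodes are added.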

We have explored ways in which the Bayes’ methods described above and decision tree methods such as CART might be combined to improve the classification of high-dimensional, complex systems in chemistry. We developed a novel hybrid method, combining a classifier based on Bayesian statistics and a CART variable selection step, for analysis of small urine samples from infants to screen for about 20 metabolic birth defects using their chemical signatures in gas chromatographic profiles. The classifier we developed for this chemical analysis is shown in Figure 3.


 
Figure 3. Logical representation of a Bayesian network for classification of infant metabolic defects from gas chromatographic data. Blue ovals represent chromatographic peaks, grey ovals represent metabolic diseases and magenta ovals represent biochemical class of disease. The central grey oval indicates that there is/is not a metabolic defect present from peak data. Data are classified by loading the chromatographic data into the blue ovals and observing the values in the grey and magenta ovals.

This is an entirely new method for classification that has similarities to a Bayesian network. Unlike these classifiers, though, our method does not require an expert to provide the rules to the classifier for the structure learning step. Our methodology is independent of the type of data available or the problem to be solved, and it should be applicable to a wide variety of problems. Our hybrid classifier used CART to select highly informative variables for use in a Bayes’ classifier or a Bayes’ network. Through this combination, we gained an ability to discriminate outliers and to estimate the uncertainty of a classification result. 
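The variable selection idea can be sketched as follows. This is a simplified stand-in, not the published hybrid method: instead of growing a full CART tree, it ranks each variable by the largest Gini impurity decrease that a single binary split on that variable can achieve, and keeps the top k. The function names and the small data set are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all thresholds on one variable."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def select_variables(X, y, k):
    """Return indices of the k variables with the largest single-split gain."""
    n_vars = len(X[0])
    gains = [best_split_gain([row[j] for row in X], y) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: gains[j], reverse=True)[:k]


# Hypothetical peak areas for four samples and three peaks.
X = [[0.2, 1.0, 5.0],
     [0.9, 1.2, 4.0],
     [0.4, 3.1, 5.5],
     [0.8, 3.3, 4.2]]
y = ["a", "a", "b", "b"]

selected = select_variables(X, y, 1)
print(selected)
```

Only the selected columns would then be passed to a Bayes’ classifier or a Bayes’ network, which keeps the number of evidence nodes, and therefore the number of parameters, small.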

More recently, we have worked to improve the CART selection step of the hybrid classifier by investigating different ways to improve decision rules, including use of fuzzy partitions [12-14]. We have also investigated the performance of the hybrid classifier in dealing with large amounts of missing data in a data set [15], including an analysis of a well-known GC fuel spill dataset, and are now putting a Bayes classifier of the sort described above to the test with very large sets of complex GC-MS data from oil prospecting measurements. We have also developed a novel way of using classifiers at stem and leaf nodes in CART and have demonstrated its use in classification [16].

References

1.    B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, New York (1996).
2.    G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York (1992).
3.    M. Sharaf, D. Illman, and B. Kowalski, Chemometrics, Wiley, New York (1986).
4.    W. Venables, B. Ripley, Modern Applied Statistics with S-Plus, Springer-Verlag, New York (1994).
5.    K.L. Mello and S.D. Brown, J. Chemometrics, 1999, 13, 579-590.
6.    K.L. Mello and S.D. Brown, “Combining Recursive Partitioning and Uncertain Reasoning for Data Exploration and Characteristic Prediction,” Proc. AAAI Symposium on Predictive Toxicology, 1999, 119-122.
7.    K.L. Mello and S.D. Brown, “System for Discovering Implicit Relationships in Data and a Method of Using the Same,” US Patent  No. 6466929, Issued 10/15/02.
8.    D. Heckerman, “Learning Probabilistic Networks,” Microsoft Research Technical Report MSR-TR-95-06.
9.   W. L. Buntine,  Journal of Artificial Intelligence Research 1994, 2, 159-225.
10. A.P. Dempster, N.M. Laird, D.B. Rubin.  Maximum Likelihood from Incomplete Data via the EM Algorithm.  Journal of the Royal Statistical Society, Series B, 1977, 39(1):1-38.
11. J. Pearl.  Probabilistic Reasoning in Intelligent Systems.  Morgan Kaufmann, San Mateo, California, 1988.
12. Y. Yuan and M.J. Shaw, Fuzzy Sets and Systems 1995, 60, 125-139.
13. A. Suarez and J.F. Lutsko, IEEE Trans. on Pattern Analysis and Machine Intell. 1999, 21, 1297-1311.
14. A.J. Myles and S.D. Brown, J. Chemometrics, 2003, 17, 531-536.
15. N.A. Woody and S.D. Brown, J. Chemometrics, 2003, 17, 266-273.
16. N.A. Woody, J.T. Mondick, and S.D. Brown, J. Chem. Inf. Comp. Sci., submitted (2007).




 

Data and Model Fusion for Multivariate Classification and Calibration

 It has long been recognized that making a chemical measurement provides a reduction of uncertainty as to the identity or amount of a chemical analyte, and that different measurement systems reduce that uncertainty by varying amounts. Eckschlager [1] and Winefordner [2] characterize an analytical measurement in terms of the information gain - the net change in uncertainty - afforded by the measurement. It was quickly realized that multivariate measurements offered the largest potential gains. Hirschfeld [3] also saw the relationship between measurement and information gain when he stressed the potential advantage of multidimensional, multivariate (hyphenated) analyses. Kowalski and others [4] pointed out the second-order advantage of some of these analyses. Now, some of the applications of chemometrics to “two-way” and higher dimensional measurement systems are well enough established to have reached application status.

There is no absolute requirement from information theory that the measurements be multidimensional to provide additional information, however. Much of the same information is, in principle, available from a one-dimensional vector made of separate chemical measurements. Multiple measurements made on the same sample are now increasingly feasible and affordable. It costs relatively little to collect another set of spectra on the same sample, especially if that sample is a part of a flowing stream in some chemical process. The problem lies in accessing that additional information. Repeated efforts have demonstrated that merely stringing spectra of different kinds together seldom improves a multivariate analysis. A similar problem exists with the combination of chemical sensors of different types to yield a fused signal. The use of multiple, high-throughput, nonspecific measurements done on a sample with subsequent mathematical analysis of those disparate outputs as though they were a single response is called data or sensor fusion. Analysis of fused data approximates the cognitive process used by humans to make inferences about their environment from disparate data available from their senses. There are several types of data fusion methods reported in the literature of artificial intelligence:

Low-level fusion. Here, the multiple responses for a sample are simply concatenated and processed as if they were a single signal. Because of its appealing simplicity, this is the most commonly tried form of data fusion by chemists. However, due to the difficulty in dealing with raw spectral data’s redundancies and noise by conventional chemometric means, there has been limited success from low-level fusion efforts.

Mid-level fusion. In this sort of data fusion, the aim is to extract a few, pertinent features from the fused multivariate data, then process them as a kind of “meta-signal”, rather than the string of fused data. This approach has not seen much study in chemometrics, as it is difficult to extract reliable features from typical spectra with any sense of their reliability, but it has received a good deal of attention in military applications. Given new capabilities in reducing irrelevant signal and background, mid-level fusion may have some promise for analysis of multiple chemical responses.

High-level fusion. This approach to data fusion has mostly focused on classification. In high-level fusion, the class outputs of individual sensor classifiers, often called the identity declarations, are the signals being subjected to fusion and analysis. Because class assignments and not analytical signals are being combined, the approach is suited to all types of measurements. High-level fusion is often implemented using either heuristics (simple voting of the classifiers) or probability estimation methods. One way to perform high-level fusion is to use Bayesian inference [5]. Another is based on possibility theory, which is a generalization of Bayesian methods for data with high uncertainty [6].
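The two routes to high-level fusion named above, voting and Bayesian inference, can be compared in a short sketch. The Bayesian version assumes the sensors are conditionally independent given the class, which is a simplifying assumption made here and not a claim about the cited work. The sensor names and probabilities are hypothetical.

```python
from collections import Counter


def fuse_by_vote(declarations):
    """Majority vote over the class labels declared by each classifier."""
    return Counter(declarations).most_common(1)[0][0]


def fuse_by_bayes(posteriors, prior):
    """Combine per-sensor class posteriors into one posterior.

    Assumes sensors are conditionally independent given the class, so
    P(c | all sensors) is proportional to
    prior[c] * product over sensors of (P_i(c | x_i) / prior[c]).
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for post in posteriors:
            score *= post[c] / p_c
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical class posteriors from three separate instrument classifiers.
prior = {"A": 0.5, "B": 0.5}
posteriors = [
    {"A": 0.6, "B": 0.4},  # e.g. a classifier built on NIR spectra
    {"A": 0.7, "B": 0.3},  # e.g. a classifier built on Raman spectra
    {"A": 0.4, "B": 0.6},  # e.g. a classifier built on GC profiles
]

declarations = [max(p, key=p.get) for p in posteriors]
voted = fuse_by_vote(declarations)
fused = fuse_by_bayes(posteriors, prior)
print(voted, fused)
```

Voting uses only the declared labels, while the Bayesian combination also uses how confident each classifier was, so it returns a fused probability for each class.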

Data fusion can also be thought of in terms of the merging of databases from different sources into a single database. In combining databases, there may be missing individual values or, on occasion, missing variables, in that variables in one database may not appear in the second database to be merged with the first. This aspect of data fusion, then, deals with the imputation of missing values in partially shared data tables using Bayesian estimation from conditional probability tables [7]. That sense of data fusion, while useful, especially in database-intensive applications such as those in marketing and bioinformatics, is not the focus of this proposal, though it is an obvious extension of work mentioned here.

Researchers in chemometrics have, in the main, taken pains to avoid fusing data except for the analysis of some limited sensor array data where the sensors are generally similar in behavior. There is a recent paper on the classification of a small group of French wines from fused chemical data using low- and high-level data fusion methods. In this work, the authors obtained little improvement from low-level fusion, but when they used simple Bayesian classification, they were able to reduce the classification error rate by as much as 10-fold [5].

Recent work in this group has encouraged us to undertake research in the analysis of fused data from typical measurements made on chemical systems because our expertise in the induction of fuzzy classifiers, especially those with leaf node classifiers, is ideally suited to mid- and high-level fusion efforts. Further, Bayesian networks and inference also have direct application, offering real advantages over the more limited Bayesian classifiers used by others. Our expertise in using Bayesian networks, especially those with continuous nodes, should also be beneficial. With our recent advances in methodology for isolating local effects relevant to the analyte, including methods developed in a parallel project, the future of fusion of chemical data appears bright.

References

1. Eckschlager, K.; Danzer, K. Information Theory in Analytical Chemistry, John Wiley (Pfeiffer), 1994.
2. Fitzgerald, J.; Winefordner, J.D. Rev. Anal. Chem., 1975, 2(4) 299.
3. Hirschfeld, T. Analyt. Chem., 1980, 52(2), 297A.
4. Fleming, C.; Kowalski, B.R. In Encyclopedia of Analytical Chemistry. Myers, R.A., Ed. v. 11, Wiley and Sons: Chichester, UK, 9736-64 (2000).
5. Roussel, S.; Bellon-Maurel, V.; Roger, J.-M.; Grenier, P. Chemom. Intell. Lab. Syst. 2003, 65, 209.
6. Caraux, G.; Lechevallier, Y. Artif. Intell. Rev. 1996, 10, 219.
7. Saporta, G. Comp. Stat. Data Analysis 2002, 38, 465.




Improving the Robustness and Transfer of Multivariate Calibrations

A significant limitation to the use of multivariate regression in NIR spectrometry and elsewhere is the difficulty with which a predictive multivariate calibration model developed from spectra under one set of environmental and instrumental conditions is used on similar spectra obtained under different conditions, a process known as “standardization” or “calibration transfer” in the literature of multivariate calibration [1,2]. We have recently reviewed the literature of multivariate calibration transfer in an extensive, critical review [3], covering, among others, approaches aimed at correcting for the unintended instrumental variation incorporated in spectra and subsequently embedded in the multivariate calibration model. These approaches span the use of simple multivariate regression [4-6], orthogonal signal correction (OSC) [7], genetic algorithms (GA) [8], Fourier and wavelet transforms [9-11], neural networks (NN) [12], maximum likelihood principal component analysis (MLPCA) [13], finite impulse response (FIR) filtering [14,15], and positive matrix factorization (PMF) [16]. Our FIR method [14,15] remains the only published method not requiring a subset of standards to be measured on main and remote instruments. Our recent work has given a new, robust FIR method free of the transfer artifacts previously seen [17].

Making calibrations more robust generally involves identification and downweighting or removal of unwanted variation. Multivariate calibrations based on the usual soft modeling methods commonly used are inherently coupled to the instrument and the environment used during the calibration step. Thus, changing instruments or placing the new samples in a slightly different environment (e.g., measuring at a different temperature) can lead to errors from the predictive model. For this reason, it is common for process chemists to consider simpler modeling with one or more selected variables to monitor the process. There are many situations where selection of one or two variables from the many available will not permit useful modeling, however. Empirical selection of variables in NIR data is especially difficult because of the high degree of overlap and breadth of the spectral bands – the main reason for the use and the popularity of soft modeling approaches in this spectroscopic region.

1. Effective Removal of Unrelated Information in the Spectral Data

A second difficulty arises in those calibrations where, over time, changes in the raw materials used in the process or in the process itself lead to slight differences in the distribution of products generated. This change degrades the calibration because the variation captured and modeled in the calibration step now differs from that seen in the spectra coming from the process. Unlike transfer, however, the failure of calibrations in this case comes not from a change in the instrument or the environment of the measurement but from the very mixture subject to analysis, and the resulting variation need not be low frequency, as the differences in spectra can be caused by differences in the composition of the process stream. As with transfer, the usual solution is a complete (and expensive) recalibration of the measurement to account for the new, interfering components. Alternatives to recalibration in this case have involved the automated selection of variables for prediction based on a calculation of real-time, weighted spectral residuals calculated point by point [18].

Some of the rationale behind the effort to make calibrations more robust has been laid out in a recent article [19] where approaches based on orthogonal signal correction (OSC) and wavelet multi-resolution are briefly discussed. The aim of OSC preprocessing of data is to remove unwanted instrumental and environmental effects through removal of all “non-relevant” components of the spectral data (that is, signal uncorrelated to the property) for a given response vector [20-25]. As discussed in recent papers regarding the validity of OSC [26-28], however, the various OSC algorithms published by Wold and others are actually based on the same mathematical fundamentals as PLS regression, and no real improvement in model performance is gained using OSC preprocessing. Wavelet preprocessing is also increasingly available in commercial software, but as with OSC, there is no easy route to improved model performance by routine application of a wavelet transform. Like any variable selection scheme, multi-scale preprocessing with wavelets must be done carefully to avoid information leaks from the loss of useful analyte signal. The loss of information from careless application of variable selection often decreases rather than improves predictive performance in a calibration [29] or a classification.


Our current work on preprocessing of calibrations focuses on the key issues discussed above. 



2. Multiscale nature of spectral signals / multiscale calibration models 

The spectral vectors in a calibration can be viewed in a multiscale way, since in practice they contain contributions in different frequency bands and at different wavelengths from a variety of sources, such as detector noise, instrumental disturbances and differences, temperature effects, and sample variations. As demonstrated in Figure 1 below, the spectral vectors, specifically NIR spectra decomposed by the wavelet prism (WP) [29,30], can be interpreted as the summations of the frequency band components (scales) over the wavelet frequency domain. The spectral variations related to the properties of interest appear mostly in the middle range of frequencies. Note that a high scale corresponds to a low-frequency band. The multi-scale representation of spectral signals over the wavelet frequency domain provides a way to improve calibration models by isolating the irrelevant variations in a set of spectra [29]: e.g., by removing the low-frequency, non-constant background component and any high-frequency noise signals. In our recent work, we have also studied the ability of the information entropy to distinguish frequency differences between “background” and analyte in multi-resolution background removal, and found very good performance with synthetic and real data sets [29] in multivariate calibration and in classification.
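The idea that a spectrum is the sum of its frequency band components can be made concrete with a small sketch. It uses a Haar multiresolution decomposition as a stand-in for the wavelet prism; the choice of the Haar wavelet, the function names, and the eight-point "spectrum" are assumptions made for illustration only.

```python
def block_means(signal, width):
    """Replace each block of `width` points by its mean (a Haar smooth)."""
    out = []
    for start in range(0, len(signal), width):
        block = signal[start:start + width]
        out.extend([sum(block) / len(block)] * len(block))
    return out


def haar_prism(signal, levels):
    """Split a signal into levels + 1 components that sum to the signal.

    Components 1..levels are Haar detail bands, ordered from high frequency
    (scale 1) to low frequency; the last component is the smooth background.
    Every component has the same length as the original signal.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    components = []
    smooth = list(signal)
    for level in range(1, levels + 1):
        coarser = block_means(signal, 2 ** level)
        components.append([s - c for s, c in zip(smooth, coarser)])
        smooth = coarser
    components.append(smooth)
    return components


# Hypothetical eight-point "spectrum".
spectrum = [1, 3, 2, 6, 5, 5, 9, 7]
components = haar_prism(spectrum, 3)

# Summing all components recovers the original signal.
rebuilt = [sum(vals) for vals in zip(*components)]

# Keeping only the middle scales drops the highest-frequency band and the
# smooth background, in the spirit of the preprocessing described above.
filtered = [sum(vals) for vals in zip(*components[1:-1])]
print(rebuilt, filtered)
```

With l decomposition levels the sketch returns l+1 components, each in the original wavelength domain, so individual scales can be kept, removed, or modeled separately.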

The novel use of wavelets has many applications relevant to this project. One comes in multiscale calibration. Significant progress in the last year was achieved by our use of the wavelet transform to consider the multiscale nature of spectral data during calibration modeling [30,31]. Direct calibration by partial least squares (PLS) or principal component regression (PCR) explains spectral variance by using latent variables (or principal components) in a single-scale (the raw wavelength) representation. As a result, more latent variables often must be used to explain local sources of variance, leading to unnecessarily complicated, and often overfit, calibration models. The alternative approach, which involves selecting wavelengths or bands prior to modeling with PLS, often leaks information because a small amount of critical analyte signal is embedded in the regions that are discarded.

Our new method, dual-domain regression analysis (DDRA), is aimed at dealing with these issues to reduce model complexity and to improve the life and performance of calibration models. It is a two-step procedure, conducted in a way similar to calibration using any of the common regression methods. The first step is to establish a multiscale model in a calibration set between the dependent vector y (the property) and the independent multiscale spectral tensor X = {Xk, k = 1, 2, …, l+1} generated from the wavelet prism discussed above. The second step is to predict values for the dependent properties based on a test set tensor Xu = {X'(1,u), …, X'(l+1,u)}'. The goal of our multiscale regression analysis is to calculate a vector of multiscale regression coefficients b with the lowest prediction error. An approximate solution to the equation implied in Figure 2 allows for a simple algorithm with very good performance. For example, separate PLS models are first created for each frequency component of the multiscale spectra. The regression vectors in the multi-scale PLS regression matrix are then weighted according to a cross-validation of the separate PLS models on their corresponding frequency components, using a relation in which sk is the reciprocal of the cross-validation error from the regression of the wavelet coefficients describing the kth wavelet scale in the calibration set on the property. In this way, the property-wavelet coefficient relationships at all frequencies contribute as appropriate to an overall time-domain model for the property. We published one paper on this new method (it was highlighted by Chemometrics World earlier this year) and have a second paper now in press in Analytica Chimica Acta. Results for the second paper were presented at the Eighth Chemometrics in Analytical Chemistry Conference as a plenary lecture.
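The weighting step can be sketched numerically. The exact weighting relation is not reproduced on this page, so the sketch assumes the simplest form consistent with the description: each scale's weight is its reciprocal cross-validation error sk, normalized so that the weights sum to one. The per-scale regression vectors are taken as given (in practice they would come from separate PLS models), intercepts are omitted, and all numbers are hypothetical.

```python
def scale_weights(cv_errors):
    """Weights proportional to s_k = 1 / (cross-validation error of scale k).

    Normalizing the weights to sum to one is an assumption of this sketch.
    """
    s = [1.0 / e for e in cv_errors]
    total = sum(s)
    return [s_k / total for s_k in s]


def predict(components, coefficients, weights):
    """Weighted sum of the per-scale linear predictions for one sample.

    components:   list of per-scale spectral vectors for the sample
    coefficients: list of per-scale regression vectors (same shapes)
    weights:      one weight per scale
    """
    y_hat = 0.0
    for x_k, b_k, w_k in zip(components, coefficients, weights):
        y_hat += w_k * sum(x * b for x, b in zip(x_k, b_k))
    return y_hat


# Hypothetical cross-validation errors for three scales; the middle scale
# predicts best and therefore receives the largest weight.
cv_errors = [2.0, 0.5, 1.0]
weights = scale_weights(cv_errors)

# Hypothetical two-point scale components and regression vectors.
components = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
coefficients = [[7.0, 0.0], [0.0, 3.5], [3.5, 3.5]]
y_hat = predict(components, coefficients, weights)
print(weights, y_hat)
```

Scales with poor cross-validated performance are down-weighted rather than discarded, which avoids the information leak that comes from removing bands outright.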



Our results on synthetic data sets, public-domain data and on other data suggest that significant improvement in both model size and in prediction errors can often be achieved by application of this simple modeling method as compared to conventional PLS with or without conventional wavelet processing with the discrete wavelet transform.

References

1. deNoord, O.E. Chemom. Intell. Lab. Syst. 1994, 25, 85.
2. Bouveresse, E. and Massart, D.L. Vib. Spectrosc. 1996, 11, 3.
3. Feudale, R.N.; Woody, N.A.; Myles, A.J.; Tan, H.-W.; Brown, S.D.; Ferré, J. Chemom. Intell. Lab. Syst. 2002, 64, 181.
4. Wang, Y.D.; Veltkamp, D.J. and Kowalski, B.R. Anal. Chem. 1991, 63, 2750.
5. Wang, Y.D.; Lysaght, M.J. and Kowalski, B.R. Anal. Chem. 1992, 64, 562.
6. Bouveresse, E. and Massart, D.L. Chemom. Intell. Lab. Syst. 1996, 32, 201.
7. Sjöblom, J.; Svensson, O.; Josefson, M.; Kullberg, H. and Wold, S. Chemom. Intell. Lab. Syst. 1998, 44, 229.
8. Ozdemir, D.; Mosley, M.; Williams, R. Appl. Spectrosc. 1998, 52, 599.
9. Chen, C.S.; Brown, C.; Lo, S.C. Appl. Spectrosc. 1997, 51, 744.
10. Walczak, B.; Massart, D.L. Chemometrics Intell. Lab. Syst. 1997, 36, 81.
11. Tan, H.W.; Brown, S.D. J. Chemom. 2001, 15, 647.
12. Goodacre, R.; Timmins, E.M.; Jones, A.; Kell, D.B.; Maddock, J.; Heginbothom, M.L.; Magee, J.T. Anal. Chim. Acta, 1997, 348, 511.
13. Andrews, D.T.; Wentzell, P.D. Anal. Chim. Acta, 1997, 350, 341.
14. Blank, T.B.; Sum, S.T.; Brown, S.D. and Monfre, S.L. Anal. Chem. 1996, 68, 2987.
15. Sum, S.T. and Brown, S.D. Appl. Spectrosc. 1998, 52, 869.
16. Xie, Y.L.; Hopke, P.K. Anal. Chim. Acta, 1999, 384, 193.
17. Tan, H.W.; Sum, S.T.; Brown, S.D. Appl. Spectrosc. 2002, 56, 1098.
18. Gemperline, P.J.; Cho, J.; Aldridge, P.K.; Sekulic, S.S. Anal. Chem. 1996, 68, 2913.
19. Brown, S.D.; Tan, H.-W.; Feudale, R. ACS Sym. Ser. in press, (2004).
20. Wold, S.; Antti, H.; Lindgren, F. and Öhman, J. Chemom. Intell. Lab. Syst. 1998, 44, 175.
21. Andersson, C.A. Chemom. Intell. Lab. Syst. 1999, 47, 51.
22. Wise, B.M.; Gallagher, N.B. http://www.eigenvector.com/MATLAB/OSC.html.
23. Fearn, T. Chemom. Intell. Lab. Syst. 2000, 50, 47.
24. Wold, S.; Trygg, J.; Berglund, A.; Antti, H. Chemom. Intell. Lab. Syst. 2001, 58, 131.
25. Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119.
26. Fernández Pierna, J.A.; Massart, D.L.; de Noord, O.E.; Ricoux, Ph. Chemom. Intell. Lab. Syst. 2001, 55, 101.
27. Westerhuis, J.A.; de Jong, S.; Smilde, A.K. Chemom. Intell. Lab. Syst. 2001, 56, 13.
28. Svensson, O.; Kourti, T.; MacGregor, J.F. J. Chemom. 2002, 16, 176.
29. Tan, H.W.; Brown, S.D., J. Chemom. 2003, 17, 111.
30. Tan, H.W.; Brown, S.D., J. Chemom. 2002, 16, 228.
31. Tan, H.W.; Brown, S.D., Anal. Chim. Acta, 2003, 490, 291.




©2007 University of Delaware

Site Created by S.D. Brown       URL of this page: http://www.udel.edu/chemo/Links/chemo_res.htm
Last revision: 28 August 2007