Novel Methods for Improved Classification of High-Dimensional Data Sets
Computer-aided classification of gas chromatographic signatures is
being used to analyze complex chromatographic data from biological systems.
Modern-day gas chromatograms often yield information that can be exceedingly
complex. It is not unusual to see dozens and, in some cases, even
hundreds of components resolved by highly efficient fused silica capillary
columns. Important information contained in such complex chromatograms
is not readily interpretable without extensive use of advanced computational
techniques.
Standard statistical techniques, such as linear and quadratic discriminant
analyses (LDA and QDA), have a sound foundation in statistics, and can
be interpreted relatively easily. These methods are commonly used to help
in the computer-aided classification of signatures from chemical instrumentation.
These discriminant methods make assumptions about the underlying class
structures of the data [1,2], however. Conventional data analysis, using
either ad hoc methods or more conventional statistical methods such as PCA
compression followed by discriminant analysis, is critically dependent on
the careful preprocessing of the data to warp the decision space in a way
that, in the best cases, reflects differences in data origins. Because
of this limitation, getting a good idea of the presence of outliers or determining
the uncertainty in a classification has generally been neglected. Pursuing
optimal classification has usually led to a focus on the preprocessing step
as a way to help reduce the space of the data without losing too much of
the information carried in the chromatograms or introducing too much bias
to the data analysis through assumptions as to the underlying form of the
classes believed present in the dataset. A problem with this approach is
that the emphasis is usually placed on separation of classes and not on
an assessment of the uncertainty in any sample’s class assignment or on
the reliable detection of outliers. There is also no easy way to relate
the measured data to the class assignment and its uncertainty, or to incorporate
expert, external information into the classification.
Other classification techniques, including the Bayes’ classifier,
Soft Independent Modeling of Class Analogy (SIMCA), and neural networks
make fewer assumptions about the intrinsic nature of the classes in the
data, but can require larger amounts of data to build adequately predictive
models [1,3]. Tree-based methods such as C4.5 or Classification and Regression
Trees (CART) make no assumptions about class structure within the data and
do not require large amounts of data, but traditionally have been restricted
to making binary partitions of the data space in order to separate the classes [1,4].
While the methods of Bayesian statistics have long been a mainstay
of data analysis in the areas of economics, social sciences and artificial
intelligence, they have only recently been applied to chemical and biological
problems [5-7]. The methods are not at all new. Inference and learning
in Bayesian systems are well-documented areas with many excellent sources
[8-11]. Buntine [9] presents an excellent literature review of learning,
while Pearl [11] provides much of the basis for computation in these systems.
Here, it will suffice to explain what is meant by these topics. Inference
is the process of turning local conditional probabilities into globally
consistent tables and calculating posterior probabilities for nodes in the
network. Learning is the determination of the parameters of the network
from data: means and covariances in the continuous case and probabilities
in the discrete case. A Bayes classifier, shown in Figure 1, is a simple
one-layer representation of a probabilistic discriminant. This provides
a probabilistic model with limited capacity, as it can only represent linear
relationships.
The directions of the arcs represent the causation of the value of
X by the values of the evidence nodes Pa(X). This Figure describes
a standard pattern-recognition/classification problem. Given a dataset
of variables and their classification, the Bayesian classifier can be used
to model the inherent relationships in the data and, given a test case,
predict the classification of that sample with a level of certainty represented
by the probability that X takes on class value i. A Bayesian classifier
has several interesting features which we are interested in exploiting:
1) A probability of class assignment is returned instead of just a class
assignment; 2) Missing data is handled through an inference procedure where
the missing value is filled with E(Xk|e), where e refers to the current
set of evidence; 3) Given any subset of evidence and a consistent network
the probability of all other values can be calculated.
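The three features listed above can be illustrated with a minimal sketch. It assumes a naive Bayes factorization with Gaussian evidence nodes, which is only one possible reading of the one-layer classifier in Figure 1. The function names (`fit`, `posterior`, `expected_missing`) and the toy data are hypothetical and are not taken from the work described here.

```python
import math

def fit(samples, labels):
    """Learning: estimate class priors and per-class feature means/variances."""
    model = {}
    for c in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == c]
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        model[c] = (len(rows) / len(samples), stats)
    return model

def posterior(model, x):
    """Inference: return P(class | evidence). None marks a missing value."""
    log_post = {}
    for c, (prior, stats) in model.items():
        lp = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            if value is None:
                continue  # under the naive Bayes assumption, a missing feature drops out
            lp += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_post[c] = lp
    top = max(log_post.values())
    unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

def expected_missing(model, x, k):
    """E(X_k | e): posterior-weighted class mean of the missing feature k."""
    post = posterior(model, x)
    return sum(post[c] * model[c][1][k][0] for c in post)

# Toy data: two hypothetical peak intensities per sample.
X = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.8], [3.0, 1.0], [3.2, 1.1], [2.9, 0.8]]
y = ["normal", "normal", "normal", "defect", "defect", "defect"]
model = fit(X, y)
full = posterior(model, [3.1, 1.0])       # all evidence observed
partial = posterior(model, [3.1, None])   # second peak missing
filled = expected_missing(model, [3.1, None], 1)
```

The classifier returns a probability for every class rather than a bare label, still produces a posterior when a value is missing, and can estimate the missing value from the remaining evidence.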
The single-layer structure of the Bayesian classifier limits its usefulness
and flexibility to linear discrimination. Bayesian networks, on the
other hand, allow for complex multi-layer interactions which are more able
to model non-linear and concave discriminants. A second important
feature of a Bayesian network is that it contains fewer parameters than a
Bayesian classifier due to its more sparsely connected structure.
The classifier in Figure 1 requires 22 parameters (if all nodes are binary)
to be specified. Figure 2 demonstrates a possible
alternative Bayesian network structure for the same problem, but this
structure requires only 14 parameters to represent the same number of
binary nodes. This difference becomes more dramatic as the number
of evidence nodes increases, the situation present in a complex data set
such as that seen in chromatography. There are many different ways of
learning this structure. The most common structure learning algorithm
is based on the EM parameter learning algorithm, termed Structural EM [10].
However, learning the optimal structure of a Bayesian network is not a trivial
problem and is computationally intensive. Once the structure is learned,
the network classifies data very rapidly.
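The effect of sparse connectivity on the parameter count can be checked with a short calculation. The sketch below counts free conditional-probability parameters for discrete nodes. The two structures are hypothetical examples chosen for illustration; they are not the networks of Figures 1 and 2 and are not meant to reproduce the counts of 22 and 14 quoted above.

```python
def free_parameters(parents, states=2):
    """Count free CPT entries: each node needs (states - 1) values
    for every joint configuration of its parents."""
    return sum((states - 1) * states ** len(p) for p in parents.values())

# Hypothetical dense structure: class node X depends directly on all evidence.
dense = {
    "e1": [], "e2": [], "e3": [], "e4": [], "e5": [],
    "X": ["e1", "e2", "e3", "e4", "e5"],
}

# Hypothetical sparse structure over the same six binary nodes.
sparse = {
    "e1": [], "e2": [],
    "e3": ["e1"], "e4": ["e2"], "e5": ["e3"],
    "X": ["e4", "e5"],
}

n_dense = free_parameters(dense)    # 5 roots + 2**5 for X
n_sparse = free_parameters(sparse)  # 1 + 1 + 2 + 2 + 2 + 4
```

Because a node with k binary parents needs 2^k entries, the gap between dense and sparse structures widens quickly as evidence nodes are added.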
We have explored ways in which the Bayes’ methods described above and
decision tree methods such as CART might be combined to make improvements
in the classification of high-dimension, complex systems in chemistry.
We developed a novel hybrid method, combining a classifier based on Bayesian
statistics and a CART variable selection step for analysis of small urine
samples from infants to screen for about 20 metabolic birth defects using
their chemical signatures in gas chromatographic profiles. The classifier
we developed for this chemical analysis is shown in Figure 3.
Figure 3. Logical representation of a Bayesian network for classification
of infant metabolic defects from gas chromatographic data. Blue ovals represent
chromatographic peaks, grey ovals represent metabolic diseases and magenta
ovals represent the biochemical class of disease. The central grey oval indicates
whether or not a metabolic defect is present, based on the peak data. Data are
classified by loading the chromatographic data into the blue ovals and
observing the values in the grey and magenta ovals.
This is an entirely new method for classification that has similarities
to a Bayesian network. Unlike conventional Bayesian network classifiers, though, our method does
not require an expert to provide the rules to the classifier for the structure
learning step. Our methodology is independent of the type of data available
or the problem to be solved, and it should be applicable to a wide variety
of problems. Our hybrid classifier used CART to select highly informative
variables for use in a Bayes’ classifier or a Bayes’ network. Through this
combination, we gained an ability to discriminate outliers and to estimate
the uncertainty of a classification result.
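The variable selection idea can be sketched as follows. This is a simplified stand-in for the CART step, not the published hybrid method: it ranks each variable by the best Gini impurity decrease obtainable from a single binary split (the CART splitting criterion applied at the root only) and keeps the top-ranked variables for a downstream Bayes classifier. All names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_gain(column, labels):
    """Largest impurity decrease over all binary thresholds on one variable."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    values = sorted(set(column))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(column, labels) if v <= t]
        right = [y for v, y in zip(column, labels) if v > t]
        child = len(left) / n * gini(left) + len(right) / n * gini(right)
        best = max(best, parent - child)
    return best

def select_variables(X, y, keep):
    """Return the indices of the `keep` variables with the largest split gain."""
    gains = [best_split_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:keep])

# Toy data: column 0 is uninformative, column 1 separates the classes,
# column 2 is partly informative.
X = [[0.3, 1.0, 2.0], [0.9, 1.1, 2.5], [0.1, 0.9, 4.0],
     [0.8, 3.0, 4.1], [0.2, 3.1, 4.4], [0.7, 2.9, 2.2]]
y = [0, 0, 0, 1, 1, 1]

chosen = select_variables(X, y, keep=2)
reduced = [[row[j] for j in chosen] for row in X]  # input for the Bayes classifier
```

A full CART tree would select variables at every node rather than only at the root, so this ranking should be read as a first approximation.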
More recently, we have worked to improve the CART selection step of
the hybrid classifier by investigating different ways to improve decision
rules, including use of fuzzy partitions [12-14]. We have also investigated
the performance of the hybrid classifier in dealing with large amounts
of missing data in a data set [15], including an analysis of a well-known
GC fuelspill dataset, and now are putting a Bayes classifier of
the sort described above to the test with very large sets of complex GC-MS
data from oil prospecting measurements. We have also developed a novel
way of using classifiers at stem and leaf nodes in CART and have demonstrated
its use in classification [16].
References
1. B. Ripley, Pattern Recognition and Neural Networks,
Cambridge University Press, New York (1996).
2. G. McLachlan, Discriminant Analysis and Statistical
Pattern Recognition, Wiley, New York (1992).
3. M. Sharaf, D. Illman, and B. Kowalski, Chemometrics,
Wiley, New York (1986).
4. W. Venables and B. Ripley, Modern Applied Statistics
with S-Plus, Springer-Verlag, New York (1994).
5. K.L. Mello and S.D. Brown, J. Chemometrics, 1999,
13, 579-590.
6. K.L. Mello and S.D. Brown, “Combining Recursive
Partitioning and Uncertain Reasoning for Data Exploration and Characteristic
Prediction,” Proc. AAAI Symposium on Predictive Toxicology, 1999, 119-122.
7. K.L. Mello and S.D. Brown, “System for Discovering
Implicit Relationships in Data and a Method of Using the Same,” US Patent
No. 6466929, Issued 10/15/02.
8. D. Heckerman, “Learning Probabilistic Networks,”
Microsoft Research Technical Report MSR-TR-95-06.
9. W. L. Buntine, Journal of Artificial Intelligence
Research 1994, 2, 159-225.
10. A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood
from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical
Society, Series B, 1977, 39(1), 1-38.
11. J. Pearl. Probabilistic Reasoning in Intelligent Systems.
Morgan Kaufmann, San Mateo, California, 1988.
12. Y. Yuan and M.J. Shaw, Fuzzy Sets and Systems 1995, 60, 125-139.
13. A. Suarez and J.F. Lutsko, IEEE Trans. on Pattern Analysis and
Machine Intell. 1999, 21, 1297-1311.
14. A.J. Myles and S.D. Brown, J. Chemometrics, 2003, 17, 531-536.
15. N.A. Woody and S.D. Brown, J. Chemometrics, 2003, 17, 266-273.
16. N.A. Woody, J.T. Mondick, and S.D. Brown, J. Chem. Inf. Comp. Sci., submitted (2007).
Data and Model Fusion for Multivariate Classification and Calibration
It has long been recognized that making a chemical measurement
provides a reduction of uncertainty as to the identity or amount of a chemical
analyte, and that different measurement systems reduce that uncertainty
by varying amounts. Eckschlager [1] and Winefordner [2] characterize an
analytical measurement in terms of the information gain (the net change
in uncertainty) afforded by the measurement. It was quickly realized that
multivariate measurements offered the largest potential gains. Hirschfeld
[3] also saw the relationship between measurement and information gain when
he stressed the potential advantage of multidimensional, multivariate (hyphenated)
analyses. Kowalski and others [4] pointed out the second-order advantage of
some of these analyses. Now, some of the applications of chemometrics to
“two-way” and higher dimensional measurement systems are well enough established
to have reached application status.
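The notion of information gain as a net change in uncertainty can be made concrete with a small calculation. The sketch below assumes Shannon entropy as the uncertainty measure and uses made-up probabilities for four candidate analyte identities before and after a measurement; neither the measure nor the numbers come from the cited works.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical beliefs about which of four analytes is present.
prior = [0.25, 0.25, 0.25, 0.25]      # before the measurement
posterior = [0.7, 0.1, 0.1, 0.1]      # after the measurement

gain = entropy(prior) - entropy(posterior)  # reduction in uncertainty, in bits
```

A measurement that leaves the distribution unchanged yields zero gain, while one that identifies the analyte with certainty yields the full prior entropy.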
There is no absolute requirement from information theory that the
measurements be multidimensional to provide additional information, however.
Much of the same information is, in principle, available from a one-dimensional
vector made of separate chemical measurements. Multiple measurements made
on the same sample are now increasingly feasible and affordable. It costs
relatively little to collect another set of spectra on the same sample, especially
if that sample is a part of a flowing stream in some chemical process. The
problem lies in accessing that additional information. Repeated efforts
have demonstrated that merely stringing spectra of different kinds together
seldom improves a multivariate analysis. A similar problem exists with the
combination of chemical sensors of different types to yield a fused signal.
The use of multiple, high-throughput, nonspecific measurements done on a
sample with subsequent mathematical analysis of those disparate outputs as
though they were a single response is called data or sensor fusion. Analysis
of fused data approximates the cognitive process used by humans to make
inferences about their environment from disparate data available from their
senses. There are several types of data fusion methods reported in the literature
of artificial intelligence:
Low-level fusion. Here, the multiple responses for a sample are simply
concatenated and processed as if they were a single signal. Because of
its appealing simplicity, this is the most commonly tried form of data fusion
by chemists. However, due to the difficulty in dealing with raw spectral
data’s redundancies and noise by conventional chemometric means, there
has been limited success from low-level fusion efforts.
Mid-level fusion. In this sort of data fusion, the aim is to extract
a few, pertinent features from the fused multivariate data, then process
them as a kind of “meta-signal”, rather than the string of fused data.
This approach has not seen much study in chemometrics, as it is difficult
to extract reliable features from typical spectra with any sense of their
reliability, but it has received a good deal of attention in military applications.
Given new capabilities in reducing irrelevant signal and background, mid-level
fusion may have some promise for analysis of multiple chemical responses.
High-level fusion. This approach to data fusion has mostly focused
on classification. In high-level fusion, the class outputs of individual
sensor classifiers, often called the identity declarations, are the signals
being subjected to fusion and analysis. Because class assignments and not
analytical signals are being combined, the approach is suited to all types
of measurements. High-level fusion is often implemented using either heuristics
(simple voting of the classifiers) or through probability estimation methods.
One way to perform high-level fusion is to use Bayesian inference [5]. Another
is based on possibility theory, which is a generalization of Bayesian methods
for data with high uncertainty [6].
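The two implementations of high-level fusion mentioned above, simple voting and probability estimation, can be sketched briefly. The Bayesian rule below assumes that the sensors are conditionally independent given the class, which is an assumption made for this illustration; the function names and numbers are hypothetical.

```python
from collections import Counter

def vote(declarations):
    """Heuristic fusion: the most common identity declaration wins."""
    return Counter(declarations).most_common(1)[0][0]

def bayes_fuse(posteriors, prior):
    """Probabilistic fusion of per-sensor class posteriors.

    Uses P(c | e1..en) proportional to P(c) * prod_i [P(c | e_i) / P(c)],
    which holds when sensors are conditionally independent given the class.
    """
    fused = {}
    for c in prior:
        value = prior[c]
        for post in posteriors:
            value *= post[c] / prior[c]
        fused[c] = value
    total = sum(fused.values())
    return {c: v / total for c, v in fused.items()}

# Three hypothetical sensor classifiers reporting on the same sample.
prior = {"A": 0.5, "B": 0.5}
sensor_posteriors = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.7, "B": 0.3},
    {"A": 0.45, "B": 0.55},
]
declarations = [max(p, key=p.get) for p in sensor_posteriors]

winner = vote(declarations)
fused = bayes_fuse(sensor_posteriors, prior)
```

Voting discards how confident each sensor was, whereas the probabilistic rule lets two moderately confident sensors outweigh one weakly dissenting sensor and reports the resulting certainty.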
Data fusion can also be thought of in terms of the merging of databases
from different sources into a single database. In combining databases,
there may be missing individual values or, on occasion, missing variables,
in that variables in one database may not appear in the second database
to be merged with the first. This aspect of data fusion, then, deals with
the imputation of missing values in partially shared data tables using Bayesian
estimation from conditional probability tables [7]. That sense of data
fusion, while useful, especially in database-intensive applications such
as those in marketing and bioinformatics, is not the focus of this proposal,
though it is an obvious extension of work mentioned here.
Researchers in chemometrics have, in the main, taken pains to avoid
fusing data except for the analysis of some limited sensor array data where
the sensors are generally similar in behavior. A recent paper reports
the classification of a small group of French wines from fused chemical
data using low- and high-level data fusion methods. In this work, the authors
got little improvement from a low-level fusion, but when they used simple
Bayesian classification, they were able to reduce the error rate for classification
by as much as 10-fold [5].
Recent work in this group has encouraged us to undertake research
in the analysis of fused data from typical measurements made on chemical
systems because our expertise in the induction of fuzzy classifiers, especially
those with leaf node classifiers, is ideally suited to mid- and high-level
fusion efforts. Further, Bayesian networks and inference also have direct
application, offering real advantages over the more limited Bayesian classifiers
used by others. Our expertise in using Bayesian networks, especially those
with continuous nodes, should also be beneficial. With our recent advances
in methodology for isolating local effects relevant to the analyte, including
methods developed in a parallel project, the future of fusion of chemical
data appears bright.
References
1. Eckschlager, K.; Danzer, K. Information Theory in Analytical Chemistry,
John Wiley (Pfeiffer), 1994.
2. Fitzgerald, J.; Winefordner, J.D. Rev. Anal. Chem. 1975, 2(4), 299.
3. Hirschfeld, T. Anal. Chem. 1980, 52(2), 297A.
4. Fleming, C.; Kowalski, B.R. In Encyclopedia of Analytical Chemistry.
Meyers, R.A., Ed. v. 11, Wiley and Sons: Chichester, UK, 9736-64 (2000).
5. Roussel, S.; Bellon-Maurel, V.; Roger, J.-M.; Grenier, P. Chemom.
Intell. Lab. Syst. 2003, 65, 209.
6. Caraux, G.; Lechevallier, Y. Artif. Intell. Rev. 1996, 10, 219.
7. Saporta, G. Comp. Stat. Data Analysis 2002, 38, 465.
Improving the Robustness and Transfer of Multivariate Calibrations
A significant limitation to the use of multivariate regression in
NIR spectrometry and elsewhere is the difficulty with which a predictive multivariate
calibration model developed from spectra under one set of environmental and
instrumental conditions is used on similar spectra obtained under different
conditions, a process known as “standardization” or “calibration transfer”
in the literature of multivariate calibration [1,2]. We have recently reviewed
the literature of multivariate calibration transfer in an extensive, critical
review [3], covering, among others, approaches aimed at correcting for
the unintended instrumental variation incorporated in spectra and subsequently
embedded in the multivariate calibration model. These approaches span the
use of simple multivariate regression [4-6], orthogonal signal correction
(OSC) [7], genetic algorithms (GA) [8], Fourier and wavelet transforms
[9-11], neural networks (NN) [12], maximum likelihood principal component
analysis (MLPCA) [13], finite impulse response (FIR) filtering [14,15], and
positive matrix factorization (PMF) [16]. Our FIR method [14,15] remains
the only published method not requiring a subset of standards to be measured
on main and remote instruments. Our recent work has given a new, robust FIR
method free of the transfer artifacts previously seen [17].
Making calibrations more robust generally involves identification
and downweighting or removal of unwanted variation. Multivariate calibrations
based on the commonly used soft modeling methods are inherently coupled
to the instrument and the environment used during the calibration step.
Thus, changing instruments or placing the new samples in a slightly different
environment (e.g., measuring at a different temperature) can lead to errors
from the predictive model. For this reason, it is common for process chemists
to consider simpler modeling with one or more selected variables to monitor
the process. There are many situations where selection of one or two variables
from the many available will not permit useful modeling, however. Empirical
selection of variables in NIR data is especially difficult because of
the high degree of overlap and breadth of the spectral bands – the main
reason for the use and the popularity of soft modeling approaches in this
spectroscopic region.
1. Effective Removal of Unrelated Information in the Spectral Data
A second difficulty arises in those calibrations where, over time,
changes in the raw materials used in the process or in the process itself
lead to slight differences in the distribution of products generated. This
change degrades the calibration because the variation captured and modeled
in the calibration step now differs from that seen in the spectra coming
from the process. Unlike transfer, however, the failure of calibrations
in this case comes not from a change in the instrument or the environment
of the measurement but from the very mixture subject to analysis. The
resulting spectral variation need not be low frequency, as the differences
in spectra can be caused by differences in the composition of the process
stream. Like transfer, the usual solution
is a complete (and expensive) recalibration of the measurement to account
for the new, interfering components. Alternatives to recalibration in this
case have involved the automated selection of variables for prediction based
on a calculation of real-time, weighted spectral residuals calculated point
by point [18].
Some of the rationale behind the effort to make calibrations more
robust has been laid out in a recent article [19] where approaches based
on orthogonal signal correction (OSC) and wavelet multi-resolution are briefly
discussed. The aim of OSC preprocessing of data is to remove unwanted instrumental
and environmental effects through removal of all “non-relevant” components
of the spectral data (that is, signal uncorrelated to the property) for
a given response vector [20-25]. As discussed in recent papers regarding the
validity of OSC [26-28], however, the various OSC algorithms published by
Wold and others are actually based on the same mathematical fundamentals
as PLS regression, and no real improvement in model performance is gained
using OSC preprocessing. Wavelet preprocessing is also increasingly available
in commercial software, but like OSC, there is no easy route to improved
model performance by routine application of a wavelet transform. Like any
variable selection scheme, multi-scale preprocessing with wavelets must
be done carefully to avoid information leaks from the loss of useful analyte
signal. The loss of information from careless application of variable selection
often decreases rather than improves predictive performance in a calibration
[29] or a classification.
Our current work on preprocessing of calibrations focuses on the
key issues discussed above.
2. Multiscale nature of spectral signals / multiscale calibration models
The spectral vectors in a calibration can be viewed in a multiscale
way, since in practice they contain contributions in different frequency
bands and at different wavelengths from a variety of sources, such as detector
noise, instrumental disturbances and differences, temperature effects,
and sample variations. As demonstrated in Figure 1 below, the spectral
vectors, specifically NIR spectra decomposed by the wavelet prism (WP)
[29,30], can be interpreted as the summations of the frequency band components
(scales) over the wavelet frequency domain. The spectral variations related
to the properties of interest appear mostly in the middle range of frequencies.
Note that a high scale corresponds to a low-frequency band. The multi-scale
representation of spectral signals over the wavelet frequency domain provides
a way to improve calibration models by isolating the irrelevant variations
in a set of spectra [29]: e.g., by removing the low-frequency, non-constant
background component and any high-frequency noise signals. In our recent
work, we have also studied the ability of the information entropy to distinguish
frequency differences between “background” and analyte in multi-resolution
background removal, and found very good performance with synthetic and real
data sets [29] in multivariate calibration and in classification.
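The idea of a spectrum as a sum of frequency band components can be illustrated with a small decomposition. The sketch below uses the Haar wavelet, implemented through block averages, purely because it is easy to write without external libraries; the wavelet actually used in the wavelet prism is not stated here, and the function names and toy spectrum are hypothetical. The signal length is assumed to be a multiple of 2 raised to the number of levels.

```python
def block_means(signal, width):
    """Replace each consecutive block of `width` points by its mean."""
    out = []
    for start in range(0, len(signal), width):
        block = signal[start:start + width]
        out.extend([sum(block) / len(block)] * len(block))
    return out

def haar_prism(signal, levels):
    """Split a signal into `levels` detail components plus one smooth
    component, each as long as the signal, that add back to the signal.

    Component 1 holds the highest frequencies; the last component
    (highest scale) holds the low-frequency background.
    """
    components = []
    current = list(signal)
    for j in range(1, levels + 1):
        smooth = block_means(signal, 2 ** j)   # Haar approximation at scale j
        components.append([c - s for c, s in zip(current, smooth)])
        current = smooth
    components.append(current)
    return components

spectrum = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0, 8.0, 4.0]
parts = haar_prism(spectrum, levels=2)

# The components sum point by point to the original spectrum.
rebuilt = [sum(values) for values in zip(*parts)]

# Dropping the last (low-frequency) component removes a stepwise background.
background_removed = [sum(values) for values in zip(*parts[:-1])]
```

Because every component has the length of the original spectrum, a component can be removed or reweighted and the remainder still read as a spectrum in the wavelength domain.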
The novel use of wavelets has many applications relevant to this project.
One comes in multiscale calibration. Significant progress in the last year
was achieved by our use of the wavelet transform to consider the multiscale
nature of spectral data during calibration modeling [30,31]. Direct calibration
by partial least squares (PLS) or principal component regression (PCR)
explains spectral variance by using latent variables (or principal components)
in a single scale (the raw wavelength) representation. As a result, more
latent variables often must be used to explain local sources of variance,
leading to unnecessarily complicated, and often overfit, calibration
models. The alternative approach, selecting wavelengths or bands
prior to modeling with PLS, often leaks information because a small amount
of critical analyte signal is embedded in the regions that are discarded.
Our new method, dual-domain regression analysis (DDRA), is aimed at
dealing with these issues to reduce model complexity and to improve the
life and performance of calibration models. It is a two-step procedure,
conducted in a way similar to calibration using any of the common regression
methods. The first step is to establish a multiscale model in a calibration
set between the dependent vector y (the property) and the independent multiscale
spectral tensor X = {X_k, k = 1, 2, …, l+1} generated from the wavelet prism
discussed above. The second step is to predict values for the dependent
properties based on a test set tensor X_u = {X'(1,u), …, X'(l+1,u)}'. The
goal of our multiscale regression analysis is to calculate a vector of
multiscale regression coefficients b with the lowest prediction error. An
approximate solution to the equation implied in Figure 2 allows for a simple
algorithm with very good performance. For example, separate PLS models, created
for each frequency component of the multiscale spectra, are first determined,
and the regression vectors in the multi-scale PLS regression matrix are then
weighted according to a cross-validation of the separate PLS models on their
corresponding frequency components. The weights are based on s_k, the
reciprocal of the cross-validation error from the regression of the
wavelet coefficients describing the kth wavelet scale in the calibration
set on the property. In this way, the property-wavelet coefficient relationships
at all frequencies contribute as appropriate to an overall time-domain model
for property. We published one paper on this new method (it was highlighted
by Chemometrics World earlier this year) and have a second paper now in press
in Analytica Chimica Acta. Results for the second paper were presented at
the Eighth Chemometrics in Analytical Chemistry Conference as a plenary lecture.
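The weighting step of the approximate algorithm can be sketched numerically. The text defines the per-scale quantity as the reciprocal of the cross-validation error but does not give the full weighting relation, so the normalization of the weights to sum to one is an assumption of this sketch, as are the function names, the regression vectors, and the error values. The per-scale regression vectors are taken as given; fitting them by PLS is outside the sketch.

```python
def scale_weights(cv_errors):
    """Weight each scale by s_k = 1 / (cross-validation error of scale k).

    Normalizing the s_k to sum to 1 is an assumption, not the source's relation.
    """
    s = [1.0 / e for e in cv_errors]
    total = sum(s)
    return [v / total for v in s]

def predict(components, coefficients, weights):
    """Weighted sum of the per-scale predictions x_k . b_k for one sample."""
    return sum(
        w * sum(x * b for x, b in zip(xk, bk))
        for xk, bk, w in zip(components, coefficients, weights)
    )

# Hypothetical cross-validation errors for three scales of one calibration.
cv_errors = [0.5, 0.1, 0.25]
weights = scale_weights(cv_errors)

# Hypothetical scale components of one test spectrum (two wavelengths each)
# and the regression vector fitted separately on each scale.
components = [[0.1, -0.1], [1.0, 2.0], [3.0, 3.0]]
coefficients = [[1.0, 1.0], [0.5, 0.25], [0.2, 0.1]]

y_hat = predict(components, coefficients, weights)
```

Scales whose separate models cross-validate poorly, such as a noise-dominated high-frequency scale, receive small weights and contribute little to the final prediction.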
Our results on synthetic data sets, public-domain data and on other
data suggest that significant improvement in both model size and in prediction
errors can often be achieved by application of this simple modeling method
as compared to conventional PLS with or without conventional wavelet processing
with the discrete wavelet transform.
References
1. deNoord, O.E. Chemom. Intell. Lab. Syst. 1994, 25, 85.
2. Bouveresse, E. and Massart, D.L. Vib. Spectrosc. 1996, 11, 3.
3. Feudale, R.N.; Woody, N.A.; Myles, A.J.; Tan, H.-W.; Brown, S.D.;
Ferré, J. Chemom. Intell. Lab. Syst. 2002, 64, 181.
4. Wang, Y.D.; Veltkamp, D.J. and Kowalski, B.R. Anal. Chem. 1991,
63, 2750.
5. Wang, Y.D.; Lysaght, M.J. and Kowalski, B.R. Anal. Chem. 1992,
64, 562.
6. Bouveresse, E. and Massart, D.L. Chemom. Intell. Lab. Syst. 1996,
32, 201.
7. Sjöblom, J.; Svensson, O.; Josefson, M.; Kullberg, H. and
Wold, S. Chemom. Intell. Lab. Syst. 1998, 44, 229.
8. Ozdemir, D.; Mosley, M.; Williams, R. Appl. Spectrosc. 1998, 52,
599.
9. Chen, C.S.; Brown, C.; Lo, S.C. Appl. Spectrosc. 1997, 51, 744.
10. Walczak, B.; Massart, D.L. Chemom. Intell. Lab. Syst. 1997,
36, 81.
11. Tan, H.W.; Brown, S.D. J. Chemom. 2001, 15, 647.
12. Goodacre, R.; Timmins, E.M.; Jones, A.; Kell, D.B.; Maddock, J.;
Heginbothom, M.L.; Magee, J.T. Anal. Chim. Acta, 1997, 348, 511.
13. Andrews, D.T.; Wentzell, P.D. Anal. Chim. Acta, 1997, 350, 341.
14. Blank, T.B.; Sum, S.T.; Brown, S.D. and Monfre, S.L. Anal. Chem.
1996, 68, 2987.
15. Sum, S.T. and Brown, S.D. Appl. Spectrosc. 1998, 52, 869.
16. Xie, Y.L.; Hopke, P.K. Anal. Chim. Acta, 1999, 384, 193.
17. Tan, H.W.; Sum, S.T.; Brown, S.D. Appl. Spectrosc. 2002, 56, 1098.
18. Gemperline, P.J.; Cho, J.; Aldridge, P.K.; Sekulic, S.S. Anal.
Chem. 1996, 68, 2913.
19. Brown, S.D.; Tan, H.-W.; Feudale, R. ACS Sym. Ser. in press, (2004).
20. Wold, S.; Antti, H.; Lindgren, F. and Öhman, J. Chemom. Intell.
Lab. Syst. 1998, 44, 175.
21. Andersson, C.A. Chemom. Intell. Lab. Syst. 1999, 47, 51.
22. Wise, B.M.; Gallagher, N.B. http://www.eigenvector.com/MATLAB/OSC.html.
23. Fearn, T. Chemom. Intell. Lab. Syst. 2000, 50, 47.
24. Wold, S.; Trygg, J.; Berglund, A.; Antti, H. Chemom. Intell. Lab.
Syst. 2001, 58, 131.
25. Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119.
26. Fernández Pierna, J.A.; Massart, D.L.; de Noord, O.E.;
Ricoux, Ph. Chemom. Intell. Lab. Syst. 2001, 55, 101.
27. Westerhuis, J.A.; de Jong, S.; Smilde, A.K. Chemom. Intell. Lab.
Syst. 2001, 56, 13.
28. Svensson, O.; Kourti, T.; MacGregor, J.F. J. Chemom. 2002, 16,
176.
29. Tan, H.W.; Brown, S.D. J. Chemom. 2003, 17, 111.
30. Tan, H.W.; Brown, S.D. J. Chemom. 2002, 16, 228.
31. Tan, H.W.; Brown, S.D. Anal. Chim. Acta, 2003, 490, 291.