Discriminant Analysis of Metabolomics Data
1.1 Aims
Metabolomics is a post-genomic technology that seeks to provide a comprehensive profile of all metabolites present in a biological sample, with the aim of providing information about the organism or tissue under investigation. Characteristic features of the 1H NMR spectra can be used to classify and compare samples. The identification and use of such traits is known as metabolic fingerprinting. However, the samples often contain thousands of metabolites, making the NMR spectra extremely complex and difficult to interpret.
Current methods for preprocessing 1H NMR metabolomics data (data reduction and shift correction) add intra-class variation and lead to a loss of interpretability. Popular methods for classification of metabolomics data include PCA-LDA and PLS-LDA. Although these methods provide good classification rates, their results are often difficult to interpret because of the inherent data transformation used. The aim of this project was to develop new analytical techniques with improved classification results and the ability to identify the metabolites responsible for the separation.
1.2 Methods
Standard binning methods divide the selected spectral range into regions of a fixed size (usually 0.04 ppm). This approach often splits a peak between bins or assigns the same peak to different bins across samples, which increases intra-class variation. The novel preprocessing method, adaptive binning, uses the undecimated wavelet transform at a predefined wavelet level to find all minima in the reference spectra at that level of approximation. This identifies the beginning and end of each peak (at this level of approximation). Each individual spectrum is then transformed by summing the intensities between consecutive minima, so that each bin spans an entire peak consistently across samples.
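The contrast between fixed-width and adaptive binning can be illustrated with a minimal Python sketch. It is not the implementation used in this work. Several details are assumptions made only for illustration: the undecimated transform is approximated with the "a trous" scheme and a B3-spline smoothing kernel, boundaries are handled by repeating the edge values, bins are expressed as index ranges rather than ppm, and the function names (fixed_bins, atrous_approximation, adaptive_bins, integrate, synthetic_spectrum) and the two-peak test spectrum are hypothetical.

```python
import math

# Assumed smoothing filter: B3-spline kernel, a common choice for the
# "a trous" undecimated wavelet transform (not specified in the text).
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)


def fixed_bins(n_points, width):
    """Standard binning: equal-width index ranges [start, stop)."""
    return [(s, min(s + width, n_points)) for s in range(0, n_points, width)]


def atrous_approximation(signal, level):
    """Approximation of `signal` after `level` undecimated smoothing passes.

    At pass j the kernel taps are spaced 2**j samples apart, so the signal
    keeps its original length. Indices are clamped at the edges (assumption).
    """
    approx = list(signal)
    n = len(approx)
    for j in range(level):
        step = 2 ** j
        approx = [
            sum(w * approx[min(max(i + (k - 2) * step, 0), n - 1)]
                for k, w in enumerate(KERNEL))
            for i in range(n)
        ]
    return approx


def local_minima(values):
    """Interior indices lower than the left neighbour and not above the right."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]


def adaptive_bins(reference, level):
    """Bin edges placed at the minima of the smoothed reference spectrum."""
    smooth = atrous_approximation(reference, level)
    edges = [0] + local_minima(smooth) + [len(reference)]
    return list(zip(edges[:-1], edges[1:]))


def integrate(spectrum, bins):
    """Sum the intensities of one spectrum inside each bin."""
    return [sum(spectrum[a:b]) for a, b in bins]


def synthetic_spectrum(n_points, centers, sigma):
    """Hypothetical test data: a sum of Gaussian peaks on an index axis."""
    return [sum(math.exp(-((i - c) / sigma) ** 2 / 2) for c in centers)
            for i in range(n_points)]


# Two peaks centred exactly on fixed-bin edges (width 8), so the standard
# scheme splits each peak while the adaptive scheme keeps each one whole.
reference = synthetic_spectrum(64, centers=(24, 40), sigma=2.0)
bins = adaptive_bins(reference, level=2)
print("fixed-width areas:", integrate(reference, fixed_bins(64, 8)))
print("adaptive bins:", bins, "areas:", integrate(reference, bins))
```

In this sketch the bin edges are computed once from the reference and then applied to every sample with integrate, which mirrors the description above. A higher level smooths more and therefore merges neighbouring peaks into wider bins, so the level would need tuning to the spectral resolution.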
NMR spectra consist of a large number of variables; in these studies the number ranges from 1024 to 16384 data points. Having a search space this large with relatively few samples (