Analysis of MALDI-TOF Data: from
Data Preprocessing to Model
Validation for Survival Outcome
Heidi Chen, Ph.D.
Cancer Biostatistics Center
Vanderbilt University School of Medicine
March 20, 2009

Outline
• MALDI-TOF
• Data preprocessing for raw spectra
• Build a prediction model from training set
• Model validation

Biology 101
Genomics
Genetics Proteomics

MALDI-TOF
Step 1: Sample preparation

Principal Idea of MALDI-TOF MS
• Upon laser irradiation, all molecules obtain similar energy
• Electric energy is converted to kinetic energy
• Time of flight (TOF) separates ions based on size (mass/charge, m/z):
  $\mathrm{TOF} \propto (m/z)^{1/2}$
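The square-root relation follows from equating the electric energy given to an ion with its kinetic energy. The sketch below is only an idealized illustration: the function name, the 1 m flight tube, and the 20 kV accelerating voltage are assumptions, not instrument settings from this study.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mz, tube_length_m=1.0, voltage_v=20000.0):
    """Ideal flight time (seconds) for an ion with the given m/z in Da per charge.

    Assumes z*e*U = (1/2)*m*v^2 and a field-free drift tube (hypothetical values).
    """
    mass_per_charge = mz * DALTON / ELEMENTARY_CHARGE  # kg per coulomb
    velocity = math.sqrt(2.0 * voltage_v / mass_per_charge)
    return tube_length_m / velocity


# Quadrupling m/z doubles the flight time under these assumptions.
ratio = flight_time(8000.0) / flight_time(2000.0)
```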

Raw Spectra
[Figure: raw spectrum; x-axis: Mass (m/z), 0 to 2.5 × 10^4; y-axis: Intensity, 0 to 8000]

• MALDI-TOF
• Data preprocessing for raw spectra
• Build a prediction model from training set
• Model validation

[Figure: unspiked spectra before calibration]
• Baseline correction
• Denoise
• Normalization
• Calibration
• Feature detection

Peak Calibration
[Figure: example spectrum, m/z 0 to 2.5 × 10^4]
Peaks used for calibration:
(1) m/z values around some known proteins
(2) show a clear bell shape

Convolution Based Calibration
• Calibrate each spectrum with the known peaks. Max h(t) happens when f and g overlap the most.
• The optimum shift is obtained by maximizing the sum of convolution values over the multiple peak locations.
Note: all processing is done in the time domain.
$$h(t) = (f * g)(t) \equiv \int_{t_1}^{t_2} f(\tau)\, g(t - \tau)\, d\tau = \int_{t_1}^{t_2} f(t - \tau)\, g(\tau)\, d\tau$$
f(t) : observed peak
g(t) : ideal peak (normal distribution)
shift to right : $t_{\text{original}} < t_{\max}$
shift to left : $t_{\text{original}} > t_{\max}$
keep the same : $t_{\text{original}} = t_{\max}$
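A minimal sketch of the shift search, assuming a spectrum sampled on integer time points and Gaussian ideal peaks. Because the Gaussian is symmetric, the sliding overlap computed here matches the convolution value at each candidate shift. The names `overlap` and `best_shift`, the peak width, and the search range are illustrative assumptions.

```python
import math


def gaussian(t, center, sigma):
    """Ideal bell-shaped peak g(t) centered at a known location."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)


def overlap(observed, known_centers, sigma, shift):
    """Sum, over all known peaks, of the overlap between the shifted
    observed spectrum and the ideal peak at that location."""
    total = 0.0
    for center in known_centers:
        for t, value in enumerate(observed):
            total += value * gaussian(t + shift, center, sigma)
    return total


def best_shift(observed, known_centers, sigma=3.0, max_shift=10):
    """Shift (in time samples) that maximizes the summed overlap."""
    shifts = range(-max_shift, max_shift + 1)
    return max(shifts, key=lambda s: overlap(observed, known_centers, sigma, s))


# Toy spectrum whose peaks sit 3 samples to the right of the known locations.
known = [50, 120]
spectrum = [gaussian(t, 53, 3.0) + gaussian(t, 123, 3.0) for t in range(200)]
shift = best_shift(spectrum, known)  # negative value means "shift to left"
```

One shift is chosen per spectrum from all known peaks jointly, which is why the overlaps are summed before taking the maximum.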

Calibration
[Figure: spectra before calibration and after calibration]

Wavelet Denoising
Wavelet decomposition: S = A + D
A : low-frequency part (approximation)
D : high-frequency part (detail)

Wavelet Decomposition Tree
S
A1 D1
A2 D2
A3 D3
S = A1 + D1
= A2 + D2 + D1
= A3 + D3 + D2 + D1

Wavelet Denoise
Remove noise by thresholding the detail components D1, D2, D3, D4, D5
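To make the decompose, threshold, and reconstruct cycle concrete, here is a small sketch using the Haar wavelet and soft thresholding. Both choices are assumptions for illustration; the slides do not state which wavelet or thresholding rule was used. The signal length must be divisible by 2 raised to the number of levels.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(signal, levels):
    """Return (approximation, [D1, D2, ...]) for a Haar wavelet decomposition."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(x - y) / SQRT2 for x, y in pairs])
        approx = [(x + y) / SQRT2 for x, y in pairs]
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose, rebuilding from the coarsest level down."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.append((a + d) / SQRT2)
            rebuilt.append((a - d) / SQRT2)
        current = rebuilt
    return current


def soft_threshold(values, threshold):
    """Shrink coefficients toward zero; small ones become exactly zero."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def denoise(signal, levels, threshold):
    approx, details = haar_decompose(signal, levels)
    details = [soft_threshold(d, threshold) for d in details]
    return haar_reconstruct(approx, details)


noisy = [10.1, 9.9] * 4  # flat signal with small alternating noise
smooth = denoise(noisy, levels=2, threshold=0.5)
```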

Baseline Correction
Fit a spline curve to the local minima and subtract it as the baseline
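The idea can be sketched as follows. For brevity this example swaps in piecewise-linear interpolation between the local minima (a first-order spline) in place of a smoother spline, so it is an approximation of the step and not the method used in the study. The function names are hypothetical.

```python
def local_minima(y):
    """Indices of local minima, always including both end points as anchors."""
    idx = [0]
    for i in range(1, len(y) - 1):
        if y[i] <= y[i - 1] and y[i] <= y[i + 1]:
            idx.append(i)
    idx.append(len(y) - 1)
    return idx


def baseline_correct(y):
    """Subtract a baseline interpolated linearly through the local minima."""
    knots = local_minima(y)
    baseline = []
    for left, right in zip(knots, knots[1:]):
        for i in range(left, right):
            frac = (i - left) / (right - left)
            baseline.append(y[left] + frac * (y[right] - y[left]))
    baseline.append(y[knots[-1]])
    return [value - base for value, base in zip(y, baseline)]


# A drifting baseline (1, 2, ..., 7) with one peak of height 10 at index 3.
raw = [1, 2, 3, 14, 5, 6, 7]
corrected = baseline_correct(raw)
```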

Peak Detection
[Figure: spectrum, m/z 3500 to 6500, with detected peaks]
(1) local maxima
(2) pass signal/noise cutoff to filter out small peaks
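A sketch of the two rules above. The noise level is estimated here by the median absolute deviation of the spectrum, which is an assumption; the slides do not say how noise was estimated. `detect_peaks` is a hypothetical name.

```python
import statistics


def detect_peaks(intensity, snr_cutoff):
    """Indices of local maxima whose height above the median intensity
    is at least snr_cutoff times the estimated noise level."""
    center = statistics.median(intensity)
    # Assumed noise estimate: median absolute deviation from the median.
    noise = statistics.median([abs(v - center) for v in intensity])
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_local_max = (intensity[i] > intensity[i - 1]
                        and intensity[i] >= intensity[i + 1])
        if is_local_max and intensity[i] - center >= snr_cutoff * noise:
            peaks.append(i)
    return peaks


spectrum = [10, 12, 9, 11, 10, 60, 10, 13, 9, 11, 10]
strict = detect_peaks(spectrum, snr_cutoff=5)
lenient = detect_peaks(spectrum, snr_cutoff=3)
```

Raising the cutoff keeps only the dominant peak, while lowering it lets smaller local maxima through.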

calibration

Peak Distribution

Common Peaks Finding
[Figure: Kernel Density of Peaks Distribution; x-axis: Mass (m/z), 5200 to 6400; y-axis: Number of Spectra]
Peak location : local maximum
Boundaries : adjacent local minima
Filter : > 5% of spectra show peaks
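The three rules can be sketched as below, assuming a Gaussian kernel evaluated on a fixed m/z grid. The kernel choice, the grid, the bandwidth value, and the function names are illustrative assumptions.

```python
import math


def density(points, grid, bandwidth):
    """Unnormalized Gaussian kernel density of peak locations on a grid."""
    return [sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
            for g in grid]


def common_peaks(peaks_per_spectrum, grid, bandwidth, min_fraction=0.05):
    """Return (location, lower boundary, upper boundary) for each common peak."""
    points = [p for peaks in peaks_per_spectrum for p in peaks]
    dens = density(points, grid, bandwidth)
    last = len(grid) - 1
    minima = [0] + [i for i in range(1, last)
                    if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]] + [last]
    features = []
    for i in range(1, last):
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]:  # local maximum
            left = max(m for m in minima if m < i)   # adjacent local minimum
            right = min(m for m in minima if m > i)  # adjacent local minimum
            lo, hi = grid[left], grid[right]
            hits = sum(any(lo <= p <= hi for p in peaks)
                       for peaks in peaks_per_spectrum)
            if hits / len(peaks_per_spectrum) > min_fraction:
                features.append((grid[i], lo, hi))
    return features


# 25 toy spectra with a peak near m/z 100; only one also has a peak at 150.
spectra = [[99 + i % 3] for i in range(25)]
spectra[0].append(150)
features = common_peaks(spectra, list(range(80, 171)), bandwidth=2.0)
```

The peak at 150 appears in 1 of 25 spectra (4%), so the 5% filter drops it.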

Feature Quantification
• Normalization: standardize the AUC (area under the curve) of every spectrum to the median AUC
• Peak intensity: AUC within the peak boundaries
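A short sketch of both quantities, assuming trapezoidal integration on a shared m/z axis. The optional `target` argument is an added convenience for reusing a previously computed median AUC; all names are hypothetical.

```python
import statistics


def auc(mz, intensity):
    """Area under the curve by the trapezoidal rule."""
    return sum((mz[i + 1] - mz[i]) * (intensity[i] + intensity[i + 1]) / 2.0
               for i in range(len(mz) - 1))


def normalize_to_median_auc(mz, spectra, target=None):
    """Rescale every spectrum so that its AUC equals the median AUC."""
    areas = [auc(mz, s) for s in spectra]
    if target is None:
        target = statistics.median(areas)
    scaled = [[v * target / a for v in s] for s, a in zip(spectra, areas)]
    return scaled, target


def peak_intensity(mz, intensity, lower, upper):
    """AUC restricted to the points inside the peak boundaries."""
    idx = [i for i, m in enumerate(mz) if lower <= m <= upper]
    return auc([mz[i] for i in idx], [intensity[i] for i in idx])


mz = [0, 1, 2, 3]
spectra = [[1, 1, 1, 1], [2, 2, 2, 2], [4, 4, 4, 4]]
normalized, median_auc = normalize_to_median_auc(mz, spectra)
feature = peak_intensity(mz, normalized[0], 1, 2)
```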

Threshold Selection
• Denoise: wavelet threshold
• Peak detection: signal to noise ratio
• Common peak finding: bandwidth of KDS
[Figure: noise/signal ratio vs. m/z (in 1000 daltons)]
[Figure: example spectrum, m/z 5500 to 8000]

Training Set
• 35 pretreatment (EGFR+VEGF) serum samples from stage IIIB/IV NSCLC; 14 male, 21 female; age range 36-72
• Experimental Design
               Day 1             Day 2             Day 3
Blood sample   pt1-pt35          pt1-pt35          pt1-pt35
Replication    3 replications    3 replications    3 replications
               each pt           each pt           each pt
105 samples were randomly spotted in two 64-well plates each day.

Training Set
290 good spectra; 25 bad spectra
174 features (m/z 3000-20,000) after data preprocessing

Variance Components

• MALDI-TOF
• Data preprocessing for raw spectra
• Build a survival prediction model from training set
• Model validation

Procedure of Constructing Survival
Prediction Model from Training Set
(1) Feature selection
– CPH (Cox proportional hazards) model: $h_{i,j}(t \mid x_{j,i}) = h_{j,0}(t) \exp(\beta_j x_{j,i})$
  $x_{j,i}$ : intensity for feature j of patient i
– FDR cutoff 0.05 : 11 features associated with survival
(2) Create a compound score as a prediction index:
  $s_i = \sum_{j=1}^{k} (\text{sign of } \beta_j)\, w_j\, x_{j,i}$
  $w_j$ : Wald statistics
(3) Prediction model : predicted hazard rate
  $h_i(t \mid s_i) = h_0(t) \exp(0.0235\, s_i)$
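The compound score is a signed, Wald-weighted sum of feature intensities. The sketch below works through it with made-up numbers; the intensities, coefficients, and Wald statistics are hypothetical, and only the coefficient 0.0235 comes from the slide.

```python
import math


def compound_score(intensities, betas, wald):
    """s_i = sum_j sign(beta_j) * w_j * x_{j,i} for one patient."""
    return sum(math.copysign(1.0, b) * w * x
               for x, b, w in zip(intensities, betas, wald))


def relative_hazard(score, coefficient=0.0235):
    """exp(coefficient * s_i): the hazard relative to the baseline h_0(t)."""
    return math.exp(coefficient * score)


# Hypothetical patient with two selected features.
x = [10.0, 4.0]       # feature intensities
beta = [0.5, -0.2]    # univariate Cox coefficients (only the sign is used)
w = [3.0, 2.0]        # Wald statistics
s = compound_score(x, beta, w)  # (+1)(3)(10) + (-1)(2)(4)
hazard = relative_hazard(s)
```

Features with a negative coefficient lower the score, so a higher score always points toward higher predicted hazard.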

• MALDI-TOF
• Data preprocessing for raw spectra
• Build a prediction model from training set
• Model validation

Overfitting
[Figure: model fit on the training set vs. the independent test set]

Model Validation
Goal : evaluate how the model will
perform on future datasets
• External validation: independent test
set
• Internal validation: training set

External Validation
Data Preprocessing for Independent Test
Set
• Calibration
• Baseline correction
• Wavelet denoise
• Normalization : median AUC from training
set
• Peak location and boundaries from
training set

Prediction for Independent Test Set
Compound score as a predictor:
$s'_i = \sum_{j=1}^{k} (\text{sign of } \beta_j)\, w_j\, x'_{j,i}$
Freeze $w_1, w_2, \ldots, w_k$ and $\phi$ from training set
CPH : predicted hazard rate
$h'_i(t) = h_0(t) \exp(0.0235\, s'_i)$
Association with survival outcome.

Validation of Predictive Model
• What to validate ?
– Predictive ability
• C-index (Harrell et al. 1982): measures the agreement between predicted and observed survival times for a pair of subjects.
C-index ranges from 0 to 1.
1: perfect prediction; 0.5: random prediction; 0: opposite prediction
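A minimal sketch of the pairwise calculation, assuming right-censored data in which a pair is usable only when the subject with the shorter time had an observed event. Counting tied risk scores as half-concordant is a common convention and an assumption here. `c_index` is a hypothetical name.

```python
def c_index(times, events, risk):
    """Fraction of usable pairs in which the higher predicted risk
    belongs to the subject with the shorter observed survival time."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: subject i had an event before subject j's time.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
perfect = c_index(times, events, [4, 3, 2, 1])
random_like = c_index(times, events, [1, 1, 1, 1])
opposite = c_index(times, events, [1, 2, 3, 4])
```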

Independent Test Sets
• Eastern Cooperative Oncology Group
(ECOG) (n=82)
• Italian (n=66)

KM Survival
[Figure: Kaplan-Meier survival curves by group: <121, [121, 127), [127, 143), [143, 173), >173; panel labels: ECOG (n=82), (n=61), (n=35), (n=66)]
Cox model : p < 0.001

C-index
Training set 0.77
Test set 0.62 (Italian)
Training C-index : the C-index on the training set
Generalized C-index : the C-index on the independent test set

Internal Validation
Goal : estimate the generalized C-index
through the internal validation
process
• Data splitting
• K-fold cross validation
• Bootstrap
Focus on the procedure after data preprocessing

Data Splitting
[Figure: data split into Training and Test parts]
(1) Build a prediction model based on the training set
(2) Compound score for the test set: winners (selected features) and Wald statistics from the training set
(3) Generalized C-index: calculate the C-index from the test set

K-fold Cross Validation
• Sample of size n divided into K = 5 parts
• Each part serves in turn as the test set, with the model built on the remaining parts
• C-index for each test set
• Combine the C-indexes from all test sets to estimate the generalized C-index
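The procedure can be sketched as a generic loop. Here `fit` and `score` are placeholder callables standing in for model building and C-index calculation, and combining by a simple average is an assumption, since the slide does not specify how the per-fold values are combined.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k nearly equal parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_score(n, k, fit, score, seed=0):
    """Average of the test-fold scores; each fold is held out exactly once."""
    scores = []
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        model = fit(train)                 # build the model on K-1 parts
        scores.append(score(model, test))  # evaluate on the held-out part
    return sum(scores) / k


folds = k_fold_indices(35, 5)
# Dummy callables that only count samples, to show the data flow.
average = cross_validated_score(
    35, 5,
    fit=lambda train: len(train),
    score=lambda model, test: model + len(test),
)
```

In a real run, feature selection and Wald weights would be recomputed inside `fit` for every fold so that the held-out part never influences the model.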

Bootstrap
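[Figure: bootstrap resampling scheme]
• Original training set (n)
• Bootstrap training samples (n): $B_1, B_2, \ldots, B_m$, drawn by sampling with replacement
• Bootstrap test sample: observations not in the bootstrap training sample ($B_1^-, B_2^-, \ldots, B_m^-$)
• A C-index is computed for each bootstrap training sample and each bootstrap test sample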
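One bootstrap replicate can be sketched as follows, assuming patients are indexed 0 to n-1. The function name and the seed are illustrative.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; the indices never
    drawn form the bootstrap test sample."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


rng = random.Random(1)
train, test = bootstrap_split(35, rng)
```

Repeating this m times gives the m training and test pairs on which the C-indexes are computed.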

Bootstrap
• Generalized C-index
$$C(0.632+) = (1 - W)\, C_{\text{training}} + W \cdot \frac{1}{m} \sum_{i=1}^{m} C_{\text{test}}^{T_i}$$
$$W = \frac{0.632}{1 - 0.368\, R}, \quad W \text{ (weight)} \in [0.632, 1]$$
– R (relative overfitting rate), $R \in [0, 1]$:
$$R = \frac{C_{\text{training}} - \frac{1}{m} \sum_{i=1}^{m} C_{\text{test}}^{T_i}}{C_{\text{training}} - \gamma}$$
– γ (non-informative C-index) = 0.5
C-index(0.632+) ranges from C-index(0.632) if there is minimum overfitting (R=0) to $\frac{1}{m} \sum_{i=1}^{m} C_{\text{test}}^{T_i}$ if there is maximum overfitting (R=1).
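A worked sketch of the 0.632+ combination. Clamping R to [0, 1] and returning R = 0 when the training C-index does not exceed γ are assumptions added to keep the example well defined; the input values are made up.

```python
def c_index_632_plus(c_training, c_tests, gamma=0.5):
    """Combine the training C-index with the mean bootstrap test C-index."""
    mean_test = sum(c_tests) / len(c_tests)
    if c_training > gamma:
        r = (c_training - mean_test) / (c_training - gamma)
        r = min(max(r, 0.0), 1.0)  # assumed clamp to [0, 1]
    else:
        r = 0.0                    # assumed fallback when there is no signal
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * c_training + w * mean_test


no_overfit = c_index_632_plus(0.80, [0.80, 0.80])    # R = 0, W = 0.632
max_overfit = c_index_632_plus(0.77, [0.50, 0.50])   # R = 1, W = 1
in_between = c_index_632_plus(0.77, [0.60, 0.70])    # 0 < R < 1
```

With R = 1 the weight W reaches 1, so the estimate falls back entirely on the bootstrap test C-index.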

C-index
Method           NSCLC (n=35)     Overfit Example (n=77)
Training set     0.77             0.69
Bootstrap        0.71             0.53
Indep test set   0.62 (Italian)   0.28

Reproducibility of MALDI-TOF MS
Winners of Case Study (11), EGFR+VEGF:
4121, 4596, 4720, 4821, 5720, 5841, 11441, 11528, 11684, 11731, 11902
Winners of JNCI 2007 (8), EGFR:
5843, 11445, 11529, 11685, 11759, 11903, 12452, 12579

Take Home Message
• Good experimental design
• Precisely follow the MALDI-TOF protocol
• MALDI-TOF can detect the true signal

Acknowledgements
Stuart Salmon Dean Billheimer
Shuo Chen Yu Shyr
Roy Herbst Ju-Whei Lee
Anne Tsao Pierre Massion
Hai Tran Julie Brahmer
Alan Sandler Joan Schiller
David Carbone Thao P. Dang
