Analysis of MALDI-TOF Data: from
Data Preprocessing to Model
Validation for Survival Outcome
Heidi Chen, Ph.D.
Cancer Biostatistics Center
Vanderbilt University School of Medicine
March 20, 2009
Outline
• MALDI-TOF
• Data preprocessing for raw
spectra
• Build a prediction model from
training set
• Model validation
Biology 101
Genomics
Genetics Proteomics
MALDI-TOF
Step 1: Sample preparation
Basic Principle of MALDI-TOF MS
• Upon laser irradiation, all molecules obtain
similar energy
• Electric energy is converted to kinetic energy
• Time of flight (TOF) separates ions based on
size (mass/charge, m/z):
$\mathrm{TOF} \propto (m/z)^{1/2}$
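The square-root relation follows from equating the electric energy z·e·U with the kinetic energy ½·m·v². A minimal sketch of that energy balance is given here; the drift length and accelerating voltage are hypothetical illustration values, not instrument settings from this study.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # atomic mass unit, kilograms

def time_of_flight(mz, drift_length_m=1.0, voltage_v=20000.0):
    """Flight time in seconds for an ion with the given m/z (daltons per charge).

    Energy balance: z*e*U = 0.5*m*v**2, so v = sqrt(2*e*U / (m/z)) and t = L / v.
    drift_length_m and voltage_v are assumed values for illustration only.
    """
    mass_per_charge_kg = mz * DALTON
    velocity = math.sqrt(2.0 * E_CHARGE * voltage_v / mass_per_charge_kg)
    return drift_length_m / velocity
```

Quadrupling m/z doubles the flight time, which is why heavier ions arrive later at the detector.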
Raw Spectra
[Figure: raw spectrum; x-axis: Mass (m/z); y-axis: Intensity]
• MALDI-TOF
• Data preprocessing for raw
spectra
• Build a prediction model from
training set
• Model validation
[Figure: UnSpiked spectra before calibration]
Baseline correction
Denoise
Normalization
Calibration
Feature Detection
Peak Calibration
Peaks used for calibration:
(1) m/z values around some known proteins
(2) show a clear bell shape
Convolution Based Calibration
• Calibrate each spectrum with
the known peaks. Max h(t)
happens when f and g overlap
the most.
• The optimum shift is obtained
by maximizing the sum of convolution values on the
multiple peak locations.
Note: all processing is done in the
time domain.
$$h(t) = (f * g)(t) \equiv \int_{t_1}^{t_2} f(\tau)\, g(t-\tau)\, d\tau = \int_{t_1}^{t_2} f(t-\tau)\, g(\tau)\, d\tau$$
f(t) : observed peak
g(t) : ideal peak (normal distribution)
shift to right : $t_{\text{original}} < t_{\max}$
shift to left : $t_{\max} < t_{\text{original}}$
keep the same : $t_{\text{original}} = t_{\max}$
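The shift search can be sketched as follows: convolve the observed spectrum f with an ideal normal peak g, then pick the shift that maximizes the sum of h(t) over the known peak locations. This is only an illustration working on sample indices of the time axis; the function names, the peak width `sigma`, and the search range `max_shift` are assumptions, not the settings used in the study.

```python
import math

def gaussian(sigma, half_width):
    """Ideal peak g: a discretised normal curve that sums to 1."""
    vals = [math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-half_width, half_width + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def convolve_at(f, g, t):
    """h(t) = sum over tau of f(tau) * g(t - tau), with g centred on 0."""
    half = len(g) // 2
    total = 0.0
    for k in range(-half, half + 1):
        tau = t - k
        if 0 <= tau < len(f):
            total += f[tau] * g[k + half]
    return total

def optimal_shift(f, known_positions, sigma=3.0, max_shift=20):
    """Shift s (in samples) maximising the summed convolution at known peaks."""
    g = gaussian(sigma, int(4 * sigma))
    best_shift, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = sum(convolve_at(f, g, p + s) for p in known_positions)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift
```

A positive result means the observed peaks sit to the right of their known positions, so the spectrum would be moved left by that many samples to calibrate it.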
Calibration
Before Calibration After Calibration
Wavelet Denoising
Wavelet decomposition : S = A + D
A : low frequency; D : high frequency
Wavelet Decomposition Tree
S
A1 D1
A2 D2
A3 D3
S = A1 + D1
= A2 + D2 + D1
= A3 + D3 + D2 + D1
Wavelet Denoise
Remove noise by
thresholding
[Figure panels: D1, D2, D3, D4, D5]
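As an illustration of decomposing, thresholding the detail coefficients, and reconstructing, here is a minimal sketch using the Haar wavelet with soft thresholding. The slides do not state which wavelet or threshold rule was used, so both choices, along with the `levels` and `thr` parameters, are assumptions. The signal length must be divisible by 2**levels.

```python
import math

def haar_step(signal):
    """One level: split into approximation (A) and detail (D) coefficients."""
    r = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / r for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / r for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Rebuild the finer-level signal from one A/D pair."""
    r = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / r, (a - d) / r])
    return out

def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; small ones become zero."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def denoise(signal, levels=3, thr=1.0):
    """Decompose, soft-threshold every detail band, then reconstruct."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(soft_threshold(d, thr))
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx
```

With `thr=0` the signal is reconstructed unchanged, which mirrors S = A3 + D3 + D2 + D1.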
Baseline Correction
Spline curve to fit the local minima
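A simplified sketch of the idea: find the local minima of the spectrum, fit a curve through them, and subtract it. To stay dependency-free this example swaps the spline for piecewise-linear interpolation, so it is a stand-in for the method named above rather than the same fit. On noisy data the minima would normally be taken over windows instead of point by point.

```python
def local_minima(y):
    """Indices of interior local minima, plus both end points as anchors."""
    idx = [0]
    idx += [i for i in range(1, len(y) - 1)
            if y[i] <= y[i - 1] and y[i] <= y[i + 1]]
    idx.append(len(y) - 1)
    return idx

def baseline_correct(y):
    """Subtract a baseline interpolated linearly through the local minima.

    Assumption: linear interpolation replaces the spline fit for brevity.
    """
    anchors = local_minima(y)
    baseline = [0.0] * len(y)
    for a, b in zip(anchors, anchors[1:]):
        for i in range(a, b + 1):
            frac = (i - a) / (b - a)
            baseline[i] = y[a] + frac * (y[b] - y[a])
    return [v - base for v, base in zip(y, baseline)]
```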
Peak Detection
(1) local maxima
(2) pass signal/noise cutoff to filter out small peaks
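The two criteria can be sketched as below. The noise estimate (median absolute deviation around the median intensity) and the default cutoff are assumptions for illustration; the slides do not say how noise was measured.

```python
import statistics

def detect_peaks(y, snr_cutoff=3.0):
    """Indices of local maxima whose signal/noise ratio reaches snr_cutoff.

    Noise is estimated as the median absolute deviation (an assumption).
    """
    med = statistics.median(y)
    noise = statistics.median([abs(v - med) for v in y])
    if noise == 0:
        raise ValueError("noise estimate is zero; cannot form a ratio")
    peaks = []
    for i in range(1, len(y) - 1):
        is_local_max = y[i] > y[i - 1] and y[i] > y[i + 1]
        if is_local_max and (y[i] - med) / noise >= snr_cutoff:
            peaks.append(i)
    return peaks
```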
Peak Distribution
Common Peaks Finding
[Figure: peak distribution; x-axis: Mass (m/z); y-axis: Number of Spectra]
Peak location : local maximum
Boundaries : adjacent local minima
Filter : > 5% of spectra show peaks
Kernel Density of Peaks Distribution
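One way to sketch common peak finding: pool the peak locations from all spectra, smooth them with a Gaussian kernel density, take local maxima as peak locations and adjacent local minima as boundaries, then keep peaks present in more than a given fraction of spectra. The grid, the bandwidth, and the function names are assumptions for illustration.

```python
import math

def kde(points, grid, bandwidth):
    """Unnormalised Gaussian kernel density of the points on the grid."""
    return [sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
            for g in grid]

def common_peaks(peaks_per_spectrum, grid, bandwidth, min_fraction=0.05):
    """Return (location, left boundary, right boundary) for each common peak."""
    pooled = [p for peaks in peaks_per_spectrum for p in peaks]
    dens = kde(pooled, grid, bandwidth)
    n = len(grid)
    minima = [0] + [i for i in range(1, n - 1)
                    if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]] + [n - 1]
    maxima = [i for i in range(1, n - 1)
              if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    result = []
    for m in maxima:
        lo = grid[max(i for i in minima if i < m)]
        hi = grid[min(i for i in minima if i > m)]
        # count the spectra that have at least one peak inside the boundaries
        hits = sum(any(lo <= p <= hi for p in peaks)
                   for peaks in peaks_per_spectrum)
        if hits / len(peaks_per_spectrum) > min_fraction:
            result.append((grid[m], lo, hi))
    return result
```

The bandwidth controls how readily neighbouring peaks merge into one common peak, so it is one of the thresholds that needs tuning.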
Feature Quantification
• Normalization : standardize the AUC for all
spectra to the median AUC
• Peak intensity: AUC within peak
boundaries
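A minimal sketch of both steps, assuming trapezoidal integration for the AUC (the slides do not name the integration rule). The optional `target` argument is a hypothetical parameter that lets a previously computed median AUC be reused.

```python
import statistics

def auc(mz, intensity):
    """Area under the curve by the trapezoidal rule (an assumed choice)."""
    return sum(0.5 * (intensity[i] + intensity[i + 1]) * (mz[i + 1] - mz[i])
               for i in range(len(mz) - 1))

def normalize_to_median_auc(mz, spectra, target=None):
    """Scale every spectrum so that its AUC equals the median AUC."""
    areas = [auc(mz, s) for s in spectra]
    if target is None:
        target = statistics.median(areas)
    scaled = [[v * target / a for v in s] for s, a in zip(spectra, areas)]
    return scaled, target

def peak_intensity(mz, intensity, lo, hi):
    """AUC restricted to the peak boundaries [lo, hi]."""
    pts = [(m, v) for m, v in zip(mz, intensity) if lo <= m <= hi]
    return auc([m for m, _ in pts], [v for _, v in pts])
```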
Threshold Selection
• Denoise: wavelet threshold
• Peak detection: signal to noise ratio
• Common peak finding: bandwidth of KDS
[Figure: noise/signal ratio versus m/z (in 1000 daltons)]
Training Set
• 35 pretreatment (EGFR+VEGF) serum
samples from stage IIIB/IV NSCLC; 14
male, 21 female; age range 36-72
• Experimental Design
                Day 1                    Day 2                    Day 3
Blood sample    pt1-pt35                 pt1-pt35                 pt1-pt35
Replication     3 replications each pt   3 replications each pt   3 replications each pt
105 samples were randomly spotted in two 64-well
plates each day.
Training Set
290 good spectra; 25 bad spectra
174 features (m/z 3,000-20,000) after data
preprocessing
Variance Components
• MALDI-TOF
• Data preprocessing for raw
spectra
• Build a survival prediction
model from training set
• Model validation
Procedure of Constructing Survival
Prediction Model from Training Set
(1) Feature selection
– CPH model
$h_i(t \mid x_{j,i}) = h_0(t)\,\exp(\beta_j x_{j,i})$
$x_{j,i}$ : intensity for feature $j$ of patient $i$
– FDR cutoff 0.05 : 11 features associated with
survival
(2) Create a compound score as a prediction
index.
$s_i = \sum_{j=1}^{k} (\text{sign of } \beta_j)\, w_j\, x_{j,i}$
$w_j$ : Wald statistics
(3) Prediction model : predicted hazard rate
$h_i(t \mid s_i) = h_0(t)\,\exp(0.0235\, s_i)$
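The compound score and the predicted hazard can be sketched as follows. The sketch assumes each Wald statistic is given as a non-negative magnitude, so the sign of the fitted coefficient has to be applied separately; if signed Wald z-values were used instead, the sign factor would be dropped. The function names are illustrative.

```python
import math

def compound_score(intensities, betas, wald):
    """s_i = sum over features j of sign(beta_j) * w_j * x_ji for one patient.

    Assumption: wald holds non-negative Wald statistics.
    """
    return sum(math.copysign(1.0, b) * w * x
               for b, w, x in zip(betas, wald, intensities))

def relative_hazard(score, coef=0.0235):
    """exp(coef * s_i): the predicted hazard relative to the baseline h_0(t)."""
    return math.exp(coef * score)
```

Under this model, a 10-point increase in the compound score multiplies the hazard by exp(0.235), roughly 1.26.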
• MALDI-TOF
• Data preprocessing for raw
spectra
• Build a prediction model from
training set
• Model validation
Overfitting
[Figure: Training set | Independent test set]
Model Validation
Goal : evaluate how the model
will perform on future datasets
• External validation: independent test
set
• Internal validation: training set
External Validation
Data Preprocessing for Independent Test
Set
• Calibration
• Baseline correction
• Wavelet denoise
• Normalization : median AUC from training
set
• Peak location and boundaries from
training set
Prediction for Independent Test Set
Compound score as a predictor
$s'_i = \sum_{j=1}^{k} (\text{sign of } \beta_j)\, w_j\, x'_{j,i}$
Freeze $w_1, w_2, \ldots, w_k$ and $\varphi$ from training set
CPH : predicted hazard rate
$h'_i(t) = h_0(t)\,\exp(0.0235\, s'_i)$
Association with
survival outcome.
Validation of Predictive Model
• What to validate ?
– Predictive ability
• C-index (Harrell et al. 1982) : measures the
agreement between predicted and
observed survival times over pairs of subjects.
C-index ranges from 0 to 1.
1 : perfect prediction; 0.5 : random prediction;
0 : opposite prediction
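A minimal sketch of the C-index for right-censored data. A pair of subjects is usable when the one with the shorter time had an observed event, and it is concordant when that subject also has the higher predicted risk. Counting ties in the risk score as one half is a common convention, assumed here.

```python
def c_index(times, events, risk_scores):
    """Concordance index for right-censored survival data.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher score means higher predicted hazard.
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # usable pair: subject i fails before subject j's follow-up time
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```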
Independent Test Sets
• Eastern Cooperative Oncology Group
(ECOG) (n=82)
• Italian (n=66)
KM Survival
[Figure: Kaplan-Meier survival curves; groups: <121, [121, 127), [127, 143), [143, 173), >173; labels: ECOG (n=82), (n=61), (n=35), (n=66); Cox model : p < 0.001]
C-index
Training set 0.77
Test set 0.62 (Italian)
Training C-index : the C-index on the training set
Generalized C-index : the C-index on the independent test set
Internal Validation
Goal : estimate the generalized C-index
through the internal validation
process
• Data splitting
• K-fold cross validation
• Bootstrap
Focus on the procedure after data preprocessing
Data Splitting
(1) Build a prediction model based on
training set
(2) Compound score for test set: winners (selected features)
and Wald statistics from training set
(3) Generalized C-index: calculate the C-index
from the test set
[Diagram: Training | Test]
K-fold Cross Validation
Sample of size n
divided into K = 5 parts; each part is used once as the test set
C-index for each test set
Combine the C-indices from all test sets to
estimate the generalized C-index
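The partitioning step can be sketched as below. The function name and the seed are illustrative; the model is rebuilt on each `train` part and its C-index is computed on the matching `test` part.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; every index is in exactly one test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

For n = 35 and K = 5, each test set holds 7 patients and each model is built on the other 28.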
Bootstrap
Original training set (n)
Sampling with replacement
Bootstrap training samples (n) : $B_1, B_2, \ldots, B_m$
Bootstrap test sample : obs not in bootstrap training : $B_1^-, B_2^-, \ldots, B_m^-$
C-index labels in the diagram : $C_{B_1}, \ldots, C_{B_m}$, $C_{T_1}$
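A minimal sketch of the resampling scheme: each bootstrap training sample draws n indices with replacement from the original training set, and the matching bootstrap test sample holds the observations that were never drawn. The function name and the seed are illustrative.

```python
import random

def bootstrap_splits(n, m, seed=0):
    """Yield m (training, test) index pairs.

    training: n indices drawn with replacement;
    test: the indices that were not drawn into the training sample.
    """
    rng = random.Random(seed)
    for _ in range(m):
        training = [rng.randrange(n) for _ in range(n)]
        drawn = set(training)
        test = [i for i in range(n) if i not in drawn]
        yield training, test
```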
Bootstrap
• Generalized C-index
$C(0.632+) = (1 - W)\, C_{\text{training}} + W \cdot \frac{1}{m}\sum_{i=1}^{m} C_{\text{test}}^{T_i}$
– W (weight) : $W = \dfrac{0.632}{1 - 0.368\,R}$, $W \in [0.632, 1]$
– R (relative overfitting rate) : $R = \dfrac{C_{\text{training}} - \frac{1}{m}\sum_{i=1}^{m} C_{\text{test}}^{T_i}}{C_{\text{training}} - \gamma}$, $R \in [0, 1]$
– $\gamma$ = (non-informative C-index) = 0.5
C-index(0.632+) ranges from C-index(0.632) if there is minimum
overfitting (R=0) to $\frac{1}{m}\sum_{i=1}^{m} C_{\text{test}}^{T_i}$ if there is maximum overfitting (R=1)
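The 0.632+ combination can be sketched as below, with W the weight, R the relative overfitting rate, and γ = 0.5 the non-informative C-index. Setting R to 0 when the training C-index does not exceed both γ and the mean test C-index, and capping R at 1, are assumptions made to keep R inside [0, 1].

```python
def c_index_632_plus(c_training, c_tests, gamma=0.5):
    """Combine training and bootstrap-test C-indices with the 0.632+ weight.

    c_tests: C-indices from the m bootstrap test samples.
    Clipping R to [0, 1] is an assumption of this sketch.
    """
    c_test = sum(c_tests) / len(c_tests)
    if c_training > gamma and c_training > c_test:
        r = min((c_training - c_test) / (c_training - gamma), 1.0)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * c_training + w * c_test
```

With no overfitting the result is the plain 0.632 estimate; with maximum overfitting all weight moves to the mean bootstrap test C-index.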
C-index
Method           NSCLC (n=35)     Overfit Example (n=77)
Training set     0.77             0.69
Bootstrap        0.71             0.53
Indep test set   0.62 (Italian)   0.28
Reproducibility of MALDI-TOF MS
Winners of Case Study (11)   Winners of JNCI 2007 (8)
EGFR+VEGF                    EGFR
4121                         5843
4596                         11445
4720                         11529
4821                         11685
5720                         11759
5841                         11903
11441                        12452
11528                        12579
11684
11731
11902
Take Home Message
• Good experimental design
• Precisely follow the MALDI-TOF protocol
• MALDI-TOF can detect the true signal
Acknowledgements
Stuart Salmon Dean Billheimer
Shuo Chen Yu Shyr
Roy Herbst Ju-Whei Lee
Anne Tsao Pierre Massion
Hai Tran Julie Brahmer
Alan Sandler Joan Schiller
David Carbone Thao P. Dang
