Data Analysis Tool Extension (DAnTE)

Category Proteomics>Mass Spectrometry Analysis/Tools and Genomics>Gene Expression Analysis/Profiling/Tools

Abstract Data Analysis Tool Extension (DAnTE) is a statistical tool designed to address challenges associated with quantitative bottom-up, shotgun proteomics data.

This tool has also been demonstrated for microarray data and can easily be extended to other high-throughput data types.

DAnTE features selected normalization methods, missing value imputation algorithms, peptide-to-protein rollup methods, an extensive array of plotting functions and a comprehensive hypothesis-testing scheme that can handle unbalanced data and random effects.

The graphical user interface (GUI) is designed to be very intuitive and user friendly.

Developed to address issues common to proteomics data, DAnTE is readily extensible. Although its target application is high-throughput proteomics, it has also been demonstrated on microarray data and can be applied to other high-throughput ‘omics’ data with similar characteristics (e.g., metabolomics data).

DAnTE Application features/capabilities --

1) Data loading - The input data to DAnTE can be any file that stores tabular data, including flat files [either Comma Separated Values (CSV) or tab-delimited text files], and Microsoft Excel files.

A unique feature of the data loading mechanism is that it preserves ‘peptide-to-protein mapping’ information for later use, both when ‘plotting peptides’ that belong to a particular protein and in the peptide-to-protein rollup methods.

In addition, DAnTE can also process SEQUEST (see G6G Abstract Number 20285) results and create ‘spectral count’ tables.

2) Factor definitions - Factors are used to capture the fixed and random effects in experimental design. For example, the ‘biological condition’ is a ‘fixed effect factor’; while a list of liquid chromatography (LC) columns used to separate the samples can be treated as a random effect.

This information is vital in normalization, imputation and ‘hypothesis testing methods’ in DAnTE. Factors can either be declared once the data is loaded or be loaded from a flat file.

3) Investigative plots - Various statistical plots, including histograms, box plots, correlation diagrams and MA (or R-I: ratio-intensity) plots can be plotted in DAnTE.

These plots help the user evaluate reproducibility within the ‘study set’ and single out problematic datasets so that they can be excluded from further analysis.
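As an illustration of what an MA (ratio-intensity) plot displays, the sketch below computes the plotted coordinates for two datasets. It is not DAnTE's code; the function name `ma_values`, the use of log base 2, and the convention that `None` marks a missing value are assumptions made for this example.

```python
import math

def ma_values(x, y):
    """Return (A, M) pairs for two paired lists of positive intensities.

    A is the mean log2 intensity, M is the log2 ratio. Pairs in which
    either value is missing (None) are skipped. Hypothetical helper.
    """
    pairs = []
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue
        lx, ly = math.log2(xi), math.log2(yi)
        pairs.append(((lx + ly) / 2.0, lx - ly))
    return pairs

points = ma_values([8, 4, None], [2, 4, 1])
```

For two reproducible datasets the M values should scatter around zero across the whole range of A; a trend or offset suggests that normalization is needed.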

4) Data normalization - As normalization is arguably the most important step in downstream data analysis, DAnTE employs several normalization methods that have been successfully tested for both ‘proteomics data’ and ‘microarray genomics data’.

Among them are a ‘robust linear regression’ method, lowess method and a ‘quantile normalization’ method. In addition, global intensity adjustment based on median absolute deviation (MAD) and central tendency adjustment methods are also available.
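The idea behind quantile normalization can be sketched as follows. This is a minimal illustration, not DAnTE's implementation: it assumes each dataset is a column of equal length with no missing values, breaks ties by position, and uses the hypothetical name `quantile_normalize`.

```python
def quantile_normalize(columns):
    """Give every column the same distribution of values.

    columns: list of datasets, each a list of numbers of equal length.
    The value at rank r in each column is replaced by the mean of the
    rank-r values across all columns. Ties are broken by position.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / len(columns)
                  for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result

normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After the call, both columns contain the same set of values (1.5, 3.5, 5.5), arranged according to each column's original ordering.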

5) Missing value imputation - Incomplete datasets due to ‘missing values’ are common with high-throughput proteomics. As imputing these values is a much-debated topic, DAnTE offers several simple methods, as well as some advanced algorithms to choose from.

The simple methods allow the user to fill in missing values with either the dataset mean/median or with a pre-chosen constant.

Advanced methods include filling in with a ‘row mean’ based on a user-defined factor, K-nearest neighbor imputation (KNNimpute), and singular value decomposition-based imputation (SVDimpute).
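The factor-based ‘row mean’ option can be pictured with the sketch below. It is an illustration under stated assumptions rather than DAnTE's code: missing values are represented by `None`, the factor is given as one level label per dataset, and the name `impute_row_mean_by_factor` is hypothetical.

```python
def impute_row_mean_by_factor(row, factor_levels):
    """Fill missing values in one peptide row with the mean of the
    observed values that share the same factor level.

    row: abundances for one peptide, None where missing.
    factor_levels: one label per dataset (same length as row).
    A value stays None if its level has no observed values at all.
    """
    sums, counts = {}, {}
    for value, level in zip(row, factor_levels):
        if value is not None:
            sums[level] = sums.get(level, 0.0) + value
            counts[level] = counts.get(level, 0) + 1
    filled = []
    for value, level in zip(row, factor_levels):
        if value is None and level in counts:
            value = sums[level] / counts[level]
        filled.append(value)
    return filled

filled = impute_row_mean_by_factor(
    [10.0, None, 14.0, None, None],
    ["control", "control", "treated", "treated", "other"],
)
```

In this example the missing control value becomes 10.0, the missing treated value becomes 14.0, and the last value remains missing because no observation exists for its level.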

6) Peptide-to-protein rollup - In most proteomics methods, peptide measurements are rolled up to corresponding protein abundances.

Ideally, all peptides from a single protein should have similar abundances that manifest as similar ‘signal intensities’; in reality, however, many factors, such as digestion efficiency and electrospray ionization efficiency, can affect the identifications and abundances or signal intensities of peptides.

In the RRollup method available in DAnTE, peptides that originate from the same protein are first scaled on the basis of a chosen ‘reference peptide’ in order to bring all ‘peptide profiles’ across biological conditions to the same level and then averaged to obtain the protein abundance.

During scaling, the peptide with the most observations is chosen as the reference peptide and its total abundance across datasets is used as a tiebreaker.

In the ZRollup method, a scaling method similar to z-scores (except that medians instead of means from peptide profiles across biological conditions are used) is applied first to peptides that originate from a single protein and then the scaled peptides are averaged to obtain relative protein abundance.
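A minimal sketch of this median-centered scaling is given below. It assumes the scaled value is (value - row median) / row standard deviation, that missing values are `None`, and that peptides with fewer than two observations or zero spread are dropped; these choices and the name `zrollup` are illustrative assumptions, and outlier exclusion is omitted.

```python
from statistics import mean, median, stdev

def zrollup(peptides):
    """Scale each peptide row like a z-score (median instead of mean),
    then average the scaled rows per dataset."""
    scaled = []
    for row in peptides:
        observed = [v for v in row if v is not None]
        if len(observed) < 2:
            continue  # spread is undefined for a single observation
        spread = stdev(observed)
        if spread == 0:
            continue  # a flat profile cannot be scaled
        center = median(observed)
        scaled.append([None if v is None else (v - center) / spread
                       for v in row])

    if not scaled:
        return None
    protein = []
    for j in range(len(scaled[0])):
        col = [row[j] for row in scaled if row[j] is not None]
        protein.append(mean(col) if col else None)
    return protein

profile = zrollup([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

Because each peptide is expressed in units of its own spread, the result is a relative profile across conditions rather than an abundance on the original scale.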

In both the RRollup and ZRollup methods, outlying peptide values are excluded from protein abundance calculations using Grubbs' outlier test.

In the third method, QRollup, peptides are selected on the basis of a user-selected abundance cutoff value, and protein abundance is calculated as the average of the selected peptides.
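One possible reading of this cutoff-based selection is sketched below. How DAnTE applies the cutoff is not specified here, so the example assumes an absolute threshold on each peptide's mean observed abundance; that interpretation, the `None` convention for missing values, and the name `qrollup` are all assumptions.

```python
from statistics import mean

def qrollup(peptides, cutoff):
    """Average only the peptide rows whose mean observed abundance is
    at least `cutoff`. Returns None if no peptide passes."""
    def row_mean(row):
        observed = [v for v in row if v is not None]
        return mean(observed) if observed else None

    selected = []
    for row in peptides:
        m = row_mean(row)
        if m is not None and m >= cutoff:
            selected.append(row)
    if not selected:
        return None

    protein = []
    for j in range(len(selected[0])):
        col = [row[j] for row in selected if row[j] is not None]
        protein.append(mean(col) if col else None)
    return protein

profile = qrollup([[10.0, 12.0], [2.0, None], [14.0, 16.0]], cutoff=5.0)
```

In this example the low-abundance second peptide is excluded, and the protein profile is the average of the remaining two.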

7) Analytical algorithms - DAnTE offers several well-characterized algorithms to further explore patterns in the data. Traditional principal component analysis (PCA) and associated scores and loadings plots can be useful as an unsupervised way of finding the principal variation in the data.

In contrast, the partial least squares method available in DAnTE can be used as a discrimination procedure whereby the grouping information is assigned using factors.

Hierarchical and k-means clustering methods on features/samples are also available as part of the heat map plotting function.

8) Hypothesis testing - A comprehensive analysis of variance (ANOVA) scheme for ‘unbalanced studies’ that uses marginal sums of squares and mixed models is included in DAnTE.

The user can also test for interactions among factors in a multi-way ANOVA. Q-values are calculated along with the p-values in order to control the false discovery rate in multiple testing.
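To show how a false discovery rate adjustment works in general, the sketch below computes Benjamini-Hochberg adjusted p-values. This is a simpler, widely used stand-in chosen for illustration; the text above does not say which q-value estimator DAnTE uses, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Each p-value is multiplied by m / rank (rank 1 = smallest), then a
    running minimum from the largest rank down keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

Features whose adjusted value falls below a chosen threshold (for example 0.05) are reported as significant at that false discovery rate.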

In addition, DAnTE can check whether the data follow a normal distribution using the Shapiro-Wilk test, and it offers two non-parametric hypothesis tests (the Wilcoxon rank-sum test and the Kruskal-Wallis test) for cases where the normality assumption fails to hold.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site DAnTE

Price Contact manufacturer.

G6G Abstract Number 20549

G6G Manufacturer Number 104020