PileLine (Pileup pipeLine)
Category Cross-Omics>Next Generation Sequence Analysis/Tools and Cross-Omics>Workflow Knowledge Bases/Systems/Tools
Abstract PileLine (Pileup pipeLine) is a novel and flexible command-line toolbox for efficient management, filtering, comparison and annotation of Genomic Position (GP) files.
The toolbox has been designed to be memory efficient by performing fast seek on-disk operations over sorted GP files.
Based on the combination of basic core operations, PileLine provides several functionalities, including:
1) Quick filtering and search within GP files without indexing steps;
2) Full standard annotation with human dbSNP, HGNC Gene Symbol, and Ensembl IDs;
3) Custom annotation through standard .bed or .gff files;
4) Two-sample (e.g., case vs. control) and n-sample comparison at the variant level;
5) Generation of outputs compatible with the ‘Sorting Tolerant From Intolerant’ (SIFT) algorithm, firestar (an expert system for predicting ligand-binding residues in protein structures), and the PolyPhen (Polymorphism Phenotyping) server, for predicting the consequences of non-synonymous coding variants on protein function;
6) Genotyping quality control (QC) test for estimating performance metrics on detecting homozygous/heterozygous variants against a given gold standard genotype; and
7) Modular design to facilitate the inclusion of new functionalities.
The PileLine toolbox contains 10 command-line utilities that have been designed to be memory efficient by performing on-disk operations over sorted GP files.
By combining these utilities with different arguments and options, the user can sketch and execute diverse workflows, which can be further enhanced with third-party software applications.
PileLine Implementation --
PileLine was coded in Java and consists of a set of command-line utilities that are easy to integrate into custom workflows or user-friendly frameworks like Galaxy.
The tools comprising PileLine are focused on two (2) different but complementary activities:
1) Processing and annotation, implementing simple but reusable operations over input GP files; and
2) Analysis, giving support to more advanced and specific requirements.
The primary input data of PileLine are GP files (e.g., .pileup files generated by SAMtools) containing the chromosome name and the coordinate position as the first two (2) columns.
The main design principle of the PileLine toolbox is to avoid loading input data into memory, so core functions operate directly on disk.
One of the available command-line tools is fastseek, which performs a direct binary search on sorted GP files without requiring an additional index to be created.
This functionality provides direct access to any range of genomic coordinates without loading the whole file into memory.
First, fastseek finds the first and last lines of each sequence; it then performs a binary search on the lines belonging to the queried sequence in order to find the first position within the specified range.
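The index-free lookup described above can be illustrated with a minimal Python sketch. It is not PileLine's Java implementation: the function names are hypothetical, and instead of first locating each sequence's boundaries it runs a single binary search over byte offsets using the composite key (chromosome, position). It therefore assumes a tab-separated file with no blank lines, sorted by chromosome name in byte order and then by numeric position.

```python
import os
import tempfile


def _key(line: bytes):
    # Sort key of a GP line: (chromosome, position) from the first two columns.
    fields = line.rstrip(b"\r\n").split(b"\t")
    return fields[0], int(fields[1])


def _line_start_at_or_after(f, offset: int) -> int:
    # Byte offset of the first line that starts at or after `offset`.
    if offset == 0:
        return 0
    f.seek(offset - 1)
    f.readline()  # consume the remainder of the line containing byte offset-1
    return f.tell()


def fastseek_sketch(path: str, chrom: str, start: int, end: int):
    """Return the lines of a sorted GP file with start <= position <= end on chrom."""
    chrom_b = chrom.encode()
    target = (chrom_b, start)
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose next full line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(_line_start_at_or_after(f, mid))
            line = f.readline()
            if not line or _key(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        # Stream forward from the first matching line until the range ends.
        f.seek(_line_start_at_or_after(f, lo))
        hits = []
        for line in f:
            c, p = _key(line)
            if c != chrom_b or p > end:
                break
            hits.append(line.decode().rstrip("\r\n"))
        return hits


# Small demonstration on a made-up five-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pileup")
    with open(demo_path, "w", newline="\n") as out:
        out.write("chr1\t5\tA\nchr1\t10\tC\nchr1\t20\tG\nchr2\t3\tT\nchr2\t8\tA\n")
    hits_chr1 = fastseek_sketch(demo_path, "chr1", 10, 20)
    hits_chr2 = fastseek_sketch(demo_path, "chr2", 1, 5)
    hits_none = fastseek_sketch(demo_path, "chr3", 1, 100)
```

Only a logarithmic number of lines is read before the matching range is streamed, so memory use stays constant regardless of file size.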
The second design principle of PileLine is focused on flexibility and modularity.
Thus, PileLine tools may be combined with standard UNIX commands allowing custom data analysis workflows.
Moreover, the modular design of the PileLine toolbox facilitates the inclusion of additional functionalities.
With respect to file formats, 2smc, nsmc, pileup2sift, pileup2polyphen and pileup2firestar work specifically with the SAMtools .pileup file format, whereas fastseek, fastjoin, rfilter, sort and genotest work with generic GP files (e.g., .pileup, .vcf, .gff, .bed).
Summary of PileLine functionalities --
Processing and annotation -
fastseek - Retrieves all lines within a specified genome range.
fastjoin - Joins two GP input files by genomic coordinate. It can also perform left and right outer joins, which additionally print orphan lines (lines with no match in the other file).
rfilter - Selects only those positions inside at least one of a given set of intervals (.bed or .gff files). It also implements an annotation mode to report all positions plus an extra column containing all the intervals in which each position is contained.
sort - Sorts a GP file by genomic coordinate. SAMtools generated pileup files are usually sorted.
pileup2sift - Generates a SIFT-compatible change column for each variant line in the GP file.
pileup2polyphen - Generates a PolyPhen-2-compatible change column for each variant line in the GP file.
pileup2firestar - Generates a firestar-compatible input for each variant line in the GP file.
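To illustrate the kind of coordinate join that fastjoin performs on sorted inputs, the following Python sketch implements a streaming merge join. It is a sketch under stated assumptions, not PileLine's code: the function name, the `how` parameter, and the tab-joined output layout are hypothetical, both inputs are assumed to be sorted by (chromosome, position) with at most one line per position, and chromosomes are compared as plain strings.

```python
def gp_key(line: str):
    # Join key of a GP line: (chromosome, position) from the first two columns.
    chrom, pos = line.split("\t")[:2]
    return chrom, int(pos)


def fastjoin_sketch(left, right, how="inner"):
    """Merge-join two sorted iterables of GP lines.

    how: 'inner', 'left' (also emit unmatched left lines) or
         'right' (also emit unmatched right lines). Hypothetical parameter.
    """
    li, ri = iter(left), iter(right)
    l, r = next(li, None), next(ri, None)
    while l is not None and r is not None:
        kl, kr = gp_key(l), gp_key(r)
        if kl == kr:
            yield l + "\t" + r          # matched pair
            l, r = next(li, None), next(ri, None)
        elif kl < kr:
            if how == "left":
                yield l                 # orphan from the left file
            l = next(li, None)
        else:
            if how == "right":
                yield r                 # orphan from the right file
            r = next(ri, None)
    # One input is exhausted; the rest of the other side are orphans.
    while l is not None and how == "left":
        yield l
        l = next(li, None)
    while r is not None and how == "right":
        yield r
        r = next(ri, None)


# Made-up example data.
left_lines = ["chr1\t5\tA", "chr1\t10\tC", "chr2\t3\tG"]
right_lines = ["chr1\t10\tT", "chr2\t3\tG", "chr2\t9\tA"]
inner = list(fastjoin_sketch(left_lines, right_lines))
left_outer = list(fastjoin_sketch(left_lines, right_lines, how="left"))
right_outer = list(fastjoin_sketch(left_lines, right_lines, how="right"))
```

Because each input is consumed once, in order, the join holds only one line per file in memory at a time, which matches the toolbox's on-disk design principle.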
Analysis -
2smc - Compares two samples (e.g., case vs. control) by retrieving all positions where the genotype is discrepant between the two samples. For each sample a variant GP file is needed, as well as the complete GP file (which includes the invariant positions).
nsmc - Compares n samples of two conditions (e.g., case vs. control). Taking one GP file per sample, it reports which samples contain each position and also performs a Fisher’s exact test to find reproducible and characteristic positions.
genotest - Performs a QC test on genotyping. Compares two genotypes (experimental vs. gold standard) and evaluates the performance on detecting homozygous/heterozygous variants. It also generates data to plot a ROC curve in order to estimate the best SNP quality threshold.
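As an illustration of the Fisher's exact test mentioned for nsmc, the Python sketch below computes a two-sided p-value for a single position from a 2x2 table of sample counts. The table layout (cases and controls with or without the variant), the two-sided alternative, and the function name are assumptions made for this example; the source does not state how nsmc builds its table or which alternative it tests.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test on the 2x2 table

                    variant   no variant
        cases          a          b
        controls       c          d

    The p-value is the total hypergeometric probability of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that x of the col1 variant carriers are cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Made-up counts: variant seen in 3 of 3 cases and 0 of 3 controls.
p_separated = fisher_exact_two_sided(3, 0, 0, 3)
# Variant seen in 1 of 2 cases and 1 of 2 controls.
p_balanced = fisher_exact_two_sided(1, 1, 1, 1)
```

A small p-value indicates a position whose presence differs between the two conditions more than expected by chance; with only a handful of samples, even a perfect separation cannot reach very small p-values.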
PileLine GUI --
PileLine GUI is a front-end to the PileLine toolbox, plus a Genome browser.
With this intuitive graphical desktop application you can run the following tasks:
1) Run processing commands on GP files, such as seek, join, annotate, and filter.
2) Perform 2-sample and n-sample point somatic mutation calling (via the PileLine 2smc and nsmc commands).
3) Browse GP files in an interactive local Genome browser.
SAMtools software package --
SAMtools is a library and software package for parsing and manipulating alignments in the Sequence Alignment/Map (SAM) / Binary Alignment/Map (BAM) format.
It is able to convert from other alignment formats, sort and merge alignments, remove PCR duplicates, generate per-position information in the pileup format, call SNPs and short Indel variants, and show alignments in a text-based viewer.
SAMtools has two (2) separate implementations, one in C and the other in Java, with slightly different functionality.
System Requirements
Contact manufacturer.
Manufacturer
- Higher Technical School of Computer Engineering
- University of Vigo
- Ourense, Spain
- And
- Bioinformatics Unit (UBio)
- Structural Biology and Biocomputing Programme
- Spanish National Cancer Research Centre (CNIO)
- Madrid, Spain
Manufacturer Web Site PileLine (Pileup pipeLine)
Price Contact manufacturer.
G6G Abstract Number 20780
G6G Manufacturer Number 104357