Features’ compendium for machine learning in NGS data Analysis

Alokkumar Jha, Abhinav Khare, Randeep Singh



Current studies of the cancer genome mostly rely on next-generation sequencing (NGS) technologies followed by data analysis pipelines. Many of these pipelines comprise tools that use machine learning algorithms, especially for downstream analysis. Features are important components of machine learning systems, and including informative features improves the accuracy of machine learning algorithms. The algorithms used in NGS analysis generate a huge feature space. This high dimensionality can lead to slower analysis and lower accuracy, owing to inherent bias of the model and/or redundancy among some features. With growing interest in NGS studies, new NGS analysis tools have been developed rapidly, and existing ones have been improved by including new features and excluding redundant ones. To enable these developments, there is a pressing need to standardize the plethora of features available in the literature.


The current work presents a compendium of features that have been used in the literature for machine learning in NGS data pipelines and analysis. The features are further classified by treating each stage of NGS data processing as a separate category: (a) pre-processing features, (b) sequencing-technology-specific features, and (c) downstream features, that is, features for biological interpretation and analysis. This categorization will facilitate the use of correct features in a simplified manner.


The work will facilitate a uniform model for developing NGS tools that use machine learning approaches to study cancer data. A model for a feature database and its management, based on this standardization, is also proposed.
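As a rough illustration of how the three-category classification could back a small feature catalogue, the sketch below tags each feature with its processing stage and groups the catalogue by stage. It is not the database model proposed in the paper: the class names, the `group_by_stage` helper, and all three feature entries are hypothetical placeholders chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The three categories named in the abstract."""
    PREPROCESSING = "pre-processing"
    TECHNOLOGY_SPECIFIC = "sequencing technology specific"
    DOWNSTREAM = "downstream / biological interpretation"


@dataclass(frozen=True)
class Feature:
    name: str
    stage: Stage
    description: str


# Hypothetical catalogue entries (assumptions, not taken from the paper).
FEATURES = [
    Feature("mean_base_quality", Stage.PREPROCESSING,
            "average base quality of reads covering a site"),
    Feature("homopolymer_run_length", Stage.TECHNOLOGY_SPECIFIC,
            "length of the homopolymer run around a site"),
    Feature("variant_allele_frequency", Stage.DOWNSTREAM,
            "fraction of reads supporting the variant allele"),
]


def group_by_stage(features):
    """Return a dict mapping each stage to the feature names it contains."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature.stage].append(feature.name)
    return dict(grouped)


if __name__ == "__main__":
    for stage, names in group_by_stage(FEATURES).items():
        print(f"{stage.value}: {', '.join(names)}")
```

Keeping the stage as an explicit field means a tool can request only the features relevant to its step of the pipeline, which is the kind of lookup a standardized feature catalogue is meant to support.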


NGS cancer data, standard for features, feature selection, feature classification





