Emory Genetics develops New Genetic Tests, Next-Generation Technology
Improve Diagnosis of Congenital Disorders using NextGENe software
Read more:
http://shared.web.emory.edu/whsc/news/releases/2010/05/new-genetic-tests-technology-improve-diagnosis-of-congenital-disorders.html
NextGENe® is compatible with the Applied Biosystems SOLiD™ System, Roche Genome Sequencer FLX™ and Illumina® Genome Analyzer. It is designed for a biologist-friendly Windows® environment, significantly reducing the need for additional bioinformatics resources and costs. NextGENe runs on a low-cost 64-bit desktop hardware configuration with a minimum of 8 GB RAM.
NextGENe Applications:
Condensation Tool
SNP & INDEL Analysis
Variant Call Confidence Scoring
Variant Comparison Tool
NextGENe Viewer/Browser
Structural Variant Detection
Paired End Read Merging provides Sanger Quality sequence
Roche Genome Sequencer FLX SNP & Indel Detection
De novo assembly
Whole Genome Alignment
RNA-Seq Analysis
ChIP-Seq Analysis
Metagenomic Studies of Viral and Bacterial Infections
Serial Analysis of Gene Expression (SAGE) Studies
miRNA Discovery, Quantification
Sequence Analysis Using Barcode/Index Tags of Pooled Samples
Deep Sequencing Analysis
Targeted Enrichment Sequence Analysis
Alignment of Paired Reads
Condensation Tool
NextGENe®, with its patent pending Condensation Tool®, solves the 3 critical problems of 2nd generation sequence analysis:
- Reads too short
- High Error Rates
- Overwhelming Data Volume
Critical to assembly and most other applications, the Condensation Tool statistically polishes and lengthens short sequence reads into highly accurate fragments that are unique and manageable.
A quick before and after comparison:
| | Without Condensation | With Condensation Tool® |
| Read Length | 36 bp | 60 bp |
| Error Rate | 2-3% | 0.10% |
| Data Volume | 30,000,000 bp | 3,000,000 bp |
| Genome Matching | Low | Highly Accurate |
| INDEL Detection | No | Yes |
Short reads from the SOLiD System and the Illumina Genome Analyzer System are often not unique within the genome being analyzed. By clustering similar reads containing a unique anchor sequence, data of adequate coverage is condensed and the short reads are lengthened. The unique anchor sequence, or index, is a 12-base fragment that is found in several of the reads. All reads containing this exact sequence are clustered together. Often, many of the reads within a cluster contain homologous nucleotides both upstream and downstream of the index sequence. The read clusters can be sorted by these flanking shoulder regions into groups of similarity. NextGENe scans all 16.7 million possible indices (4^12) to cluster the reads, providing all possible consensus sequences. The consensus of each group is much longer than the original reads, and these 50 to 65 base pair fragments are often unique within the genome, with exceptions such as homopolymeric regions, repeats and duplications. The elongation of the short reads is essential to INDEL detection across fragments.
2nd Generation data prior to “clean-up” with the Condensation Tool has a high error rate, indicated by grey highlights, making accurate analysis tedious and time consuming.
In the same data set following use of the Condensation Tool, 99.9% of the sequencing errors are removed, allowing accurate identification of true SNPs and INDELs.

Condensation Tool clusters similar anchor sequences (CTGGGGTTACAG). The right shoulder of 8 nucleotides is divided into two groups differing in sequences of GTGTGAGC and GTGCCTGC. A consensus sequence is generated for each group, almost doubling the read lengths.
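The clustering idea described above can be illustrated with a small sketch. This is not NextGENe's implementation: the function names, the exact-match grouping on the shoulder, and the simple majority vote are all assumptions made for illustration, and the example reads are invented around the anchor and shoulder sequences quoted in the caption.

```python
from collections import Counter, defaultdict

ANCHOR = "CTGGGGTTACAG"  # 12-base anchor (index) quoted in the caption
SHOULDER = 8             # right-shoulder length used to split a cluster


def _vote(seqs):
    """Column-wise majority vote over sequences aligned at their first base."""
    out = []
    for i in range(max(len(s) for s in seqs)):
        column = Counter(s[i] for s in seqs if len(s) > i)
        out.append(column.most_common(1)[0][0])
    return "".join(out)


def condense(reads, anchor=ANCHOR, shoulder=SHOULDER):
    """Cluster reads on an exact anchor, group by right shoulder, build a consensus."""
    groups = defaultdict(list)
    for read in reads:
        pos = read.find(anchor)
        if pos < 0:
            continue  # read does not contain the anchor
        right = read[pos + len(anchor):]
        if len(right) < shoulder:
            continue  # too little flanking sequence to assign a group
        groups[right[:shoulder]].append((read[:pos], right))

    consensus = {}
    for key, members in groups.items():
        # Left flanks are aligned at their right end, so vote outward from the anchor.
        left = _vote([l[::-1] for l, _ in members])[::-1]
        right = _vote([r for _, r in members])
        consensus[key] = left + anchor + right
    return consensus


# Invented reads: three share the GTGTGAGC shoulder (one carries an error
# in its last base), two share the GTGCCTGC shoulder.
reads = [
    "ACGTAC" + ANCHOR + "GTGTGAGCTT",
    "GTAC" + ANCHOR + "GTGTGAGCTTAG",
    "TTACGTAC" + ANCHOR + "GTGTGAGCTA",
    "GTAC" + ANCHOR + "GTGCCTGCAA",
    "AC" + ANCHOR + "GTGCCTGCA",
]
result = condense(reads)
for shoulder_seq, seq in result.items():
    print(shoulder_seq, len(seq), seq)
```

Each group's consensus is longer than any single input read, and the isolated error in the third read is outvoted by the other two reads in its group.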
Condensation Application Note (PDF)
File Formats Application Note (PDF)
SNP & Indel Analysis
- 99% Accuracy in SNP Detection
- Small and Large INDEL detection
- Easy Gene Annotation
- Simple Navigation
- Rapid Review of Variants, Nucleotides and Amino Acids
- Quick Links to databases
- Easy exporting of results
NextGENe software, in combination with the Condensation Tool, provides high accuracy and sensitivity for the analysis of both short and long sequence reads. With the Condensation Tool, paired-end reads can be merged into one exceptionally accurate read roughly equal in length to the library size.
INDEL detection is unique with NextGENe Software. Deletions up to 33% of read length and Insertions up to 20% of read length are easily detected by NextGENe. Several cycles of the Condensation Tool can be applied to elongate short reads in order to locate large INDELS:
55 bp deletion detected using 50 bp reads

NextGENe software’s unique Condensation Tool elongated 50bp reads in order to discover this 55bp deletion, above, which is confirmed by Sanger Sequencing, lower image.
Data Compliments of Emory Genetics
SNP Detection

In this SOLiD™ data, left, a heterozygous C-CT was verified by Sanger Sequence (right).
Data Compliments of Emory Genetics
Download Application Notes:
Illumina SNP & INDEL Detection with NextGENe software (PDF)
SNP & INDEL Detection of SOLiD™ System Sequence Data with NextGENe (PDF)
SNP & Indel Detection of 454 Sequence Data with NextGENe (PDF)
Working with Mutation Scores in NextGENe software (PDF)
Working with Capture Data Application Note (PDF)
Request 30 day evaluation program
Variant Call Confidence Scoring
In order to simplify analysis NextGENe includes a mutation scoring system that is compatible with data from several sequencing platforms including the Roche GS FLX™ and FLX Titanium, Illumina Genome Analyzers and Applied Biosystems SOLiD™ Systems.
When analyzing next-generation sequencing data there are many variables that affect the accuracy of mutation calls:
- Coverage
- % of reads containing variants
- Directionality
- Homopolymer errors in Pyrosequencing
- Misalignments
NextGENe’s mutation scoring system makes analysis of large projects faster and easier by identifying the mutation calls that are likely to be false positives. It is flexible enough to account for several different sources of error and to ignore some of them if necessary. All types of data are treated equally (with the exception of the homopolymer score, used only for 454 data) so that data from different sequencing systems can be directly compared. The filtering tool allows for easy adjustment of sensitivity (allow lower scores) and specificity (allow only high scores) in mutation detection projects.
A Phred-like quality score is assigned to every mutation call, allowing the user to quickly identify the mutations that are most likely to be real. The software takes several variables into account in order to make an empirical estimate of the probability that any one mutation call is a true SNP or Indel. A score of 10 means there is approximately a 1 in 10 chance that the called mutation is the result of sequencing or alignment error while a score of 30 (the maximum score) means there is very little chance (0.1% or 1 in 1000) that it is not real. A low score indicates only that a potential mutation cannot be confidently distinguished from error based on the available data.
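The two data points quoted above (10 means 1 in 10, 30 means 1 in 1000) match the standard Phred relationship Q = -10 log10(P). A minimal sketch of that conversion follows; the function names are illustrative, and the cap at 30 is taken from the stated maximum score. How NextGENe derives the underlying probability is not shown here.

```python
import math


def error_probability(score):
    """Probability that a call is an error, given a Phred-like score."""
    return 10 ** (-score / 10)


def phred_score(p_error, max_score=30):
    """Phred-like score for an error probability, capped at the stated maximum."""
    return min(max_score, -10 * math.log10(p_error))


for q in (10, 20, 30):
    print(q, error_probability(q))
```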
The final mutation score is the product of several sub-scores that account for different sources of error:
- The Coverage Score (between 0 and 30) is based on the total number of reads and the number of reads with the mutation allele. 50 or more reads with the mutant allele results in the maximum score regardless of the total coverage, although 12 such reads are sufficient for a good (20+) score if the coverage isn’t high.
- The Read Balance Score (between 0 and 1) is based on the relative number of reads aligned in each direction. A mutation site with coverage in only one direction is less reliable because the quality scores within a read are different in the 5’ and 3’ ends.
- The Allele Balance Score (between 0 and 1) is based on similar principles. The relative directional balance for reads with and reads without the mutation are compared.
- The Homopolymer Score (between 0 and 1) penalizes indels that occur in homopolymer regions because they are more likely to be errors. Longer homopolymers have greater penalties because errors are more likely. This score is only used for Roche/454 data but it can be disabled.
- The Mismatch Score (between 0 and 1) penalizes mutation calls when several variants (even some failing the mutation filter) are found close together.
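As a rough sketch of how the sub-scores listed above combine, the snippet below multiplies a 0 to 30 coverage score by whichever 0 to 1 sub-scores are enabled. The class, field names and the `disabled` argument are hypothetical, and the sub-score values are invented inputs: the text does not say how each sub-score is calculated, only that the final score is their product.

```python
from dataclasses import dataclass


@dataclass
class SubScores:
    coverage: float              # 0 to 30
    read_balance: float = 1.0    # 0 to 1
    allele_balance: float = 1.0  # 0 to 1
    homopolymer: float = 1.0     # 0 to 1, 454 data only
    mismatch: float = 1.0        # 0 to 1


def overall_score(scores, disabled=()):
    """Product of the coverage score and every enabled 0-to-1 sub-score."""
    total = scores.coverage
    for name in ("read_balance", "allele_balance", "homopolymer", "mismatch"):
        if name not in disabled:
            total *= getattr(scores, name)
    return total


# Illustrative values only: a homopolymer sub-score of 0.42 is chosen so that
# a coverage score of 19 drops to about 8, as in the first example that follows.
call = SubScores(coverage=19, homopolymer=0.42)
print(round(overall_score(call)))                   # with the homopolymer score
print(overall_score(call, disabled={"homopolymer"}))  # with it disabled
```

Because every factor other than coverage lies between 0 and 1, disabling a sub-score can only leave the overall score unchanged or raise it.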
Examples:

The overall score for this mutation is reduced from 19 to 8 when the homopolymer score is used because the mismatch is an indel in a homopolymer region present in pyrosequencing data. Several insertions and deletions in the same homopolymer can be seen in the alignment.

The mutation scores for these three sites are very low because of the Allele Balance Score. The mutations occur in all of the reads aligned in one direction and none of the reads aligned in the opposite direction which indicates a directional bias possibly due to PCR artifacts. When the Allele Balance Score is not used the overall scores are 21, 22, and 22.

Targeted capture data is often highly directional; at this position all of the reads are aligned in one direction. Disabling the Read Balance Score for this type of data is recommended to avoid lowering every mutation score. At this position the mutation score increased from 19 to 30 when the Read Balance Score was disabled. The Read Balance Score also lowered the total score for many other positions because the data is also directionally biased there.

The mutation calls caused by barcode sequences in these reads all have low scores because the Mismatch Score is very low.
Download application note: Working with Mutation Scores in NextGENe software
Variant Comparison Tool
To facilitate review of multiple Variant Discovery projects, NextGENe includes a Variant Comparison Tool. Up to ten projects that used the same reference can be compared at one time. The software automatically determines and color codes similarities as well as differences found in each patient. Additional filtering options are available to simplify the review process.

Automated comparison of two completed projects displays variant similarities and differences between patients. Up to 10 projects can be compared at one time with this NextGENe software tool.
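Conceptually, this kind of comparison amounts to set operations over the variant calls of each project. The sketch below is only an illustration of that idea, not the tool's logic: the function name, the tuple layout `(chromosome, position, reference, alternate)` and the sample variants are all assumptions.

```python
from collections import defaultdict

MAX_PROJECTS = 10  # limit stated in the text


def compare_projects(projects):
    """Split variants into those shared by all projects and those found in only some.

    `projects` maps a project name to a set of (chrom, pos, ref, alt) tuples,
    all called against the same reference.
    """
    if len(projects) > MAX_PROJECTS:
        raise ValueError("at most 10 projects can be compared at one time")
    seen = defaultdict(set)
    for name, variants in projects.items():
        for variant in variants:
            seen[variant].add(name)
    shared = {v for v, names in seen.items() if len(names) == len(projects)}
    partial = {v: sorted(names) for v, names in seen.items()
               if len(names) < len(projects)}
    return shared, partial


# Invented example data for two patients.
projects = {
    "patient_A": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
    "patient_B": {("chr1", 100, "C", "T"), ("chr3", 300, "A", "AT")},
}
shared, partial = compare_projects(projects)
print("shared:", shared)
print("differences:", partial)
```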
Working with Variant Comparison Tool Application Note (PDF)
NextGENe Viewer/Browser
A comprehensive SNP & INDEL viewer is included in the NextGENe software suite for massively parallel sequencing systems. This tool, developed at the request of and in collaboration with several prominent researchers, permits easy gene annotation, simple navigation within the genome, rapid viewing of variations in nucleotides and amino acids, and quick links to the NCBI and dbSNP databases. It can also display and export consensus sequence, giving biologists an easy-to-use interface for reviewing the massive amounts of data generated by next generation sequencing systems.
SNP Analysis

INDEL Detection

Structural Variant Detection
Structural variations (SVs), including insertions, deletions, inversions, gene fusions and copy number variants, occur frequently across the human genome and have been shown to be important in a number of diseases. Although there are several useful technologies for detecting SVs, all have their limitations. Paired-end read mapping has been used to detect shorter deletions and to home in on breakage sites, but it is unable to detect structural variants larger than the library size.

Theoretical alignment of reads to structural variants. Dashed lines represent non-aligning sequence data.
NextGENe makes it easy to find and map structural variants with sequence data from the Roche FLX Titanium system and with short paired read data from the Illumina GAII and SOLiD System. NextGENe’s new pair linking technology allows large mismatches when aligning to the genome reference in order to detect SVs. NextGENe then displays the information about those regions in a special structural variant report.
Short paired reads such as those generated by the Illumina GAII and AB SOLiD System can be utilized by NextGENe to detect the SV’s. NextGENe’s analysis wizard allows users to choose between short paired reads and longer reads to perform structural variant detection.

Above is a detailed view showing how NextGENe highlights the mismatched portion of a read. The above is an example of a detected fusion gene.
For long reads NextGENe generates pseudo-paired reads for the sequence aligned to these regions by breaking the original reads into pairs. These are then aligned to the reference genome mapping the structural variants with unerring results. Detailed information on where these reads align is available in NextGENe’s Paired Read reports.
Structural Variant Detection Application Note (PDF)
Paired End Read Merging provides Sanger Quality sequence
Short read lengths produced by Next Generation sequencers such as the Illumina Genome Analyzer can create difficulty for accurate analysis of data. Additionally, relatively high error rates (compared to other technologies such as Sanger sequencing) further complicate the analysis of next-gen sequencing data. For these reasons, lengthening the short reads prior to analysis and statistically correcting or removing errors is a valuable tool to improve alignment accuracy, indel detection and assembly. A novel, highly accurate method to elongate reads, utilizing paired end reads, has been developed by SoftGenetics for its NextGENe software.
Sequencing Paired End reads is a useful technique which produces reads in pairs such that the two reads of each pair are a known distance from each other in the genome. This is accomplished by preparing DNA fragments of a certain length (200 bp, for example). This fragment size, or library size, is the distance between each pair of reads. Sequencing is then done from each end of the fragment, producing two paired reads. NextGENe’s paired end merging technique takes advantage of paired end information, along with the additional coverage from sequenced overlapping DNA fragments, to produce long reads spanning the entire library size with an extremely low error rate.

Multiple Cycles of Condensation can be used to elongate the paired reads, forming an overlap
Paired end reads can be merged by elongating the paired reads to the point that there is overlap between the two reads. This allows the paired reads to be joined together to form one continuous, longer read. The number of elongation cycles required depends on the read lengths and the library size. Each cycle of Condensation will generally increase the average read length to 1.6 times the original length for shorter (<=36 bp) reads and to 6 bases less than twice the original length for longer (>36 bp) reads. These values may be reduced when the average depth of coverage is less than 30x. A single cycle of elongation of 75 bp reads from a 200 bp library, for example, allows the paired reads to overlap and be linked together. For 35 bp reads from a 200 bp library, three cycles of elongation allow the paired reads to be linked. Reads should be extended until a significant portion of the paired reads (roughly 15% of the elongated read length) is expected to overlap.

Average read lengths after elongation for varying original read lengths
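The elongation arithmetic quoted above can be turned into a small calculator. This is a back-of-the-envelope sketch, not NextGENe's logic: it assumes each cycle's average output length feeds the next cycle, that the 36 bp threshold applies to the current length, that a read cannot grow past the library size, and that coverage is at least 30x. The function names are illustrative.

```python
def elongate(length):
    """Average read length after one Condensation cycle, per the quoted figures."""
    if length <= 36:
        return 1.6 * length
    return 2 * length - 6


def cycles_to_merge(read_len, library_size, min_overlap_frac=0.15, max_cycles=10):
    """Number of cycles until the two reads of a pair overlap by about 15%.

    Returns (cycles, elongated_length), or None if max_cycles is not enough.
    """
    length = float(read_len)
    for cycle in range(1, max_cycles + 1):
        # Assumption: an elongated read cannot outgrow its own fragment.
        length = min(elongate(length), library_size)
        overlap = 2 * length - library_size
        if overlap >= min_overlap_frac * length:
            return cycle, length
    return None


print(cycles_to_merge(75, 200))  # 75 bp reads, 200 bp library
print(cycles_to_merge(35, 200))  # 35 bp reads, 200 bp library
```

Under these assumptions the calculator reproduces both examples in the text: one cycle for 75 bp reads and three cycles for 35 bp reads from a 200 bp library.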
Download Application Note:
Merging Paired End Reads Application Note (PDF)
SNP & INDEL detection of Pyrosequencing Reads for the Roche Genome Sequencer FLX™ System
The new NextGENe module specifically addresses the homopolymer related errors of the FLX system by utilizing the FLX’s high coverage to statistically polish and correct the inherent system errors. Additionally NextGENe’s exclusive alignment tool incorporates an automated flexibility tool that permits base pair mis-match between the sample sequences and reference in addition to the absolute alignment value. This allows NextGENe to accurately align reads with long INDELs and identify them as mutations.
Increased accuracy by further correction of the FLX SNP & INDEL data can be accomplished through cross platform alignments. The consensus FLX sequence can be easily compared to either Sanger or Illumina® Genome Analyzer sequence within NextGENe in order to further correct homopolymer related mis-calls and errant alignment. This new consensus can be saved as a reference file for future analyses.
NextGENe’s SNP detection application for data from the Genome Sequencer FLX System is able to effectively identify single nucleotide polymorphisms (SNPs) by accurately aligning sample reads with a reference. Additionally, the NextGENe Alignment tool is designed to match the sequence reads to a user-defined annotated reference sequence. Multiple methods are available for aligning the reads to the reference. Once the reads have been aligned, SNPs and Indels are highlighted for quick identification.
The Sequence Alignment Tool also provides information about amino acid changes, exon-intron boundaries, and copy numbers, and assists with the determination of methylation sites. Interactive reports displaying the variations and statistics can be produced and exported. Variations identified by the software can be linked directly to NCBI dbSNP database.

NextGENe FLX SNP & INDEL viewer provides “at a glance” information on the analysis results. Software’s color-coding differentiates between Known and Novel variants, indicates depth of coverage, CDS region, as well as mRNA. The software also provides Gene Name annotation as well as resultant changes in Amino Acids.
Download Application Note:
SNP & INDEL Detection from FLX System with NextGENe (PDF)
De novo Assembly
- Automatically forms anchor sequences
- Paired Read (Mate Pair) Assembly forms contigs greater than 100kb (data specific)
- Assembles Illumina Genome Analyzer, SOLiD System and Roche FLX data
- Completely automated, no script writing necessary
- Provides critical review and documentation of assembly results
E. coli Assembly Results by Instrument System:
| | Illumina® Genome Analyzer | Roche Genome Sequencer |
| Contigs | 347 | 492 |
| N50 | 204,665 | 19,549 |
| Avg. Contig Size | 13,263 | 9,238 |
| Max. Contig Size | 560,819 | 60,679 |
| Correct Match % | 96.9 | 95.3 |
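For readers unfamiliar with the N50 row in the table above: N50 is the contig length at which, going from the longest contig downward, the running total first reaches half of all assembled bases. A minimal sketch of the calculation follows; the contig lengths in the example are invented and are not the E. coli data.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs supplied")


# Invented lengths: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```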
De novo sequence assembly of the short reads from next generation genome analyzers presents many challenges. With many of the current techniques, it is difficult to assemble the short reads into large contigs of 1 to 30 kb. These techniques often create many false alignments due to two major issues: short reads with high base calling errors and ambiguity within the genome. The short reads with SNPs and Indels are often discarded, which is problematic for SNP/Indel detection as well as for the determination of copy number variations in applications such as chromatin immunoprecipitation (ChIP), Digital Gene Expression studies (DGE) and transcriptome analyses.
NextGENe software was developed to assist researchers in resolving these inherent issues in analyzing next generation sequencing data. NextGENe’s unique Condensation Tool™ is used to polish and lengthen short sequence reads into fragments that are more unique and accurate. The Assembly Tool is then used to assemble the short reads into contigs of 0.5 kb to greater than 100 kb.
De novo Assembly
NextGENe offers multiple methods for de novo Assembly in order to provide accurate assemblies of reads from different instruments and in different forms (mate pair reads vs. single reads). The first method uses a de Bruijn graph technique. Ideal for short reads of Illumina and SOLiD platforms, this method is capable of utilizing paired reads information to assist with proper assembly of large contigs, but can also be used without paired end data. This assembly method is quite memory intensive, and some large datasets require 32GB of RAM to assemble with the de Bruijn method. When choosing to use the de Bruijn method of assembly, the condensation steps are omitted.
NextGENe also offers an alternative method of assembly that is less memory intensive than the de Bruijn method and may be better suited to some sets of data. When multiple cycles of condensation are completed and coverage information is not needed, the Method 2 option can be used for assembly of large contigs.
For the longer reads of Roche/454 datasets, NextGENe offers a 454 Assembler. This assembly method is tailored to accurately assemble 454 reads. The error correction feature from the Condensation step can be used prior to assembly to correct homopolymer errors and improve accuracy and length of assembled contigs.

Condensation Assembly Tool elongated the 35 bp reads to approximately 60 bp while removing many of the random errors produced by the instrument.
Paired Read (Mate Pair) Assembly with NextGENe
NextGENe uses a de Bruijn graph method for assembly of paired read (mate pair) data from Next Generation Sequencers such as the SOLiD System and the Illumina Genome Analyzer (Solexa). This method involves using short words, not entire reads, as indexes to develop the graph, which reduces redundancy. Reads are mapped as a path along the graph, with nodes representing overlaps and arcs between nodes representing links. This assembly technique for paired reads (mate pairs) is able to accurately produce large contigs greater than 100 kbp from short next generation sequencing reads.
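To make the graph idea concrete, here is a textbook-style de Bruijn sketch, not NextGENe's assembler. Each k-base word becomes an arc between two nodes that are its (k-1)-base prefix and suffix, and a contig is read off by following unambiguous arcs. The function names, the value of k and the toy reads are assumptions, and paired-read information and error handling are left out.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose arcs are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            graph[word[:-1]].add(word[1:])
    return graph


def walk(graph, start):
    """Extend a contig from `start` while the path is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break  # stop rather than loop around a repeat
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATG"))
```

Because the graph is indexed by short words, the three overlapping reads collapse onto a single path, which is the redundancy reduction the paragraph above refers to.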

NextGENe is able to produce large contigs, many between 1 kbp and 100 kbp, from paired read data. Additionally, several assembled contigs are generated that exceed 100 kbp. Results shown were obtained using genomic data from E. coli, which has a total genome size of roughly 4.6 Mbp.
The use of paired-end or mate-pair sequence reads is a valuable tool for constructing de novo assemblies from short sequence reads. Next Generation Sequencing platforms have allowed for sequencing paired reads in a shorter time span for lower cost. However, the volume of data produced in the form of short reads with high error rates presents a challenge for data analysis.
Paired read analysis involves the use of DNA fragments containing two regions of sequenced DNA separated by an unsequenced insert of known length. Paired reads enhance assembly of short reads by improving the specificity of the reads since single short (25-36bp) sequencing reads, as produced by next generation technologies, are not significantly unique in the genome for accurate assembly.
De novo Assembly Application Note (PDF file)
De novo Assembly of SOLiD Sequence Reads Application Note (PDF file)
Paired Read Assembly Application Note (PDF file)
De Bruijn Assembly Application Note (PDF file)
Whole Genome Alignment
NextGENe employs a modified Burrows-Wheeler transform alignment method to generate fast, accurate and highly informative alignments of large genomes such as human, rat, mouse and others.
NextGENe’s whole genome alignment method is the first to align reads from the Roche Genome Sequencer FLX System, which often contain many indels due to homopolymer errors, to a whole genome reference with high speed. The whole genome alignment algorithm is also capable of quickly aligning SOLiD™ System and Illumina Genome Analyzer data. Additionally, NextGENe’s whole genome alignment tool features complete annotation of the reference.
The whole genome alignment algorithm aligns reads to the whole genome by matching seeds smaller than the read length and then extending the alignment to find the best matching position for the whole read. This allows for the alignment of long reads and reads with indels.
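The seed-and-extend strategy described above can be sketched in a few lines. This simplified version makes several assumptions: a plain dictionary of seeds stands in for the Burrows-Wheeler index, extension only counts mismatches (so it does not handle the indels the real aligner supports), and the function names, seed length and sequences are invented for illustration.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every seed-length word in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index


def align(read, reference, index, seed_len, max_mismatches=2):
    """Return (start, mismatches) for the best placement of `read`, or None."""
    best = None
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the full read would run off the reference
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTTAGCCGATAGGCTTAACGGATCC"
index = build_index(reference, seed_len=6)
# A 12-base read taken from position 8, with its third base changed from A to T.
read = reference[8:10] + "T" + reference[11:20]
print(align(read, reference, index, seed_len=6))
```

In this example the first seed of the read matches elsewhere in the reference, but extending that candidate produces too many mismatches, so the placement at position 8 with a single mismatch is kept.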
Typical Processing Time using desktop system:

Upon completion of a whole genome alignment project the results are automatically displayed in a single Sequence Alignment window. The Sequence Alignment window displays coverage information, aligned reads, complete annotations and provides access to reports. NextGENe’s whole genome alignment function includes annotation information:

Whole genome alignment results are displayed in the sequence alignment window. The top pane uses gray regions to indicate coverage across the genome. As shown in this figure, the view can be zoomed in to show a detailed view of a region. The blue, gold and green lines are used to indicate gene, CDS and mRNA locations, respectively. Tick marks are used to indicate SNP locations, with blue marks indicating novel SNPs, purple indicating known SNPs and green indicating negative SNPs. The bottom pane shows aligned reads with SNP positions highlighted. The middle pane provides the reference and consensus nucleotide sequences as well as the amino acid sequences. The gene name is also provided.
A mutation report lists all the detected mutations, provides annotation information for each position as well as coverage and allele frequencies. The report can be easily edited, saved and exported in multiple formats.

The mutation report lists all identified mutations and provides information for each including the gene name, chromosome position, reference nucleotide, coverage, allele frequencies and dbSNP identification. Clicking on the dbSNP ID provides a direct hyperlink to the NCBI website.
Download Whole Genome Application Note
RNA-Seq Analysis
- Allows alternative splicing analysis
- Provides accurate copy number even in presence of several variations
- Expression ratio of multiple alleles differing by SNPs/Indels
- Able to accurately align reads at exon-intron boundaries.
Analyzing an organism’s transcriptome with Next Generation Sequencing technology presents several challenges, including a high level of sequence variation relative to the reference genome due to SNPs/Indels, multiple transcripts for each gene and high variability in expression rates. NextGENe’s proprietary RNA-Seq alignment algorithm can accurately align reads spanning exon junctions. This technique uses a four-step methodology: aligning reads to the genome, aligning reads to the exon junctions, detecting and linking exons, and aligning to a transcript reference.
Following the alignment, transcripts are identified, mutations are detected and expression levels are reported.

NextGENe’s RNA-Seq analysis module accurately aligns reads spanning exon junctions, detecting and linking exons after aligning to transcript reference.
Transcriptome Analysis Application Note (PDF file)
SOLiD Transcriptome Application Note (PDF file)
ChIP-Seq Analysis
NextGENe provides a software module specifically designed for ChIP-Seq analysis. This application aligns sample reads to a reference sequence, utilizes coverage information for the detection of peaks to indicate protein binding sites and provides a specialized report to provide information about each peak region.
Once NextGENe completes alignment of the samples to the reference, the results are automatically displayed in the Sequence Alignment window. When “ChIP-Seq” is selected as the Application Type, automatic peak detection is applied during the initial processing and peak regions are indicated in the Sequence Alignment Window upon project completion. Brown bars are shown to graphically illustrate regions where peaks have been identified.
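Coverage-based peak detection of the kind described above can be illustrated with a simple threshold scan. This is a generic sketch rather than NextGENe's peak caller: the depth threshold, the minimum width, the function names and the read positions are all hypothetical.

```python
def coverage(read_starts, read_len, ref_len):
    """Per-base depth for reads of equal length placed at the given start positions."""
    depth = [0] * ref_len
    for start in read_starts:
        for i in range(start, min(start + read_len, ref_len)):
            depth[i] += 1
    return depth


def call_peaks(depth, min_depth, min_width=1):
    """Return (start, end, max_depth) for each run of bases at or above min_depth."""
    peaks, start = [], None
    for i, d in enumerate(depth + [0]):  # trailing 0 closes a peak at the end
        if d >= min_depth and start is None:
            start = i
        elif d < min_depth and start is not None:
            if i - start >= min_width:
                peaks.append((start, i, max(depth[start:i])))
            start = None
    return peaks


# Invented alignment: five reads pile up near position 12, one lies alone at 40.
depth = coverage([10, 11, 12, 12, 13, 40], read_len=5, ref_len=60)
print(call_peaks(depth, min_depth=3))
```

The isolated read at position 40 never reaches the threshold, so only the pile-up is reported as a candidate binding region (end coordinate exclusive).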

ChIP-Seq Analysis Application Note (PDF file)
Metagenomic Studies of Viral and Bacterial Infections using NextGENe
Quick and accurate identification of viral and bacterial pathogens using next gen sequencing is valuable for both treatment and research. Early identification of an infectious agent can lead to a targeted treatment. Fast identification of novel pathogens (new viruses, bacteria, and other microorganisms) such as swine flu and methicillin-resistant Staphylococcus aureus (MRSA) will speed up vaccine and new drug development. Drug efficacy studies are becoming easier through the use of metagenomic analysis of viral, bacterial and human sequences. The viral and bacterial concentration can be determined from the sequence reads after human background is subtracted using NextGENe software.
The traditional methods for detecting and identifying pathogens require culturing bacteria and viruses or detecting viral antigens. These procedures have several problems: they are costly in both time and money, and their sensitivity is limited because some viruses and bacteria are very difficult or impossible to culture, especially from small samples. Nucleic Acid Amplification Tests (NATs) are a newer approach. Most involve PCR or loop-mediated isothermal amplification (LAMP) of DNA or RNA. Those that use multiplexed PCR or DNA microarrays offer a great advantage over the older methods because they are faster, more sensitive, and introduce less bias, but they are not without their own problems. The tests are limited to a relatively small number of candidate pathogens, and while they may detect new strains, they aren’t always useful for characterizing them.
Next generation sequencing technologies make it possible to obtain millions of reads in a single run. NextGENe is able to quickly separate most host genome contamination from samples before aligning the remaining reads to bacterial or viral genomes for identification.

Alignment of several bacterial genomes following removal of host contamination with NextGENe software
Download Application Note: Metagenomic Studies of Viral and Bacterial Infections
Serial Analysis of Gene Expression (SAGE) Studies
- Expression Report (Gene count, ambiguities)
- New Genes are listed separately
- Search Tool
- Display Biological Information for each tag
Next Generation Sequencing systems such as the AB SOLiD™ System, Roche Applied Science’s Genome Sequencer FLX System (454 Sequencing) and the Illumina® Genome Analyzer utilizing Solexa Sequencing Technology promise to provide breakthroughs in Digital Gene Expression studies like SAGE (1). SAGE technology measures the counts of sequence tags relative to the genes of interest. The SAGE tags are produced by the restriction enzymes which cut the cDNA while the poly(A) end is bound to the biotin-labeled dT primer. The biotin is captured on beads so that biotin-labeled cDNA strands are bound and all other DNA strands are washed away. This method, in addition to similar techniques of MicroSAGE, LongSAGE, RL-SAGE and SuperSAGE, offers effective tools to analyze absolute expression numbers by counting tags.
The NextGENe software package takes full advantage of the short sequencing reads and has tools for analyzing the SAGE tags. SAGE Libraries are available that contain lists of sequence tags associated with particular genes. NextGENe can load these libraries as a reference and align the sequence reads to the appropriate sequence tags. The alignment to the tag library is only performed in the forward orientation of the sequences; no reverse complementation is implemented. Digital gene expression reports are created to show the sequence of each tag, the coverage, gene names, and the location in the genome. New gene tags that are not in the library are also reported. These tags can be added to the reference library as novel tags.

The Sequence Alignment Tool has a Whole Genome View at the top of the screen, which shows each sequence of the library. Placing the mouse over the library activates a yellow box containing the biological information for the tag that is currently at the cursor. The bottom of the screen contains all reads as they have been aligned to the library.

Novel sequences are contained in this dataset. The sample reads that match sequences contained in the library are aligned appropriately, but novel sequences that are not contained in the library are also recorded. By setting a minimum threshold for new sequences, all sequences found at a frequency above the threshold are added to the end of the reference library, sample reads are aligned to them, and the Expression report shows these sequences as New Genes.
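The tag-counting workflow described in this section can be sketched as follows. This is only an illustration under stated assumptions: reads are matched to library tags by exact comparison of their first `tag_len` bases in the forward orientation, and the function name, tags, gene names and threshold are invented. The real alignment and report are richer than this.

```python
from collections import Counter


def sage_expression(reads, library, tag_len, novel_min_count):
    """Count forward-orientation tags; report known tags and frequent novel ones.

    `library` maps a tag sequence to a gene name. No reverse complement is tried.
    """
    counts = Counter(read[:tag_len] for read in reads if len(read) >= tag_len)
    known = {tag: (gene, counts.get(tag, 0)) for tag, gene in library.items()}
    novel = {tag: n for tag, n in counts.items()
             if tag not in library and n >= novel_min_count}
    return known, novel


# Invented library and reads.
library = {"CATGAAACCC": "GENE_A", "CATGTTTGGG": "GENE_B"}
reads = (["CATGAAACCC"] * 3 + ["CATGTTTGGG"]
         + ["CATGCCCAAA"] * 2 + ["CATGGGGTTT"])
known, novel = sage_expression(reads, library, tag_len=10, novel_min_count=2)
print(known)
print(novel)
```

The tag seen twice clears the minimum threshold and would be listed as a new gene, while the tag seen only once is left out.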
Serial Analysis of Gene Expression (SAGE) Application Note (PDF file)
Request Trial Data Analysis
Request a 30 day software trial
Small RNA Quantification and Discovery
Next Generation sequencing technologies such as the Applied Biosystems SOLiD™ System, the Illumina® Genome Analyzer (Solexa), and the Genome Sequencer FLX System from Roche Applied Science (454 Life Sciences) present promising opportunities for evaluating the expression of known small RNAs as well as revealing novel small RNAs. However, the volume of data and high error rates of these systems require efficient and effective software for analysis.
NextGENe’s small RNA analysis tool can be used to determine expression levels of known small RNAs as well as for discovery of novel small RNAs. Reads are aligned to a whole reference genome to determine transcript locations. Regions of high coverage are used to indicate transcript regions. These regions of the genome can be saved and used as a reference transcript sequence. Samples are then aligned to the transcript reference and coverage counts are made for each transcript.
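The idea of marking high-coverage regions as candidate transcripts can be illustrated with a short Python sketch. The parameters `min_coverage` and `min_length`, the function name, and the coverage values are assumptions for illustration; the actual criteria used by NextGENe are not described here.

```python
def find_transcript_regions(coverage, min_coverage, min_length):
    """Return half-open (start, end) intervals of sustained high coverage.

    coverage: list of per-base read depths along the reference.
    A region is kept when depth >= min_coverage for at least
    min_length consecutive positions (both thresholds are assumed).
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_coverage:
            if start is None:
                start = pos            # a high-coverage run begins
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None               # the run ended
    # Close a run that extends to the end of the reference.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions

# Hypothetical depths: the single-base spike at position 8 is too short to keep.
coverage = [0, 1, 12, 15, 14, 13, 0, 0, 20, 2, 11, 11, 11]
regions = find_transcript_regions(coverage, min_coverage=10, min_length=3)
print(regions)
```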

After the Peak Detection Tool is used, the Sequence Alignment window displays brown ticks to indicate regions that meet transcript requirements. This figure shows two small RNA transcripts located within Gene 6241476. Blue arrows indicate gene locations in the reference file. The green and gold arrows below the gene indicator identify the mRNA and coding sequence, respectively.

Once NextGENe completes aligning sample file(s) to the transcript reference file, the results are shown in the sequence alignment window which provides a graphic representation of expression levels for each transcript. Red lines indicate transcript boundaries. Sequence reads that align with each transcript are shown beneath where they align. Gray bars indicate coverage (expression level).

The Expression report displays quantitative information about each segment (transcript) including its length, the maximum and average count numbers and the read counts.
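A per-transcript summary of this kind could be computed as in the Python sketch below. It assumes that the "count numbers" refer to per-base coverage within the transcript and that a read is counted when it overlaps the transcript by at least one base; the field names and coordinates are hypothetical.

```python
def expression_report(transcripts, reads):
    """Summarize coverage per transcript.

    transcripts: dict mapping name -> (start, end), half-open and non-empty.
    reads: list of aligned read intervals (start, end), half-open.
    """
    report = []
    for name, (t_start, t_end) in transcripts.items():
        depth = [0] * (t_end - t_start)
        read_count = 0
        for r_start, r_end in reads:
            lo, hi = max(r_start, t_start), min(r_end, t_end)
            if lo < hi:                       # read overlaps this transcript
                read_count += 1
                for pos in range(lo, hi):
                    depth[pos - t_start] += 1
        report.append({
            "transcript": name,
            "length": t_end - t_start,
            "max_count": max(depth),
            "avg_count": sum(depth) / len(depth),
            "read_count": read_count,
        })
    return report

# Hypothetical transcript and reads; the third read falls outside the transcript.
transcripts = {"t1": (0, 10)}
reads = [(0, 5), (3, 8), (20, 25)]
report = expression_report(transcripts, reads)
print(report[0])
```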
Request 30-day Trial
Small RNA Application Note (PDF file)
Sequence Analysis Using Barcode/Index Tags of Pooled Samples
Next generation sequencing technologies such as the Genome Sequencer FLX by Roche Applied Science (454 Sequencing), the SOLiD™ System from Applied Biosystems and the Illumina Genome Analyzer have drastically reduced sequencing costs while increasing speed and the quantity of information gathered. These technologies give reliable sequence read-outs of 25-75 bp for Illumina and SOLiD reads and up to 500 bp for Roche/454 reads, with approximately 1-100 million reads per sequencing run. For certain applications aimed at sequencing small genomes or select regions of DNA, this amounts to an excess of sequencing data, producing greater coverage than necessary. The use of multiplex sample tags (or barcodes) allows for optimization of 2nd generation sequencing technologies by pooling samples and sequencing multiple samples in parallel. When barcode tags are used for multiplexing, software is needed to accurately parse sample files according to these tags prior to analysis.
NextGENe provides a biologist-friendly tool to parse instrument output files according to sequence tags. This tool has been designed for flexibility, to parse files according to tags within the read name or within the sequence read. The software is able to utilize user-provided information about sample tags or can detect tags automatically.
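To make the two parsing modes concrete, here is a minimal Python sketch of barcode-based file parsing with user-provided tags. It is an illustration only: the function name `demultiplex`, the `name#TAG` read-name convention, exact prefix matching, and the example barcodes are assumptions, not NextGENe's file formats or matching rules.

```python
from collections import defaultdict

def demultiplex(records, barcodes, tag_in_name=False):
    """Split reads into per-sample lists according to their barcode.

    records: iterable of (read_name, sequence) pairs.
    barcodes: dict mapping tag sequence -> sample name.
    tag_in_name: if True, the tag is the text after the last '#' in the
        read name (assumed convention); otherwise the tag is expected at
        the start of the sequence and is trimmed off.
    """
    samples = defaultdict(list)
    for name, seq in records:
        sample, trimmed = None, seq
        if tag_in_name:
            sample = barcodes.get(name.rsplit("#", 1)[-1])
        else:
            for tag, sample_name in barcodes.items():
                if seq.startswith(tag):
                    sample, trimmed = sample_name, seq[len(tag):]
                    break
        key = sample if sample is not None else "unassigned"
        samples[key].append((name, trimmed))
    return dict(samples)

# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
by_sequence = demultiplex(
    [("r1", "ACGTTTTT"), ("r2", "TGCAGGGG"), ("r3", "NNNNAAAA")], barcodes)
by_name = demultiplex([("r4#ACGT", "CCCC")], barcodes, tag_in_name=True)
print(by_sequence)
print(by_name)
```

Reads whose tag is not recognized are collected under "unassigned" rather than dropped, so their counts remain available for inspection.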
Following file parsing into individual sample files, NextGENe software can be used for a variety of applications such as SNP and Indel detection, de novo assembly, transcriptome analysis, SAGE studies, ChIP-Seq and small RNA discovery and quantification.

Read distribution for two lanes of Illumina multiplex data with sample tags in read names is shown. Evaluating the number of reads with each tag allows for clear differentiation between true tags and tags that are the result of sequencing errors. For this data, Sample tags 1-8 were used for file parsing. The clear drop-off in read count between Sample ID 8 (the least common tag used for parsing) and ID 9 (the next most common tag) illustrates the distinct boundary between true tags and tags caused by error.
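One simple way to locate such a drop-off automatically is to rank tags by read count and cut at the largest ratio between neighbouring counts. The Python sketch below shows this heuristic; it is an assumed approach for illustration, not a description of how NextGENe detects tags, and the counts are invented.

```python
def split_true_tags(tag_counts):
    """Separate likely true tags from likely error tags.

    tag_counts: dict mapping tag -> read count.
    Tags are ranked by count and split at the largest ratio drop
    between consecutive counts (an assumed heuristic).
    """
    ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [tag for tag, _ in ranked], []
    best_cut, best_ratio = 1, 0.0
    for i in range(1, len(ranked)):
        # Guard against division by zero for tags with no reads.
        ratio = ranked[i - 1][1] / max(ranked[i][1], 1)
        if ratio > best_ratio:
            best_cut, best_ratio = i, ratio
    true_tags = [tag for tag, _ in ranked[:best_cut]]
    error_tags = [tag for tag, _ in ranked[best_cut:]]
    return true_tags, error_tags

# Hypothetical counts: three real sample tags and two low-count error tags.
counts = {"T1": 9000, "T2": 8500, "T3": 8000, "E1": 40, "E2": 12}
true_tags, error_tags = split_true_tags(counts)
print(true_tags, error_tags)
```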
Bar-Coding Application Note
Deep Sequencing Analysis and Low Frequency SNP/Mutation Detection
Next Generation sequencing platforms such as the Genome Sequencer FLX from Roche Applied Science (454 Sequencing), the SOLiD™ System from Applied Biosystems and the Illumina® Genome Analyzer have drastically reduced sequencing costs while producing data at an increased speed and quantity compared to Sanger methods. The large data volume allows for the sequencing of genomes or genomic regions at very high coverage (5,000x – 20,000x). This high coverage makes it possible to detect low frequency mutations such as somatic mutations and other rare variants, for example in virus-infected samples. Yet the high error rates of these technologies (1-3%) make it difficult to distinguish instrument errors from true low frequency mutations. Because of this, unique software algorithms are required to produce accurate analysis of deep sequencing data.
NextGENe is able to analyze high coverage short read data to accurately identify low frequency variants while reducing false positive detections. NextGENe’s unique Condensation Tool is used to reduce instrument errors while increasing read length and reducing read count. Following Condensation, the Sequence Alignment Tool is used to accurately align reads to the reference sequence and highlight variations from the reference. The Sequence Alignment Tool evaluates base calls that differ from the reference to identify possible biases and so distinguish false positives from true low frequency variants. Replicate control samples can be used to evaluate the linearity of mutation calls as shown in the following figure.

Linearity Plot for all mutations found in the replicate samples s_1 and s_2 from pooled samples of 364 patients, in 2 channels, is shown. Both samples were condensed and aligned to the reference, and the mutation percentages, calculated using the number of original reads, are plotted for all mutation calls. The mutation percentages are linear (R² = 0.9999), indicating the accuracy of the system and the software. All mutations greater than 1% were found in both samples.
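For readers who want to reproduce this kind of linearity check on their own replicate data, the coefficient of determination for a simple linear fit can be computed as the squared Pearson correlation. The Python sketch below uses invented mutation percentages, not the data from the figure, and the function name `r_squared` is an assumption.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences.

    For a simple linear regression of y on x this equals R².
    Both inputs must have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical mutation percentages for the same calls in two replicates.
replicate_1 = [1.0, 2.0, 5.0, 10.0]
replicate_2 = [1.1, 1.9, 5.2, 9.8]
print(round(r_squared(replicate_1, replicate_2), 4))
```

A value close to 1 indicates that the two replicates report nearly proportional mutation percentages.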

NextGENe provides the SNP Compare Tool to easily compare the mutation calls made in two similar projects, such as case and control samples. When a mutation call is made in one project but not the other, the “Mutation Call” column is blank for the project that did not show the mutation. Additional columns including chromosomal location, gene location and allele frequencies can be shown when selected. Data shown is a selection of the SNP Compare Report produced for replicate samples.
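The comparison logic described here, including the blank entry for a project that lacks a call, can be sketched in a few lines of Python. The data layout (position mapped to a call string) and the example calls are assumptions for illustration, not the SNP Compare Report format.

```python
def compare_calls(project_a, project_b):
    """Build comparison rows for two sets of mutation calls.

    project_a, project_b: dicts mapping reference position -> call string.
    Returns a list of (position, call_a, call_b) sorted by position;
    a missing call is represented by an empty string.
    """
    rows = []
    for pos in sorted(set(project_a) | set(project_b)):
        rows.append((pos, project_a.get(pos, ""), project_b.get(pos, "")))
    return rows

# Hypothetical calls for a case and a control sample.
case = {1045: "A>G", 2310: "C>T"}
control = {1045: "A>G", 5120: "G>A"}
rows = compare_calls(case, control)
for row in rows:
    print(row)
```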
Deep Sequencing Application Note
Targeted Enrichment Sequence Analysis
Targeted Genome Enrichment capture techniques reduce the cost and improve the efficiency of sequencing by utilizing the flexibility of microarray technology or other capture technologies and the high data throughput of Next Generation sequencing (1). Allowing researchers to focus only on sequencing regions of interest creates increased opportunity for disease study by providing greater coverage of targeted regions at reduced time and cost.
NextGENe’s Condensation Tool significantly enhances analysis of captured data by reducing system errors while elongating short reads from the Illumina GA and SOLiD systems and correcting homopolymer errors of the Roche Genome Sequencer. Interference from regions outside the target region is eliminated when the reads are aligned to the whole genome or to predefined portions of the genome for their best matches. The interfering reads, about 40% of total reads, are spread across the whole human genome at extremely low concentration and are ignored in the condensation step.
In a comparative study of NextGENe software and GS Reference Mapper application of the Roche Genome Sequencer Data Analysis Software, NextGENe mapped more reads to the Human Genome following capture at higher stringency settings. (See Evaluation of Targeted Enrichment Sequence Analysis Using NextGENe Software and GS Reference Mapper)
NextGENe’s Sequence Alignment module is used to accurately map reads to a reference sequence and detect variants. The alignment algorithm identifies matching positions for each read using 12-mer sequences. When a match is found with the highest uniqueness score, the alignment is extended. Once reads are aligned, mutation positions are identified and highlighted. The aligned reads, consensus sequence, reference sequence and mutation calls as well as complete annotation information, are displayed in a single view when the project is completed. Specialized reports are also available in the results.
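The seed-and-extend idea can be illustrated with the Python sketch below. It is not NextGENe's algorithm: the "uniqueness score" is not defined here, so the sketch assumes the most unique seed is the 12-mer with the fewest occurrences in the reference, and it extends without gaps. All names and sequences are hypothetical.

```python
from collections import defaultdict

K = 12  # seed length, as stated in the text

def build_index(reference, k=K):
    """Map every k-mer in the reference to its list of start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k=K, max_mismatches=2):
    """Seed with the read k-mer that has the fewest reference hits
    (assumed stand-in for the uniqueness score), then extend ungapped.

    Returns (start, mismatches) or None; each mismatch is a tuple
    (reference_position, reference_base, read_base).
    """
    best = None  # (hit_count, offset_in_read, reference_position)
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k], [])
        if hits and (best is None or len(hits) < best[0]):
            best = (len(hits), offset, hits[0])
    if best is None:
        return None
    _, offset, ref_pos = best
    start = ref_pos - offset
    if start < 0 or start + len(read) > len(reference):
        return None
    mismatches = [(start + i, reference[start + i], base)
                  for i, base in enumerate(read)
                  if reference[start + i] != base]
    if len(mismatches) > max_mismatches:
        return None
    return start, mismatches

# Hypothetical reference; the read copies positions 5-24 with one A>G change.
reference = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCAT"
read = reference[5:20] + "G" + reference[21:25]
index = build_index(reference)
result = align_read(read, reference, index)
print(result)
```

The mismatch list corresponds to the highlighted mutation positions described above; a production aligner would also handle gaps, both strands and multiple candidate seeds.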
1. T J Albert, et al. 2007. Direct selection of human genomic loci by microarray hybridization. Nature Methods. 4: 903-905.
Reducing Error Next Generation Sequencing App Note
Alignment of Paired Read Data
The alignment of paired reads (paired end or mate-paired reads) to a reference sequence improves the accuracy of alignments and allows the detection of structural variations such as large insertions and deletions, inversions and translocations. Paired reads are expected to align a known distance from each other with an expected relative orientation. Identifying pairs where the distance between reads or their orientation differs from the expected can indicate positions of structural rearrangements.
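The check described above can be sketched in Python as follows. The sketch assumes paired-end data in which mates are expected on opposite strands, and it measures distance as the difference between alignment start positions; the function name, labels, and tolerance parameter are illustrative assumptions.

```python
def classify_pair(pos1, strand1, pos2, strand2, expected_gap, tolerance):
    """Classify one aligned read pair.

    pos1, pos2: alignment start positions of the two mates.
    strand1, strand2: '+' or '-'.
    expected_gap, tolerance: a pair is in range when its distance is
        within expected_gap +/- tolerance (assumed definition).
    """
    if strand1 == strand2:
        return "unexpected_orientation"   # mates point the same way
    gap = abs(pos2 - pos1)
    if abs(gap - expected_gap) > tolerance:
        return "gap_out_of_range"         # possible insertion or deletion
    return "concordant"

# Hypothetical pairs for a library with an expected distance of 200 bp.
print(classify_pair(100, "+", 300, "-", expected_gap=200, tolerance=50))
print(classify_pair(100, "+", 5300, "-", expected_gap=200, tolerance=50))
print(classify_pair(100, "+", 300, "+", expected_gap=200, tolerance=50))
```

Pairs flagged by either test are candidates for structural rearrangements and would normally be reviewed together with neighbouring pairs rather than individually.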
NextGENe utilizes a unique technology to align paired read data to a reference and track the distance between paired reads and the orientation of pairs. Several reports are created that provide information about the alignment of pairs and specialized display features are available for visualizing results including detected variants.

Results shown above are for the alignment of paired end Illumina data with a library size of 200bp. The Paired View is displayed below the Whole Genome View and shows that most pairs are aligned with a gap distance near 200bp. Green bars indicate pairs in a localized region that are oriented in the same direction, while blue bars indicate pairs that align in opposite orientation to each other.
Several Paired Reports are generated to provide a list of all pairs that align to the reference with a gap distance outside of the expected range. Separate reports are created for pairs in which reads align in the same direction and in opposite directions.
The Paired Reads Gap Distribution shows the distribution of pairs with continuous gap sizes. Two charts are shown; the top chart shows the gap sizes for pairs that are oriented in opposite directions while the bottom chart shows pairs that are oriented in the same direction.
Paired Read Statistics are also available. They include the matched read count, the number of matched reads with a gap distance in the expected range, the number of reads that matched the reference while their mate did not, and the number of pairs in which neither read matched.
Alignment of Paired Read Application Note
Trademarks are property of their respective owners.