Documentation

Table of contents

Introduction

Principle of RNAcode

Workflow

Use cases

Job splitting and parallelization

Introduction

Welcome to RNAcode Web, a free web server for the prediction of protein-coding regions.
The web service uses RNAcode as its core utility to determine the coding potential of a nucleotide sequence. While RNAcode can be used to build complete annotations from a full genome alignment, the web service targets smaller genomic regions of interest. The main drawback of larger inputs is the presentation of the results, which eventually becomes confusing. Genomic regions that span many genes are therefore technically possible but not recommended.
RNAcode performs a rigorous statistical analysis of the evolutionary information contained in a sequence alignment. The web service builds this alignment fully automatically and uses RNAcode to analyse it. The user only needs to provide the region as a nucleotide sequence. The web service uses the BLAST search engine to find homologous sequences in either the RefSeq database ("Representative Genome Database") or the nt database ("non-redundant nucleotide database"), as chosen by the user. From the results, the web service selects a set of sequences that provides a robust evolutionary signal. This set is used to compute a multiple sequence alignment, which RNAcode then analyses.
The analysis assumes that the user provides a genomic region as input. Hence the web service will not work with spliced regions or RNA sequences.
Furthermore, the web service ignores paralogs and therefore allows only one homolog per species.

Principle of RNAcode

In principle, RNAcode is an annotation tool for protein-coding genes. Annotation tools can be roughly classified into three categories: feature-based methods (e.g. Glimmer, Prodigal), homology-based methods (e.g. BLAST) and alignment-based methods (e.g. RNAcode).
Compared with feature-based and homology-based methods, alignment-based annotation has both advantages and disadvantages. On the downside, the computation takes comparatively longer, especially the construction of the alignment. On the upside, the method does not need any a priori information. A homology-based annotation needs a protein database, and a feature-based method needs one or more annotated genomes from which it can learn its features. Alignment-based annotation, on the other hand, only needs observed sequences (assembled genomes without annotation). This is an advantage because the other two methods are strongly biased towards known proteins, whereas RNAcode has no such bias and is therefore well suited to predicting unknown protein sequences with unusual features.
Furthermore, the result of an alignment-based annotation is by itself biological evidence for the protein-coding potential of a sequence: the analysis looks for negative selection that conserves a protein sequence, which is in itself evidence of coding potential. A feature-based annotation, in contrast, only provides evidence in so far as a called gene has features similar to other known genes.
Alignment-based methods therefore shine for specific regions of interest for which other methods find only a weak signal or none at all, in particular in organisms that are phylogenetically distant from the common model organisms or that show an uncharacteristic genomic organisation.
The starting point of alignment-based methods is a nucleotide alignment. This alignment is analysed for evidence of negative selection on the underlying protein sequence. Under such selection, mutations that alter the protein are much rarer than expected. Silent (synonymous) mutations are a straightforward example: an alignment in which the observed mutations do not change the amino acid sequence can be expected to be protein coding. Mutations that replace an amino acid with a biochemically similar one (such as leucine and isoleucine) are called conservative mutations and also provide evidence of coding potential. In addition to these mutations, RNAcode evaluates reading frame conservation and stop codons.
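
The kinds of substitution mentioned above can be told apart for a single pair of aligned codons, as the following Python sketch illustrates. It is only an illustration: the grouping of similar amino acids in SIMILAR_GROUPS is a hand-picked assumption and not the scoring scheme that RNAcode itself applies.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumption: a coarse grouping of biochemically similar residues,
# chosen for illustration only.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def classify_codon_pair(ref_codon, other_codon):
    """Classify the difference between two aligned codons."""
    ref_codon, other_codon = ref_codon.upper(), other_codon.upper()
    ref_aa, other_aa = CODON_TABLE[ref_codon], CODON_TABLE[other_codon]
    if ref_codon == other_codon:
        return "identical"
    if ref_aa == other_aa:
        return "silent"          # nucleotides differ, amino acid does not
    if "*" in (ref_aa, other_aa):
        return "stop"            # a stop codon appears or disappears
    if any(ref_aa in g and other_aa in g for g in SIMILAR_GROUPS):
        return "conservative"    # similar amino acid, e.g. Leu and Ile
    return "radical"


print(classify_codon_pair("CTT", "CTC"))  # Leu and Leu
print(classify_codon_pair("CTT", "ATT"))  # Leu and Ile
```

Many silent and conservative pairs in one reading frame, and few radical or stop pairs, are the pattern expected for a coding region.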

Workflow

RNAcode Web runs four major steps. Steps 1 and 2 may be repeated until enough sequences have been found.

Blast
Sequence Selection
Alignments
RNAcode

1. Blast

In the first step, the full input sequence is searched with BLAST against the local RefSeq or nt database.
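
A search of this kind could be started from Python roughly as follows. This is a sketch under assumptions: the BLAST+ program blastn is taken to be on the PATH, and the database name, file names and word size are placeholders. The options that the web service really passes are not documented on this page.

```python
def build_blastn_command(query_fasta, database, word_size, out_path, taxids=None):
    """Assemble a blastn call that writes tabular output (-outfmt 6)."""
    command = [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-word_size", str(word_size),
        "-outfmt", "6",
        "-out", out_path,
    ]
    if taxids:
        # Optional taxonomic restriction as a comma-separated list of taxonomy IDs.
        command += ["-taxids", ",".join(str(t) for t in taxids)]
    return command


# Hypothetical file names; pass the list to subprocess.run(command, check=True).
command = build_blastn_command("input.fasta", "nt", 28, "hits.tsv")
print(" ".join(command))
```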

2. Sequence selection

In the next step, sequences from the BLAST result are selected to build the alignment. In addition, each selected hit is extended so that it covers the complete input sequence. This is necessary because most BLAST hits do not cover the entire query sequence.
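
The extension of a hit can be sketched as simple coordinate arithmetic. The function below is an illustration, not the code of the web service: it assumes 1-based inclusive coordinates as in BLAST tabular output, treats a hit with s_start > s_end as a minus-strand hit, ignores gaps in the hit, and the padding parameter is a hypothetical extra margin.

```python
def extend_hit(query_len, q_start, q_end, s_start, s_end, subject_len, padding=0):
    """Extend subject coordinates so that the hit spans the whole query."""
    missing_left = (q_start - 1) + padding       # query bases before the hit
    missing_right = (query_len - q_end) + padding  # query bases after the hit
    if s_start <= s_end:
        # Plus strand: subject coordinates grow with the query.
        new_start = max(1, s_start - missing_left)
        new_end = min(subject_len, s_end + missing_right)
    else:
        # Minus strand: subject coordinates shrink while the query grows.
        new_start = min(subject_len, s_start + missing_left)
        new_end = max(1, s_end - missing_right)
    return new_start, new_end


# A hit covering query positions 11..90 of a 100 nt input sequence.
print(extend_hit(100, 11, 90, 1011, 1090, 5000))
```

The result is clipped to the ends of the subject sequence, so an extended hit can still be shorter than the query when the homolog lies at a contig border.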

Repeat 1. and 2.

The first two steps are repeated up to 5 times and stop earlier if enough sequences are found. The first hit that BLAST finds is set as the reference for the rest of the search. In the following iterations, the search is taxonomically restricted based on this reference. In the second iteration only sequences belonging to the same order are considered, in the third only sequences from the same class, then from the same phylum and lastly from the same kingdom. These restrictions speed up the BLAST search.
To increase the sensitivity, the BLAST word size is decreased after each iteration.
The decreasing word size in combination with the taxonomic restriction should ensure quick processing for input sequences with easy-to-find homologs, while guaranteeing enough sensitivity for input sequences where this is not the case.
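
The iteration scheme can be written down as a small schedule. In the sketch below the order of taxonomic ranks follows the description above, while the word sizes (start value, step and lower limit) are hypothetical placeholders, and search stands for a caller-supplied function that runs steps 1 and 2.

```python
# Taxonomic restriction per iteration; None means an unrestricted first search.
RANKS = [None, "order", "class", "phylum", "kingdom"]


def search_schedule(start_word_size=28, step=4, min_word_size=7):
    """Return (iteration, rank, word_size) tuples; word sizes are assumptions."""
    schedule = []
    for iteration, rank in enumerate(RANKS, start=1):
        word_size = max(min_word_size, start_word_size - step * (iteration - 1))
        schedule.append((iteration, rank, word_size))
    return schedule


def collect_homologs(search, needed):
    """Run search(rank, word_size) until enough distinct sequences are found."""
    found = []
    for _iteration, rank, word_size in search_schedule():
        for seq_id in search(rank, word_size):
            if seq_id not in found:
                found.append(seq_id)
        if len(found) >= needed:
            break  # stop early, as described above
    return found


for row in search_schedule():
    print(row)
```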

3. Alignment

After the selection of sequences from the BLAST result, the sequences are aligned with Clustal Omega.

4. RNAcode

In the last step, the alignment is used as input for RNAcode, which predicts protein-coding regions.
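
Steps 3 and 4 correspond to two command-line calls. The sketch below only assembles the commands and assumes that the clustalo and RNAcode executables are on the PATH. The file names and the p-value cutoff of 0.05 are placeholders, and the exact options used by the web service may differ.

```python
def build_clustalo_command(fasta_in, alignment_out, guide_tree_out):
    """Align the selected sequences and keep the guide tree."""
    return [
        "clustalo",
        "-i", fasta_in,
        "-o", alignment_out,
        "--outfmt=clustal",                    # Clustal format alignment
        f"--guidetree-out={guide_tree_out}",   # guide tree written by Clustal Omega
        "--force",                             # overwrite existing output files
    ]


def build_rnacode_command(alignment, result_out, p_cutoff=0.05):
    """Score the alignment; the cutoff value is an assumption of this sketch."""
    return [
        "RNAcode",
        "--tabular",
        "--cutoff", str(p_cutoff),
        "--outfile", result_out,
        alignment,
    ]


print(" ".join(build_clustalo_command("selected.fasta", "aln.clustal", "tree.dnd")))
print(" ".join(build_rnacode_command("aln.clustal", "rnacode.tsv")))
```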

Use cases

There are a couple of use cases for which RNAcode Web can help. It can be used to validate and/or find protein-coding regions.

For example, a genomic region shows evidence of translation and the user wants evolutionary evidence that the region is truly coding. In this case, the user simply provides the genomic region as a nucleotide sequence.

Alternatively, a larger stretch of the genome might be of interest for which no clear region or exon structure could be established. In this case, the stretch can be used as input and RNAcode will try to identify which frames are likely coding.

Job splitting and parallelization

For long input regions, the sequence is split into multiple smaller, overlapping sequences. An independent job is started for each of them.

Each of these jobs gets the numbered suffix "-child_n". The web page of the parent job aggregates the information and results of its children. If results in child jobs overlap, they are merged.
If a job is split into more than 3 child jobs, a custom BLAST database is constructed based on the input sequence. The custom database is built from the original NCBI database but only contains parts of homologous sequences. This decreases the run time of each child process dramatically.
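
Splitting an input into overlapping windows can be sketched as follows. The window length and the overlap are hypothetical values, since the thresholds used by the web service are not stated here, and numbering the children from 1 is likewise an assumption.

```python
def split_job(job_name, sequence, window=1000, overlap=200):
    """Return (child_name, offset, subsequence) tuples for overlapping windows."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if len(sequence) <= window:
        return [(job_name, 0, sequence)]  # short input: no split needed
    step = window - overlap
    children = []
    start = 0
    while True:
        end = min(start + window, len(sequence))
        name = f"{job_name}-child_{len(children) + 1}"
        children.append((name, start, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return children


for name, offset, sub in split_job("myjob", "ACGTACGTAC", window=4, overlap=2):
    print(name, offset, sub)
```

Keeping the offset of each child makes it possible to map child results back to positions in the parent sequence before overlapping results are merged.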

The procedure is closely related to the workflow of a normal job. BLAST searches and sequence selections are done in possibly multiple iterations. The BLAST searches decrease the word size in every iteration. Once a reference is found (normally after the first search), a taxonomic restriction is imposed on the search, and this restriction is relaxed in every following iteration. In the sequence selection step, the extension of the target sequence is increased to account for varying exon distances and gene synteny.

Parameters

The three parameters relevant for the pipeline are:

Maximal pairwise distance
Minimal pairwise distance
Database type

The database type parameter sets the nucleotide database on which the analysis is based.

Distance

The distance parameters that the user can set for a job are used in the second step, sequence selection. In this step, the sequences retrieved by BLAST are filtered to build an optimal set for the alignment. The steps of the sequence selection are therefore explained in detail here, highlighting where each parameter is used.

The user provides the web service with a lower and an upper bound on the distance between the sequences in the alignment set. The minimal pairwise distance is the smallest distance allowed between any two sequences in the set of selected sequences. The maximal pairwise distance only applies between each selected sequence and the input sequence.
The distance is computed by building a local alignment between two sequences using the following scoring scheme:

Match: +1
Mismatch: -1
Open gap: -1
Extend gap: -1

From this alignment the distance is defined as: $$dist = \left(1 - \frac{\# matches}{length(shorter\_sequence)}\right) \cdot 100$$
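
The distance can be computed with a plain local alignment. The sketch below is a minimal Smith-Waterman implementation with the scoring scheme listed above, where a gap of any length costs 1 per position. It reads the formula as a percentage, that is, one minus the fraction of matches, multiplied by 100, which is an interpretation rather than a documented detail. It is not the implementation used by the web service.

```python
def local_alignment_matches(a, b, match=1, mismatch=-1, gap=-1):
    """Count matching positions in one optimal local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # Trace back from the best cell until the local alignment ends (score 0).
    i, j = best_pos
    matches = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return matches


def distance(a, b):
    """Percentage distance: 0 for identical sequences, 100 for no matches."""
    matches = local_alignment_matches(a, b)
    return (1 - matches / min(len(a), len(b))) * 100


print(distance("ACGTACGT", "ACGTTCGT"))
```

The table-based alignment needs time and memory proportional to the product of the two sequence lengths, which is acceptable for a sketch but slow for long genomic regions.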

The selection runs through multiple steps.

All results are removed that are too close to the input sequence, based on the minimal pairwise distance.
All results are removed that are too far away from the input sequence, based on the maximal pairwise distance.
A density-based clustering algorithm is used to remove sequences such that no two sequences in the set have a pairwise distance below the minimal pairwise distance.
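
The three steps can be sketched in a few lines of Python. Note the substitution: the last step below is a simple greedy filter that stands in for the density-based clustering, whose details are not described here. The distance function is passed in by the caller, and the Hamming-based percent_hamming is only a toy replacement for the alignment-based distance.

```python
def select_sequences(query, candidates, dist, min_dist, max_dist):
    """Filter candidates by distance to the query, then thin out close pairs."""
    # Steps 1 and 2: drop hits that are too close to or too far from the query.
    kept = [c for c in candidates if min_dist <= dist(query, c) <= max_dist]
    # Step 3 (greedy stand-in for the clustering): keep a sequence only if it
    # is at least min_dist away from every sequence selected so far.
    selected = []
    for cand in kept:
        if all(dist(cand, other) >= min_dist for other in selected):
            selected.append(cand)
    return selected


def percent_hamming(a, b):
    """Toy distance for equal-length sequences, in percent."""
    return 100 * sum(x != y for x, y in zip(a, b)) / len(a)


query = "AAAAAAAAAA"
hits = ["AAAAAAAAAA", "AAAAAAAATT", "AAAAAAATTT", "TTTTTTTTAA", "CCCAAAAAAA"]
print(select_sequences(query, hits, percent_hamming, min_dist=15, max_dist=50))
```

A greedy filter depends on the order of the candidates, so a clustering approach can retain a different and possibly larger set.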

Query Information

Here the user can provide an email address that will receive a notification when the job has finished.
A unique name for the job must also be provided. This name is used by the web service and the user to identify the job.

Results

The web service provides several representations of the RNAcode results if the job finished successfully. The results are divided into alignment results and RNAcode results. If RNAcode did not find any significant coding region, only the alignment results are shown. If no alignment could be constructed, no results are shown. In both cases the job is considered a failure.

Coding Regions

High Scoring Segment plot

For each result set, a plot of the High Scoring Segments (HSS) is generated. The segments are drawn as arrows, with their frame encoded by color. The name of each HSS is drawn next to its arrow. If the job was split into child processes, the bottom of the plot shows which part corresponds to which child.

Example of a high scoring segment plot.

High Scoring Segment table

The main result of the web service is a table showing each segment that has coding potential. Each row shows one potential coding region. The table has 11 fields, which are described here. Multiple HSS can be merged into one if different child processes have an overlapping segment in the same frame.

1. HSS id: Unique running number for each high scoring segment predicted in one RNAcode call.
2. Strand: The strand of the segment. Minus indicates that the predicted region is on the reverse complement strand.
3. Frame: The reading frame phase relative to the first nucleotide position of the reference sequence. 1 means that the first nucleotide of the reference sequence is in the same frame as the predicted coding region.
4. Length: The length of the predicted region in amino acids.
7. Start, 8. End: The nucleotide positions of the predicted coding region in the reference sequence. The first nucleotide position in the reference sequence is set to 1.
9. Score: The coding potential score. High scores indicate high coding potential.
10. P: The p-value associated with the score. This is the probability that a random segment with the same properties contains an equally good or better hit.
11. Show Results: Plot of the high scoring segment.
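
Merging overlapping segments from different child processes can be sketched as an interval merge per strand and frame. The sketch assumes that positions and frames have already been converted to the coordinates of the parent sequence. How the web service combines scores and p-values of merged segments is not described here, so keeping the best score and the lowest p-value is an assumption.

```python
from collections import namedtuple

HSS = namedtuple("HSS", "strand frame start end score p")


def merge_hss(segments):
    """Merge overlapping segments that share strand and frame."""
    merged = []
    for seg in sorted(segments, key=lambda s: (s.strand, s.frame, s.start)):
        last = merged[-1] if merged else None
        same_frame = last is not None and (last.strand, last.frame) == (seg.strand, seg.frame)
        if same_frame and seg.start <= last.end:
            # Assumption: the merged segment keeps the best score and p-value.
            merged[-1] = HSS(last.strand, last.frame, last.start,
                             max(last.end, seg.end),
                             max(last.score, seg.score),
                             min(last.p, seg.p))
        else:
            merged.append(seg)
    return merged


children = [
    HSS("+", 1, 10, 100, 20.0, 1e-5),   # from one child
    HSS("+", 1, 90, 150, 15.0, 1e-3),   # overlapping hit from the next child
    HSS("-", 2, 90, 150, 8.0, 0.01),    # different strand, kept separate
]
for hss in merge_hss(children):
    print(hss)
```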

Predicted protein sequence

A protein list is shown for each HSS. The list can be downloaded as a fasta file.

Filter regions

The results can be filtered by p-value, and optionally only the best-scoring segment is shown when segments overlap.
The results are then updated accordingly. High scoring segments often show a statistical "shadow" on the opposite strand. This happens because the codon distribution of a region that overlaps a coding region is not independent of that coding region. Normally these shadow segments score worse than the correct region. Also, while some examples are known, it is unusual for two coding regions to overlap in different frames.
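
Both filters can be sketched in a few lines. The example below is an assumption about how such a filter might work, not the code behind the web page: segments are plain dictionaries with hypothetical keys, and overlap is checked on positions only, regardless of strand and frame, so that a shadow on the opposite strand is removed in favour of the better-scoring segment.

```python
def filter_hss(segments, p_cutoff=0.05, best_only=True):
    """Keep significant segments; optionally drop overlapped, weaker ones."""
    kept = [s for s in segments if s["p"] <= p_cutoff]
    if not best_only:
        return kept
    selected = []
    # Visit segments from best to worst score and keep those that do not
    # overlap an already selected segment.
    for seg in sorted(kept, key=lambda s: s["score"], reverse=True):
        overlaps = any(seg["start"] <= o["end"] and o["start"] <= seg["end"]
                       for o in selected)
        if not overlaps:
            selected.append(seg)
    return sorted(selected, key=lambda s: s["start"])


segments = [
    {"name": "real", "strand": "+", "start": 10, "end": 300, "score": 40.0, "p": 1e-8},
    {"name": "shadow", "strand": "-", "start": 50, "end": 250, "score": 12.0, "p": 0.01},
    {"name": "weak", "strand": "+", "start": 400, "end": 500, "score": 9.0, "p": 0.2},
    {"name": "second", "strand": "+", "start": 600, "end": 700, "score": 15.0, "p": 0.001},
]
print([s["name"] for s in filter_hss(segments)])
```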

Alignment Info

The results show information about the input alignment for RNAcode, even if no coding regions could be predicted.
If the job was split into multiple child processes, a table with basic alignment information for each child is shown, consisting of the child id, the number of sequences in the alignment, the names of the species in the alignment and whether RNAcode was successful.
If the job was not split, an alignment tree and the full alignment are shown.

Alignment Tree

For jobs that were not split, a phylogenetic tree is shown, which is the guide tree that Clustal Omega used for the alignment. The tree was built by Clustal Omega itself.

Download full results

Furthermore, the user can download the folder containing all results that were created for the analysis. This includes all intermediate results and the scripts necessary to reproduce the calculation.

Reference

Blast: Website
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990).
Basic local alignment search tool.
Journal of molecular biology, 215(3), 403–410.
https://doi.org/10.1016/S0022-2836(05)80360-2
Clustal Omega: Website
Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., Li W., Lopez R., McWilliam H., Remmert M., Söding J., Thompson J.D. and Higgins D.G. (2011)
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.
Mol. Syst. Biol. 7:539
https://doi.org/10.1038/msb.2011.75
RNAcode: Website
Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N.
RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data.
RNA. 2011 Apr;17(4):578-94.
https://doi.org/10.1261/rna.2536111

Software

Blast: version: 2.13.0+ https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.13.0/
Clustal Omega: version: 1.2.4 http://www.clustal.org/omega/
RNAcode: version: 0.3 https://github.com/ViennaRNA/RNAcode
R: version 4.1.1 https://cran.r-project.org/src/base/R-4/; ggplot2; version: 3.3.5; ggrepel; version: 3.3.5
python (webserver): version:3.6.8 https://www.python.org/downloads/release/python-368
filelock: version:3.4.1
coolname: version:1.1.0
validate_email: version:1.3
flask_limiter: version:1.5
apscheduler: version:3.9.1
ete3: version:3.1.2
jinja2: version:3.0.3
Bio: version:1.79
redis: version:4.3.4
python (cluster): version:3.7.7 https://www.python.org/downloads/release/python-377
ete3: version:3.1.2
Bio: version:1.78

Citing

TODO Publish or Perish