Welcome to RNAcode Web, a free web server for the prediction of protein
coding regions.
The web service uses
RNAcode
as its core utility to determine the coding potential of a nucleotide
sequence. While RNAcode can be used to build complete annotations based on
a full genome alignment, the web service aims for the prediction of
smaller genomic regions of interest. The main disadvantage of larger input
is the presentation of the results, which eventually becomes
confusing. Hence genomic regions that span many genes are inadvisable,
although technically possible.
RNAcode performs a rigorous statistical analysis of the evolutionary information
provided by a sequence alignment. The web service builds this sequence
alignment fully automatically and uses RNAcode to analyse it. The user
only needs to provide the region as a nucleotide sequence. The web
service uses the blast search engine to find homologous sequences in the
RefSeq DB ("Representative Genome Database") or the nt DB ("non-redundant
nucleotide database"), as chosen by the user. From the results the web service
selects a set of sequences that provide a robust evolutionary signal. This
set is used to compute a multiple sequence alignment, which RNAcode
then analyses.
For the analysis the assumption is made that the user provides a genomic
region as input. Hence the web service will not work with spliced regions
or RNA sequences.
Furthermore, the web service ignores paralogs and thus allows only one
homolog per species.
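The one-homolog-per-species rule can be pictured with a short sketch. How the web service decides which hit to keep for a species is not stated here; the sketch assumes the highest-scoring hit wins, and the field names (`id`, `species`, `bitscore`) are hypothetical.

```python
def best_hit_per_species(hits):
    """Keep a single hit per species (assumption: the one with the highest bitscore)."""
    best = {}
    for hit in hits:
        species = hit["species"]
        if species not in best or hit["bitscore"] > best[species]["bitscore"]:
            best[species] = hit
    return list(best.values())


# Hypothetical blast hits; the second E. coli hit stands in for a paralog.
hits = [
    {"id": "seqA", "species": "Escherichia coli", "bitscore": 310.0},
    {"id": "seqB", "species": "Escherichia coli", "bitscore": 250.0},
    {"id": "seqC", "species": "Salmonella enterica", "bitscore": 190.0},
]

selected = best_hit_per_species(hits)
print([hit["id"] for hit in selected])
```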
In principle, RNAcode is nothing other than an annotation tool for
protein-coding genes. Annotation tools can be roughly classified into three
categories: feature-based methods (e.g.
glimmer,
prodigal),
homology-based methods (e.g. blast), and
alignment-based methods (RNAcode).
Compared with the feature-based and homology-based methods, alignment-based
annotation has both advantages and disadvantages. On the downside is
a comparatively longer computation time, especially for the construction of the
alignment. On the upside, the method does not need any a priori
information. A homology-based annotation needs a protein database, and a
feature-based method needs one or more annotated genomes from which it can
learn its features. Alignment-based annotation, on the other hand, only
needs observed sequences (assembled genomes with no annotation). This is
an advantage insofar as both other methods are highly biased towards
known proteins, while RNAcode has no such bias and thus is well suited to
predict unknown protein sequences with unusual features.
Furthermore, the result of an alignment-based annotation carries in itself
biological evidence of the protein-coding potential of a sequence, as the
analysis tries to find negative evolutionary selection acting to conserve a
protein sequence, which in itself is evidence of coding potential. A
feature-based annotation, in contrast, only provides evidence insofar
as a called gene has features similar to other known genes.
Thus the alignment-based method shines for specific regions of interest for which
other methods find only a weak signal or none, in particular for organisms
that are phylogenetically distant from many model organisms or show
an uncharacteristic genomic organisation.
The starting point of alignment based methods is a nucleotide alignment.
This alignment is analysed to find evidence for negative selection on the
underlying protein sequence, which means that certain mutations are much rarer than
expected. Silent (synonymous) mutations are a straightforward example: alignments which
only show mutations that do not change the amino acid sequence can be
expected to be protein coding. Other mutations, which change the amino
acid but only to a biochemically similar one (like leucine and isoleucine), are
called conservative mutations and also provide evidence of coding
potential. In addition to these mutations, RNAcode evaluates reading frame
conservation and stop codons.
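The mutation classes described above can be illustrated with a minimal sketch that compares two aligned codons. It is not RNAcode's scoring model: the standard genetic code is used, and the `SIMILAR_GROUPS` grouping of biochemically similar amino acids is a simplified assumption made only for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical, simplified groups of biochemically similar amino acids.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("NQ"), set("ST")]


def classify_substitution(codon_a, codon_b):
    """Classify the difference between two aligned codons."""
    codon_a, codon_b = codon_a.upper(), codon_b.upper()
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    if codon_a == codon_b:
        return "identical"
    if aa_a == aa_b:
        return "silent"          # nucleotide change, same amino acid
    if "*" in (aa_a, aa_b):
        return "stop"            # one side introduces a stop codon
    if any(aa_a in group and aa_b in group for group in SIMILAR_GROUPS):
        return "conservative"    # similar amino acid, e.g. leucine/isoleucine
    return "radical"


for pair in [("CTT", "CTC"), ("CTT", "ATT"), ("CTT", "CCT"), ("TGG", "TGA")]:
    print(pair, classify_substitution(*pair))
```

A coding region under negative selection is expected to show mostly "silent" and "conservative" differences, few "radical" ones and no premature "stop".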
RNAcode Web runs four major steps. Steps 1 and 2 may be repeated until enough sequences have been found.
In the first step the full input sequence is searched with blast against the local RefSeq or nt database.
In the next step, sequences from the blast result are selected to build the alignment. The selected sequences are also extended so that they cover the complete input sequence. This is needed because most blast results do not cover the entire query sequence.
The first two steps are repeated up to 5 times and stop earlier
if enough sequences are found. The first hit that blast finds is set
as the reference for the rest of the search. In the following iterations the
search is then taxonomically restricted based on this reference. In
the second iteration only sequences belonging to the same order are
considered, in the third only those from the same class, then phylum and lastly
kingdom. These restrictions speed up the blast search.
To increase the sensitivity, the word size for the
blast search is decreased after each iteration.
The decrease in word size in combination with the taxonomic restriction
should ensure quick processing for input sequences which have easy-to-find
homologs, while guaranteeing enough sensitivity for input sequences where
this is not the case.
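The interplay of taxonomic restriction and word size can be sketched as a simple schedule. The order of ranks follows the description above; the concrete word sizes (`start_word_size`, `step`, `min_word_size`) and the `run_search` callback are hypothetical placeholders, since the values used by the web service are not given here.

```python
# Rank restriction per iteration: none in the first search, then order,
# class, phylum and kingdom relative to the reference hit.
TAXONOMIC_RANKS = [None, "order", "class", "phylum", "kingdom"]


def search_schedule(start_word_size=28, step=4, min_word_size=11):
    """Return (iteration, rank, word_size) tuples; the word sizes are assumed values."""
    schedule = []
    for i, rank in enumerate(TAXONOMIC_RANKS):
        word_size = max(min_word_size, start_word_size - i * step)
        schedule.append((i + 1, rank, word_size))
    return schedule


def iterate_until_enough(run_search, min_sequences):
    """Call run_search(rank, word_size) per iteration and stop early once enough
    sequences are available. run_search stands in for blast plus sequence
    selection and is assumed to return the sequences selected so far."""
    iteration, found = 0, []
    for iteration, rank, word_size in search_schedule():
        found = run_search(rank, word_size)
        if len(found) >= min_sequences:
            break
    return iteration, found


for iteration, rank, word_size in search_schedule():
    print(iteration, rank or "unrestricted", word_size)
```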
After sequences have been selected from the blast result, they are aligned with Clustal Omega.
In the last step the alignment is used as input for RNAcode, which predicts protein-coding regions.
There are a couple of use cases in which RNAcode Web can help. It can be used to validate and/or find protein-coding regions.
For example, a genomic region shows evidence of translation and the user wants evolutionary evidence that the region is truly coding. In this case the user simply provides the genomic region as a nucleotide sequence.
Alternatively, a larger stretch of the genome might be of interest, but no clear region or exon structure could be established. In this case the stretch can be used as input and RNAcode will try to identify which frames are likely coding.
For long input regions the sequence is split into multiple overlapping smaller sequences. For each sequence an independent job is started.
Each of these jobs gets the numbered suffix "-child_n". On the webpage
for the parent job, information and results of the children are
aggregated. If results in child jobs overlap, they are merged.
If a job is split up into more than 3 child jobs, a custom blast database
is constructed based on the input sequence. The custom database is
constructed from the original NCBI database but only contains parts of
homologous sequences. This decreases the run time of
each child process dramatically.
The procedure is closely related to the workflow of a normal job. Blast searches and sequence selections are done in possibly multiple iterations. The blast searches decrease the word size in every iteration. Once a reference is found (this should normally be the case after the first search), a taxonomic restriction is imposed on the search. This taxonomic restriction is relaxed in every following iteration. In the sequence selection step the extension of the target sequence is increased to accommodate varying exon distances and gene synteny.
The three parameters relevant for the pipeline are:
The database parameter sets the nucleotide database on which the analysis is based.
The distance parameters that the user can set for a job are used in the second step, sequence selection. In this step the sequences retrieved by blast are filtered to build an optimal set for the alignment. We therefore explain the steps of the sequence selection in detail and highlight where each parameter is used.
The user provides the web service with an upper and a lower bound on the
distance between the sequences in the alignment set. The minimal
pairwise distance is the minimal distance allowed between any two sequences in the
set of selected sequences. The maximal pairwise distance only applies
between any sequence and the input sequence.
The distance is computed by building a local alignment between
two sequences using the following scoring scheme:
From this alignment the distance is defined as: $$dist = \left(1 - \frac{\#\,matches}{length(shorter\_sequence)}\right) \times 100$$
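As a worked illustration, the distance and the two bounds can be written down in a few lines. The sketch reads the formula as a percentage, i.e. (1 - matches / length of the shorter sequence) × 100, and takes an already computed pairwise alignment as input, because the scoring scheme is not reproduced here. The function names and the way the bounds are checked are assumptions, not the web service's implementation.

```python
def count_matches(aligned_a, aligned_b):
    """Count identical, non-gap columns of two aligned strings of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")


def distance(matches, len_a, len_b):
    """Distance in percent relative to the shorter of the two sequences."""
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    return (1 - matches / shorter) * 100


def within_bounds(dist_to_input, dists_to_selected, min_dist, max_dist):
    """Assumed check: the maximal distance applies only against the input
    sequence, the minimal distance against every already selected sequence."""
    return dist_to_input <= max_dist and all(d >= min_dist for d in dists_to_selected)


matches = count_matches("ACG-TTGA", "ACGATTCA")
print(matches, distance(matches, len_a=7, len_b=8))
```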
The selection runs through multiple steps.
Here the user can provide an email address which will receive a
notification when the job has ended.
A unique name for the job must also be provided. This name is used
by the web service and the user to identify the job.
The web service provides several representations of the RNAcode results if the job finished successfully. The results can be separated into alignment results and RNAcode results. If RNAcode did not find any significant coding region, only the alignment results are shown. If no alignment could be constructed, no results are shown. In both cases the job is considered a failure.
For each result set a plot of the high-scoring segments (HSS) is generated. The segments are drawn as arrows with their frame encoded in different colors. The name of each HSS is drawn next to its arrow. If the job was split up into different child processes, you can see at the bottom which part corresponds to which child.
The main result of the web service is a table showing each segment that has coding potential. Each row shows one potential coding region. The table has 11 fields, which are described here. Multiple HSS can be merged into one if different child processes have an overlapping segment in the same frame.
A protein list is shown for each HSS. The list can be downloaded as a fasta file.
The results can be filtered by p-value and, optionally, so that only the best
scoring segment is shown where segments overlap.
The results are then updated accordingly. High-scoring segments often
show a statistical "shadow" on the opposite strand. This is because a region
that overlaps a coding region is not independent of it in its codon
distribution. Normally these segments score worse than the correct region.
Also, while some examples are known, it is unusual for two coding regions
to overlap in different frames.
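The effect of the two filters can be sketched on a small, made-up result set. The field names, the default cutoff and the greedy "highest score first" strategy are assumptions for illustration and not the web service's code.

```python
def filter_segments(segments, p_cutoff=0.05, best_only=True):
    """Keep segments with p_value <= p_cutoff; with best_only, drop every
    segment that overlaps a better-scoring one (coordinates are inclusive)."""
    kept = [s for s in segments if s["p_value"] <= p_cutoff]
    if not best_only:
        return kept
    best = []
    for seg in sorted(kept, key=lambda s: s["score"], reverse=True):
        if all(seg["end"] < b["start"] or seg["start"] > b["end"] for b in best):
            best.append(seg)
    return sorted(best, key=lambda s: s["start"])


# Made-up example: HSS1 plays the role of a "shadow" of HSS0 on the opposite strand.
segments = [
    {"name": "HSS0", "start": 10, "end": 200, "frame": "+1", "score": 45.0, "p_value": 1e-6},
    {"name": "HSS1", "start": 50, "end": 180, "frame": "-2", "score": 20.0, "p_value": 0.01},
    {"name": "HSS2", "start": 300, "end": 400, "frame": "+3", "score": 15.0, "p_value": 0.2},
    {"name": "HSS3", "start": 500, "end": 600, "frame": "+1", "score": 18.0, "p_value": 0.001},
]

print([s["name"] for s in filter_segments(segments)])
print([s["name"] for s in filter_segments(segments, best_only=False)])
```

Because a shadow normally scores worse than the true coding region, keeping only the best-scoring segment among overlapping ones tends to remove it.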
The results show information about the input alignment for RNAcode,
even if no coding regions could be predicted.
If the job was split into multiple child processes, a table with simple
alignment information for each child is shown, consisting of the child id,
the number of sequences in the alignment, the names of the species in the alignment and
whether RNAcode was successful.
If the job was not split, an alignment tree and the full alignment are
shown.
For jobs which were not split, a phylogenetic tree is shown; this is the guide tree that Clustal Omega used for the alignment. The tree was built by Clustal Omega itself.
Furthermore, the user can download the folder containing all results which were created for the analysis. This includes the necessary scripts to reproduce the calculation and all intermediate results.