Glossary
SAPIENs
SAPIENs (Sequence-Activity Prediction In Ensemble of Networks) is
an ensemble of ten residual neural networks that forms the core prediction model underlying rbsXpress. It was
trained on approximately 250,000 experimentally tested RBS sequences. It takes a maximum of 17 nt upstream of
the AUG start codon as input and predicts the relative translation rate (rTR) for each query with high accuracy
(R²: 0.93, mean absolute error: 0.039). Furthermore, it provides an estimate of the confidence of each
prediction. More information on the model as well as the underlying data generation can be found
here.
Please note that the current version does not consider sequence features outside the 17 nt upstream of AUG. This
limitation will be addressed in future versions.
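As a minimal sketch of the input window described above, the following Python snippet extracts up to 17 nt directly upstream of a start codon at a known position. The function name `upstream_window` and the example sequence are hypothetical and are not part of rbsXpress.

```python
def upstream_window(mrna: str, start_index: int, max_len: int = 17) -> str:
    """Return up to max_len nucleotides directly upstream of the start codon.

    start_index is the 0-based position of the 'A' of the AUG start codon.
    DNA input is accepted and converted to RNA notation (T -> U).
    """
    mrna = mrna.upper().replace("T", "U")
    if mrna[start_index:start_index + 3] != "AUG":
        raise ValueError("no AUG start codon at start_index")
    # Shorter windows are returned if fewer than max_len nt are available.
    return mrna[max(0, start_index - max_len):start_index]


# Hypothetical example sequence (illustration only).
example = "GGGAAUUCAUUAAAGAGGAGAAAUUAAGCAUGCGUAAA"
window = upstream_window(example, example.index("AUG"))
print(window, len(window))
```

Anything further upstream than the window, and the coding sequence itself, is simply dropped here, which mirrors the limitation noted above.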
rTR (a.u.)
The rTR (relative translation rate; a.u.) is the proxy for the relative strength of RBSs as predicted by rbsXpress.
It corresponds to the linear slope of cell-specific protein accumulation within an interval of 290 minutes after
induction. The rTR is normalized to a scale ranging from 0 (“very weak”) to 100,000 (“very strong”) a.u. for
convenience. More details can be found here.
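To illustrate what "linear slope of protein accumulation" means, here is a minimal sketch of an ordinary least-squares slope over a time course. The function name `linear_slope` and the toy data are assumptions for illustration; the actual fitting procedure and the normalization to the 0 to 100,000 a.u. scale are not described in this glossary and are therefore not reproduced.

```python
def linear_slope(times, signal):
    """Ordinary least-squares slope of signal versus time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(signal) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, signal))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


# Toy time course (minutes after induction) with made-up signal values.
times = [0, 58, 116, 174, 232, 290]
signal = [2 * t + 5 for t in times]
slope = linear_slope(times, signal)
print(slope)
```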
uncertainty
Besides the rTR, rbsXpress specifies an estimate of the uncertainty of each prediction, which ranges from 0 to 100%.
The uncertainty is calculated from the deviation within the ensemble of ten neural networks that make up the core
predictor of rbsXpress (see also: SAPIENs).
It is a secondary parameter that can additionally help the user to select RBS sequences for which the model is “most sure”
about its predictions (the lower the percentage, the better).
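The idea of deriving an uncertainty from the spread within an ensemble can be sketched as follows. This is only an illustration: the ten prediction values are made up, the function name `ensemble_summary` is hypothetical, and how rbsXpress converts the deviation into a percentage between 0 and 100% is not specified here, so the snippet stops at the mean and standard deviation.

```python
from statistics import mean, stdev


def ensemble_summary(predictions):
    """Return (mean, sample standard deviation) of per-network predictions."""
    return mean(predictions), stdev(predictions)


# Made-up rTR predictions from ten ensemble members (illustration only).
member_predictions = [41000, 43500, 39800, 42200, 40700,
                      44100, 41900, 40300, 42800, 41500]
ensemble_mean, ensemble_spread = ensemble_summary(member_predictions)
print(ensemble_mean, round(ensemble_spread, 1))
```

A small spread relative to the mean indicates that the ensemble members agree, which corresponds to a low uncertainty.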
RedLibs
RedLibs (short for reduced libraries) is an algorithm developed to generate genetic variant libraries that uniformly
span the range of a desired numerical functional/phenotypic property. RedLibs was initially developed for RBS
prediction data as input, which is commonly strongly skewed towards weak RBSs (i.e. low rTRs). To reach a better
distribution and hence a more efficient search for optimal expression levels, RedLibs generates subsets of RBSs from
the skewed input library, which have the following three key characteristics. First, they are encoded by a single
degenerate sequence, allowing for facile cloning and experimental implementation of the library through the use of
conventional and cheap oligos. Second, they have a user-defined size (i.e. no. of RBS variants) smaller than that of
the input library, which ensures that they can be screened efficiently even at low experimental throughput. Third,
they uniformly span the entire accessible expression level/rTR range, avoiding the skew and redundancy of randomly
produced libraries such as the input.
Note that RedLibs accepts not only RBS prediction data but any data composed of pairs of sequences and numerical
values as input. More details on RedLibs can be found in this publication,
which also provides examples and protocols for RBS library cloning and testing.
RBS
The ribosome binding site (RBS) is a sequence upstream of the protein coding sequence in procaryotic mRNAs and
therefore part of the 5’-UTR. It contains the well-known Shine-Dalgarno (SD) sequence and controls the
rate-limiting initiation step in translation and thus the resulting protein level in bacterial cells. Modifying
the RBS sequence (i.e. RBS engineering) combines several attractive features, such as access to
orders-of-magnitude changes in protein levels, the relative adjustment of expression in polycistronic mRNAs and, most importantly,
the possibility to predict RBS “strength” in silico using RBS sequences as queries. These features make RBSs an
excellent target for expression level engineering in Synthetic Biology and Metabolic Engineering. For more
information about RBSs and the optimization of multi-protein systems please have a look
here.
Uniformity
The uniformity is a score indicative of the quality of libraries designed by RedLibs. It can reach a maximum of
100%, which corresponds to a library that perfectly matches a uniform target distribution. Generally, libraries with
values above 80% can be considered to follow the uniform target closely. For more details on how RedLibs evaluates and compares the
distribution of different libraries, please refer to the corresponding publication.
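To make the notion of "matching a uniform target distribution" concrete, the sketch below compares a set of values with a uniform distribution using a Kolmogorov-Smirnov-style distance between cumulative distributions. This is an assumption for illustration only: the glossary does not state how RedLibs computes its uniformity score, and the conversion `100 * (1 - distance)` as well as the function names are hypothetical.

```python
def ks_distance_to_uniform(values, low, high):
    """Largest gap between the empirical CDF of values and the uniform CDF on [low, high]."""
    xs = sorted(values)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        u = (x - low) / (high - low)  # uniform CDF evaluated at x
        # Check the gap just before and just after the step of the empirical CDF.
        distance = max(distance, abs((i + 1) / n - u), abs(u - i / n))
    return distance


def uniformity_percent(values, low, high):
    """Hypothetical score: 100% means no deviation from the uniform target."""
    return 100.0 * (1.0 - ks_distance_to_uniform(values, low, high))


# Toy libraries on a 0-100 scale (made-up values).
even_library = [12.5, 37.5, 62.5, 87.5]
skewed_library = [1, 2, 3, 4]
print(uniformity_percent(even_library, 0, 100))
print(uniformity_percent(skewed_library, 0, 100))
```

In this toy example the evenly spaced library scores far higher than the library clustered at the weak end, which is the qualitative behaviour a uniformity score is meant to capture.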
5’-UTR
The 5’-untranslated region (5’-UTR) is the part of the mRNA preceding the protein-coding sequence (CDS), i.e. reaching from
the transcriptional start site (TSS) to the nucleotide prior to the start codon (for polycistronic mRNAs also the part
between two CDSs).
CDS
The protein-coding sequence (CDS) contains the primary sequence information that is translated into a polypeptide
by the ribosome. It starts with a start codon (mainly AUG, in procaryotes also GUG and CUG) and ends with a
stop codon. While not part of the RBS itself, the CDS is known to impact translation initiation, mainly
through mRNA secondary structures. Thus, RBSs are context-dependent and need to be optimized
in a case-specific fashion. More details on sequence features that impact RBS behaviour, such as the CDS,
can be found here.
Degenerate RBS/sequence
A degenerate RNA or DNA sequence is one that contains at least one ambiguous/variable nucleotide (= “degenerate
nucleotide/base”). Degenerate bases can be designated using the IUPAC nucleotide code. As an example, the degenerate
sequence AGN corresponds to one (or a mix) of the sequences AGA, AGC, AGG, AGU.
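The AGN example can be reproduced with a short sketch that expands a degenerate sequence into all sequences it encodes. The dictionary follows the IUPAC nucleotide code in RNA notation; the names `IUPAC_RNA` and `expand_degenerate` are hypothetical helpers, not part of rbsXpress or RedLibs.

```python
from itertools import product

# IUPAC nucleotide code in RNA notation (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU",
    "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def expand_degenerate(sequence):
    """List every concrete sequence encoded by a degenerate sequence."""
    sequence = sequence.upper().replace("T", "U")
    options = (IUPAC_RNA[base] for base in sequence)
    return ["".join(combo) for combo in product(*options)]


print(expand_degenerate("AGN"))  # ['AGA', 'AGC', 'AGG', 'AGU']
```

The number of encoded sequences is the product of the number of options at each position, so the list grows quickly with every additional degenerate base.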
IUPAC nucleotide code
The IUPAC (International Union of Pure and Applied Chemistry) nucleotide code is a standardized system to represent
nucleotide sequences with ambiguity. In addition to the single-letter codes for adenine (A), cytosine (C), guanine (G),
and thymine (T), it defines codes for variable nucleotides. The full code is provided hereafter:
| Code | Base(s) | Description |
|------|---------|-------------|
| A | A | Adenine |
| C | C | Cytosine |
| G | G | Guanine |
| T | T | Thymine (in DNA) |
| U | U | Uracil (in RNA) |
| R | A, G | Purines |
| Y | C, T/U | Pyrimidines |
| S | G, C | Strong interactions (3 H-bonds) |
| W | A, T/U | Weak interactions (2 H-bonds) |
| K | G, T/U | Keto |
| M | A, C | Amino |
| B | C, G, T/U | Not A |
| D | A, G, T/U | Not C |
| H | A, C, T/U | Not G |
| V | A, C, G | Not T/U |
| N | A, C, G, T/U | Any nucleotide (unspecified) |
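The table can also be read in the reverse direction: given the set of bases allowed at a position, look up the single IUPAC letter that encodes it. The following sketch shows this lookup in DNA notation (U is treated as T); the names `CODE_TO_BASES`, `BASES_TO_CODE` and `code_for` are hypothetical.

```python
# IUPAC nucleotide code in DNA notation, transcribed from the table above.
CODE_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse mapping: set of allowed bases -> single-letter code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in CODE_TO_BASES.items()}


def code_for(bases):
    """Return the IUPAC letter for an iterable of allowed bases (U is read as T)."""
    normalized = frozenset(b.upper().replace("U", "T") for b in bases)
    return BASES_TO_CODE[normalized]


print(code_for("AG"))                   # R
print(code_for({"a", "c", "g", "u"}))   # N
```

Every non-empty subset of the four bases has exactly one code, which is why the reverse mapping contains 15 entries.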