Glossary

SAPIENs

SAPIENs (Sequence-Activity Prediction In Ensemble of Networks) is an ensemble of ten residual neural networks that forms the core prediction model underlying rbsXpress. It was trained on approximately 250,000 experimentally tested RBS sequences. As input, it takes up to 17 nt upstream of the AUG start codon, and it predicts the relative translation rate (rTR) for each query with high accuracy (R²: 0.93, mean absolute error: 0.039). Furthermore, it provides an estimate of the confidence of each prediction. More information on the model as well as the underlying data generation can be found here.

Please note that the current version does not consider sequence features outside the 17 nt upstream of AUG, which will be addressed in newer versions.
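
As an illustration of the input window described above, the following minimal Python sketch extracts up to 17 nt directly upstream of an AUG start codon. The function name upstream_window and the example sequence are hypothetical and not part of rbsXpress; how rbsXpress itself prepares its input is not specified here.

```python
def upstream_window(mrna: str, start_index: int, max_len: int = 17) -> str:
    """Return up to max_len nucleotides directly upstream of the start codon.

    mrna        -- mRNA sequence written 5' to 3'
    start_index -- 0-based position of the 'A' of the AUG start codon
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA notation as well
    if mrna[start_index:start_index + 3] != "AUG":
        raise ValueError("no AUG start codon at start_index")
    # Shorter 5'-UTRs simply yield a shorter window.
    return mrna[max(0, start_index - max_len):start_index]


# Hypothetical example sequence, for illustration only.
example = "GGGAAAUUUACUAGAGAAAGAGGAGAAAUACUAGAUGGCUAGC"
window = upstream_window(example, example.index("AUG"))
print(window, len(window))
```

Sequence features outside this window (e.g. the CDS) are ignored by such a query, in line with the note above.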


rTR (a.u.)

The rTR (relative translation rate; a.u.) is the proxy for the relative strength of RBSs as predicted by rbsXpress. It corresponds to the linear slope of cell-specific protein accumulation within an interval of 290 minutes after induction. The rTR is normalized to a scale ranging from 0 (“very weak”) to 100,000 (“very strong”) a.u. for convenience. More details can be found here.
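
To make the idea of a "linear slope" concrete, the sketch below fits an ordinary least-squares line to a time course and rescales a set of slopes to the 0 to 100,000 range. The function names, the example data, and in particular the min-max rescaling are assumptions for illustration; the normalization actually used for the rTR is not described in this glossary.

```python
def linear_slope(times_min, signal):
    """Ordinary least-squares slope of signal versus time."""
    n = len(times_min)
    if n != len(signal) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_t = sum(times_min) / n
    mean_s = sum(signal) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(times_min, signal))
    var = sum((t - mean_t) ** 2 for t in times_min)
    return cov / var


def rescale(slopes, top=100_000):
    """Min-max scale slopes to 0..top (ASSUMED normalization, see lead-in)."""
    lo, hi = min(slopes), max(slopes)
    if hi == lo:
        raise ValueError("all slopes are identical")
    return [top * (s - lo) / (hi - lo) for s in slopes]


# Hypothetical time points (minutes after induction) and signals.
times = [0, 58, 116, 174, 232, 290]
weak = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
strong = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
slopes = [linear_slope(times, weak), linear_slope(times, strong)]
print(rescale(slopes))
```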


Uncertainty

Besides the rTR, rbsXpress reports an estimate of the uncertainty of each prediction, ranging from 0 to 100%. The uncertainty is calculated from the deviation within the ensemble of ten neural networks that make up the core predictor of rbsXpress (see also: SAPIENs). It is a secondary parameter that can help the user select RBS sequences for which the model is “most sure” about its predictions (the lower the percentage, the better).
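
The general principle of deriving an uncertainty from ensemble disagreement can be sketched as follows. This is not the formula used by rbsXpress, which is not given here: the function summarize_ensemble is hypothetical, and expressing the spread as a relative standard deviation capped at 100% is an assumption made purely for illustration.

```python
from statistics import mean, pstdev


def summarize_ensemble(predictions):
    """Return (mean prediction, relative spread in percent).

    predictions -- one predicted value per ensemble member
    The percentage is an ASSUMED measure (standard deviation relative
    to the mean, capped at 100), not the rbsXpress definition.
    """
    center = mean(predictions)
    spread = pstdev(predictions)
    if center == 0:
        relative = 0.0
    else:
        relative = min(100.0, 100.0 * spread / abs(center))
    return center, relative


# Hypothetical outputs of ten ensemble members for one RBS query.
members = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50, 0.49, 0.51]
print(summarize_ensemble(members))
```

Members that agree closely give a low percentage, while strongly diverging members give a high one.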


RedLibs

RedLibs (short for reduced libraries) is an algorithm developed to generate genetic variant libraries that uniformly span a desired range of a numerical functional/phenotypic property. RedLibs was initially developed to take RBS prediction data as input, which are commonly strongly skewed towards weak RBSs (i.e. low rTRs). To reach a better distribution and hence a more efficient search for optimal expression levels, RedLibs generates subsets of RBSs from the skewed input library that have the following three key characteristics. First, they are encoded by a single degenerate sequence, allowing for facile cloning and experimental implementation of the library with conventional, cheap oligos. Second, they have a user-defined size (i.e. number of RBS variants) smaller than that of the input library, which ensures that they can be screened efficiently even at low experimental throughput. Third, they uniformly span the entire accessible expression level/rTR range, avoiding the skew and redundancy of randomly produced libraries such as the input.

Note that RedLibs accepts not only RBS prediction data but any data composed of pairs of sequences and numerical values as input. More details on RedLibs can be found in this publication, which also provides examples and protocols for RBS library cloning and testing.


RBS

The ribosome binding site (RBS) is a sequence upstream of the protein-coding sequence in prokaryotic mRNAs and therefore part of the 5’-UTR. It contains the well-known Shine-Dalgarno (SD) sequence and controls the rate-limiting initiation step of translation and thus the resulting protein level in bacterial cells. Modifying the RBS sequence (i.e. RBS engineering) combines several attractive features, such as access to orders-of-magnitude changes in protein levels, the relative adjustment of protein levels in polycistronic mRNAs and, most importantly, the possibility to predict RBS “strength” in silico using RBS sequences as queries. These features make RBSs an excellent target for expression level engineering in Synthetic Biology and Metabolic Engineering. For more information about RBSs and the optimization of multi-protein systems, please have a look here.


Uniformity

The uniformity is a score indicative of the quality of libraries designed by RedLibs. It can reach a maximum of 100%, which corresponds to a library that perfectly matches a uniform target distribution. Generally, libraries with values above 80% can be considered a good match to the uniform target distribution. For more details on how RedLibs evaluates and compares the distribution of different libraries, please refer to the corresponding publication.
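
One way to picture such a score is to compare the empirical distribution of a library's rTR values with a uniform distribution over the target range. The sketch below uses the Kolmogorov-Smirnov distance for this comparison and maps it to a percentage as 100 x (1 - distance). Both choices are assumptions for illustration; the exact score definition is given in the RedLibs publication and may differ.

```python
def ks_distance_to_uniform(values, low, high):
    """Kolmogorov-Smirnov distance between the empirical CDF of values
    and the CDF of a uniform distribution on [low, high]."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Uniform CDF at x, clipped to [0, 1].
        u = min(1.0, max(0.0, (x - low) / (high - low)))
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, abs((i + 1) / n - u), abs(u - i / n))
    return d


def uniformity_percent(values, low, high):
    """ASSUMED mapping of the KS distance to a 0-100% score."""
    return 100.0 * (1.0 - ks_distance_to_uniform(values, low, high))


# Hypothetical libraries on a 0..1 scale.
even = [0.125, 0.375, 0.625, 0.875]
skewed = [0.01, 0.02, 0.03, 0.90]
print(uniformity_percent(even, 0.0, 1.0))
print(uniformity_percent(skewed, 0.0, 1.0))
```

In this sketch the evenly spread library scores higher than the one clustered at the weak end.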


5’-UTR

The 5’-untranslated region (5’-UTR) is the part of the mRNA preceding the protein-coding sequence (CDS), i.e. reaching from the transcription start site (TSS) to the nucleotide immediately before the start codon (for polycistronic mRNAs, also the part between two CDSs).


CDS

The protein-coding sequence (CDS) contains the primary sequence information that is translated into a polypeptide by the ribosome. It starts with a start codon (mainly AUG, in prokaryotes also GUG and CUG) and ends with a stop codon. While not part of the RBS itself, the CDS is known to impact translation initiation, mainly through mRNA secondary structures. Thus, RBSs are context-dependent, which makes it necessary to optimize RBSs in a case-specific fashion. More details on sequence features that impact RBS behaviour, such as the CDS, can be found here.


Degenerate RBS/sequence

A degenerate RNA or DNA sequence is one that contains at least one ambiguous/variable nucleotide (= “degenerate nucleotide/base”). Degenerate bases can be designated using the IUPAC nucleotide code. As an example, the degenerate sequence AGN corresponds to one (or a mix) of the sequences AGA, AGC, AGG, AGU.
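
The AGN example can be reproduced with a short Python sketch that enumerates all sequences encoded by a degenerate sequence. The names IUPAC and expand are chosen for illustration; sequences are written in RNA notation, and T is treated as U.

```python
from itertools import product

# IUPAC nucleotide code, RNA notation (T is mapped to U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def expand(degenerate):
    """List every concrete sequence encoded by a degenerate sequence."""
    pools = [IUPAC[base] for base in degenerate.upper()]
    return ["".join(combo) for combo in product(*pools)]


print(expand("AGN"))  # ['AGA', 'AGC', 'AGG', 'AGU']
```

The number of encoded sequences is the product of the number of options at each position, so it grows quickly with every additional degenerate base.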


IUPAC nucleotide code

The IUPAC (International Union of Pure and Applied Chemistry) nucleotide code is a standardized system to represent nucleotide sequences with ambiguity. Besides the single-letter codes for adenine (A), cytosine (C), guanine (G), and thymine (T) or uracil (U), it provides codes for variable nucleotides. The full code is provided below:

Code  Base(s)        Description
A     A              Adenine
C     C              Cytosine
G     G              Guanine
T     T              Thymine (in DNA)
U     U              Uracil (in RNA)
R     A, G           Purines
Y     C, T/U         Pyrimidines
S     G, C           Strong interactions (3 H-bonds)
W     A, T/U         Weak interactions (2 H-bonds)
K     G, T/U         Keto
M     A, C           Amino
B     C, G, T/U      Not A
D     A, G, T/U      Not C
H     A, C, T/U      Not G
V     A, C, G        Not T/U
N     A, C, G, T/U   Any nucleotide (unspecified)
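
The table can also be read in reverse: given a set of aligned sequences, each position can be collapsed into the single IUPAC code that covers all bases observed there. The following Python sketch shows this, together with the number of variants a degenerate sequence encodes. The names CODE_TO_BASES, library_size and to_degenerate are hypothetical, and RNA notation (U instead of T) is assumed.

```python
from math import prod

# IUPAC code in RNA notation: code -> bases it stands for.
CODE_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}
# Reverse lookup: set of bases -> code.
BASES_TO_CODE = {frozenset(b): c for c, b in CODE_TO_BASES.items()}


def library_size(degenerate):
    """Number of concrete sequences encoded by a degenerate sequence."""
    seq = degenerate.upper().replace("T", "U")
    return prod(len(CODE_TO_BASES[code]) for code in seq)


def to_degenerate(sequences):
    """Collapse equally long sequences into one degenerate sequence."""
    seqs = [s.upper().replace("T", "U") for s in sequences]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same length")
    return "".join(BASES_TO_CODE[frozenset(col)] for col in zip(*seqs))


print(to_degenerate(["AGA", "AGG"]))            # AGR
print(library_size(to_degenerate(["AA", "CC"])))  # 4
```

Note that the resulting degenerate sequence can encode more variants than were put in: AA and CC collapse to MM, which also covers AC and CA.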
