Given a library of *L* sequences, comprising variants of a sequence
of *N* nucleotides, into which random point mutations have been introduced,
we wish to calculate the expected number of distinct sequences in the library.
Typical assumptions are *L* > 10, *N* > 5, and a mean number
of mutations per sequence *m* < 0.1 x *N*.

Saab-Rincon et al. (2001, *Protein Eng.*, **14**, 149-155) constructed
a library of 5 million clones with a single round of epPCR on a 700 bp
gene. Sequencing 10 of these indicated an error rate of 3-4 nucleotide
substitutions per daughter sequence. Entering *L* = 5000000, *N* = 700
and *m* = 3.5 into the base PEDEL server page, and clicking 'Calculate',
shows that the expected number of distinct sequences in the library is
4.153 x 10^6, or about 4.2 million.
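
The figure quoted above can be approximately reproduced offline. The following is a minimal sketch, not the server's own code. It assumes a Poisson distribution for the number of substitutions per sequence, **Vx** = C(*N*, *x*) x 3^*x* possible variants with exactly *x* substitutions, and the occupancy estimate **Cx** ≈ **Vx** (1 - exp(-**Lx**/**Vx**)) for equally likely variants. The formula for **Cx** and the function name `expected_distinct` are assumptions made for illustration; neither is stated on this page.

```python
import math

def expected_distinct(L, N, m, tol=1e-12):
    """Expected number of distinct sequences, summed over sub-libraries x = 0, 1, 2, ..."""
    total = 0.0
    for x in range(N + 1):
        # Poisson probability of exactly x substitutions, computed in log space.
        log_px = x * math.log(m) - m - math.lgamma(x + 1)
        lx = L * math.exp(log_px)            # expected sub-library size Lx
        if x > m and lx < tol:
            break                            # remaining Poisson tail is negligible
        # log of Vx = C(N, x) * 3^x (assumed count of possible variants).
        log_vx = (math.lgamma(N + 1) - math.lgamma(x + 1)
                  - math.lgamma(N - x + 1) + x * math.log(3))
        if log_vx > 700:
            cx = lx                          # Vx exceeds float range: no duplication expected
        else:
            vx = math.exp(log_vx)
            cx = -vx * math.expm1(-lx / vx)  # Vx * (1 - exp(-Lx/Vx)), numerically stable
        total += cx
    return total

print(f"{expected_distinct(5_000_000, 700, 3.5):.3e}")
```

Under these assumptions the result should come out close to the 4.153 x 10^6 reported by the server. `math.expm1` is used because a direct `1 - exp(-Lx/Vx)` loses precision when **Vx** is far larger than **Lx**.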

If you follow the link to 'detailed statistics' and, once again,
enter *L* = 5000000, *N* = 700 and *m* = 3.5 and click 'Calculate',
you get a breakdown of library statistics for each of the sub-libraries
comprising all those daughter sequences with exactly *x* base substitutions
(*x* = 0, 1, 2, 3, ...).

For example, the first line of the table shows that **Px** = 3.02% of
the library (i.e. **Lx** = 1.51 x 10^5 daughter sequences) have *x* =
0 base substitutions (i.e. they are identical to the parent sequence).
The total number of possible variants with 0 base substitutions is, of
course, **Vx** = 1 (just the parent sequence) and the total number of
distinct sequences with 0 base substitutions present in the library is,
similarly, **Cx** = 1. The completeness of the *x* = 0 sub-library
is **Cx/Vx** = 100%. The redundancy of this sub-library - i.e. wasted
duplication - is **Lx-Cx** = 1.51 x 10^5.
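
A row of this table can be sketched in a few lines of Python. This is an illustration only: the function name `sublibrary_stats` is hypothetical, and the estimate **Cx** ≈ **Vx** (1 - exp(-**Lx**/**Vx**)), which treats all variants with *x* substitutions as equally likely, is an assumption rather than something stated on this page.

```python
import math

def sublibrary_stats(L, N, m, x):
    """Statistics for the sub-library of sequences with exactly x substitutions."""
    px = math.exp(x * math.log(m) - m - math.lgamma(x + 1))  # Poisson probability Px
    lx = L * px                                              # expected sub-library size Lx
    vx = math.comb(N, x) * 3 ** x                            # possible variants Vx (exact integer)
    try:
        v = float(vx)
        cx = -v * math.expm1(-lx / v)                        # assumed estimate of Cx
        completeness = cx / v
    except OverflowError:
        # Vx too large for a float: essentially every sequence is distinct.
        cx, completeness = lx, 0.0
    return {"Px": px, "Lx": lx, "Vx": vx, "Cx": cx,
            "Cx/Vx": completeness, "Lx-Cx": lx - cx}

# Print the first few rows for the worked example (L = 5000000, N = 700, m = 3.5).
for x in range(6):
    s = sublibrary_stats(5_000_000, 700, 3.5, x)
    print(x, f"{s['Px']:.2%}", f"{s['Lx']:.3e}", f"{float(s['Vx']):.3e}",
          f"{s['Cx']:.3e}", f"{s['Cx/Vx']:.2%}", f"{s['Lx-Cx']:.3e}")
```

For *x* = 0 this gives **Vx** = 1 and a completeness of 100%, in line with the first row described above.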

You also have the option to plot this data by following the 'Plot
this data' link. Choose the statistic to plot and whether or not to
use a log scale on the y-axis. For example, a plot of **Px** or **Lx**
gives a Poisson distribution. A plot of **Vx** shows how the number of possible
variants increases very rapidly as the number of base substitutions is
increased. A plot of **Cx** shows how the expected number of distinct
sequences in the sub-libraries initially increases - limited by the number
of possible variants, **Vx** - and then decreases - limited by the size
of the sub-library, **Lx**. A plot of **Lx-Cx** shows the extent of
wasted duplication in the lower *x*-value sub-libraries.

Returning to the base PEDEL server page, you can follow links to plot the expected number of distinct sequences in a library for a range of mutation rates, library sizes or sequence lengths. The third option probably won't be very useful, but the first two will help you to decide what library size to aim for in order to obtain a given diversity, and what mutation rate to use to maximize the diversity for a given library size.

For example, follow the 'mutation rates' link, enter *L* =
5000000, *N* = 700 and *m* = 0.2 - 20, and click 'Calculate'.
From the plot, you can see that the expected number of distinct sequences
increases rapidly with *m* until *m* ~ 5, and then levels off with
< 10% redundancy in the library. On the other hand, if you chose *m* ~
1.5, then the library would be about 60% redundant. After selecting an
optimal mutation rate *m*, you can go back to the 'detailed statistics'
page to check the expected completeness of the *x* = 0, 1, 2, 3, ...
sub-libraries.
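
The mutation-rate scan can also be approximated offline. The sketch below is not the server's implementation: it assumes a Poisson model, **Vx** = C(*N*, *x*) x 3^*x*, and the estimate **Cx** ≈ **Vx** (1 - exp(-**Lx**/**Vx**)) for equally likely variants. The function names `expected_distinct` and `redundancy` are hypothetical, and redundancy is taken here to mean the fraction of the library that duplicates a sequence already present.

```python
import math

def expected_distinct(L, N, m):
    """Poisson-model estimate of the number of distinct sequences in the library."""
    total = 0.0
    for x in range(N + 1):
        lx = L * math.exp(x * math.log(m) - m - math.lgamma(x + 1))  # Lx = L * Px
        if x > m and lx < 1e-9:
            break                                    # Poisson tail no longer contributes
        log_vx = (math.lgamma(N + 1) - math.lgamma(x + 1)
                  - math.lgamma(N - x + 1) + x * math.log(3))        # log of Vx
        if log_vx > 700:
            total += lx                              # Vx overflows a float; duplicates negligible
        else:
            vx = math.exp(log_vx)
            total += -vx * math.expm1(-lx / vx)      # Vx * (1 - exp(-Lx/Vx))
    return total

def redundancy(L, N, m):
    """Fraction of the library that duplicates a sequence already present."""
    return 1.0 - expected_distinct(L, N, m) / L

# Scan a few mutation rates for the worked example (L = 5000000, N = 700).
for m in (0.2, 0.5, 1.5, 3.5, 5.0, 10.0, 20.0):
    c = expected_distinct(5_000_000, 700, m)
    r = redundancy(5_000_000, 700, m)
    print(f"m = {m:5.1f}  distinct = {c:.3e}  redundancy = {r:.1%}")
```

Under these assumptions the scan should show the same qualitative behaviour as the server's plot: high redundancy at low *m*, falling below 10% by about *m* = 5.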

PEDEL uses a generic Poisson model of sequence mutations. There are a couple of simplifications that you should be aware of:

- All base substitutions are assumed equally likely. In reality, under error-prone
  conditions, the polymerase favours some substitutions over others. This
  has the effect of reducing the expected number of distinct sequences compared
  with the PEDEL predictions. This is in fact not as big an issue as you
  might expect. Using the notation from the 'detailed statistics'
  page (see link on base PEDEL server page), this is not an issue when the
  number of possible variants **Vx** is much greater than the sub-library size
  **Lx** (i.e. large *x* values), since here there are so many possible variants
  that there is little duplication within the sub-library even if there is
  strong bias. Conversely, if **Lx** is much greater than **Vx** (i.e. small *x*
  values) then, unless the bias is very strong, nearly all the possible variants
  will still be sampled. Note that it is now possible, by using sequential PCR
  amplifications with two different polymerases that have opposite substitution
  biases, to produce unbiased libraries.
- Amplification bias is inherent to the PCR process used to produce epPCR
  libraries: any mutation introduced in an early PCR cycle will be present in
  a significant fraction of the final library. In practice, researchers use
  a variety of techniques to reduce amplification bias - e.g. reduce the
  number of epPCR cycles and combine a number of individual libraries. For
  example, one might start with 10^9 identical parent sequences; amplify
  them in an epPCR to 10^15 sequences; and, after ligation and transformation
  of *E. coli*, end up with a library of 10^7 sequences. Any amplification bias
  would have a maximum frequency of only 1 in 10^9, so would not show up in the
  final library.
- During the PCR cycles, different parent sequences may be amplified a different
  number of times. However, empirically, the end result is a library with
  a Poisson distribution of mutations (e.g. Cadwell R.C., Joyce G.F., 1992,
  Randomization of genes by PCR mutagenesis,
  *PCR Methods Appl.*, **2**, 28-33). **But see also this note.**
- Any biases in library construction will decrease the actual number of distinct
  variants represented in the library. In such cases, PEDEL provides the user
  with a useful upper bound on the diversity present in the library.

A good review of the sources of bias in epPCR (and other directed evolution
protocols) can be found in Neylon C., 2004, Chemical and biochemical strategies
for the randomization of protein encoding DNA sequences: library construction
methods for directed evolution, *Nucleic Acids Res.*, **32**, 1448-1459.
