Warning

Matteo just upgraded both Bootstrap and Font Awesome, and did a major change to the Mako templating so bugginess is a given.

Pedel comes in two flavours.

Choose:

Pedel

Programme for Estimating Diversity in Error-prone PCR Libraries

Description

Given a library of L sequences, comprising variants of a sequence of N nucleotides, into which random point mutations have been introduced, we wish to calculate the expected number of distinct sequences in the library. (Typically assuming L > 10, N > 5, and the mean number of mutations per sequence m < 0.1 x N).

Example ( show)

Saab-Rincon et al (2001, Protein Eng., 14, 149-155) constructed a library of 5 million clones with a single round of epPCR on a 700 bp gene. Sequencing 10 of these, indicated an error rate of 3-4 nucleotide substitutions per daughter sequence. Entering L = 5000000, N = 700 and m = 3.5 into the base PEDEL sever page, and clicking 'Calculate', shows that the expected number of distinct sequences in the library is 4.153 x 10^6, or about 4.2 million.

If you follow the link to 'detailed statistics' and, once again, enter L = 5000000, N = 700 and m = 3.5 and click 'Calculate', you get a breakdown of library statistics for each of the sub-libraries comprising all those daughter sequences with exactly x base substitutions (x = 0, 1, 2, 3, ...).

For example the first line of the table shows that Px = 3.02% of the library (i.e. Lx = 1.51 x 10^5 daughter sequences) have x = 0 base substitutions (i.e. they are identical to the parent sequence). The total number of possible variants with 0 base substitutions is, of course, Vx = 1 (just the parent sequence) and the total number of distinct sequences with 0 base substitutions present in the library is, similarly, Cx = 1. The completeness of the x = 0 sub-library is Cx/Vx = 100%. The redundancy of this sub-library - i.e. wasted duplication - is Lx-Cx = 1.51 x 10^5.

You also have the option to plot this data by following the 'Plot this data' link. Choose the statistic to plot and whether or not to use a logscale on the y-axis. For example, a plot of Px or Lx gives a Poisson distribution. A plot of Vx shows how the number of possible variants increases very rapidly as the number of base substitutions is increased. A plot of Cx shows how the expected number of distinct sequences in the sub-libraries initially increases - limited by the number of possible variants, Vx - and then decreases - limited by the size of the sub-library, Lx. A plot of Lx-Cx shows the extent of wasted duplication in the lower x-value sub-libraries.

Returning to the base PEDEL server page, you can follow links to plot the expected number of distinct sequences in a library for a range of mutation rates, library sizes or sequence lengths. The third option probably won't be very useful, but the first two will help you to decide what library size to aim for in order to obtain a given diversity, and what mutation rate to use to maximize the diversity for a given library size.

For example, follow the 'mutation rates' link, enter L = 5000000, N = 700 and m = 0.2 - 20, and click 'Calculate'. From the plot, you can see that the expected number of distinct sequences increases rapidly with m until m ~ 5, and then levels off with < 10% redundancy in the library. On the other hand, if you chose m ~ 1.5, then the library would be about 60% redundant. After selecting an optimal mutation rate m, you can go back to the 'detailed statistics' page to check the expected completeness of the x = 0, 1, 2, 3, ... sub-libraries.

Caveats ( show)

PEDEL uses a generic Poisson model of sequence mutations. There are a couple of simplifications that you should be aware of:

  • All base substitution are assumed equally likely. In reality, under error-prone conditions, the polymerase favours some substitutions over others. This has the effect of reducing the expected number of distinct sequences compared with the PEDEL predictions. This is in fact not as big an issue as you might expect. Using the notation from the 'detailed statistics' page (see link on base PEDEL server page), this is not an issue when the number of possible variants Vx is much greater than the sub-library size Lx (i.e. large x values), since here there are so many possible variants that there is little duplication within the sub-library even if there is strong bias. Conversely, if Lx is much greater than Vx (i.e. small x values) then, unless the bias is very strong, nearly all the possible variants will still be sampled. Note that it is now possible, by using sequential PCR amplifications with two different polymerases that have opposite substitution biases, to produce unbiased libraries.
  • Inherent to the PCR process used to produce epPCR libraries, is amplification bias: any mutation introduced in an early PCR cycle, will be present in a significant fraction of the final library. In practice, researchers use a variety of techniques to reduce amplification bias - e.g. reduce the number of epPCR cycles and combine a number of individual libraries. For example, one might start with 10^9 identical parent sequences; amplify them in an epPCR to 10^15 sequences; and, after ligation and transformation of E. coli, end up with a library of 10^7 sequences. Any amplification bias would have a maximum frequency of only 1 in 10^9 so would not show up in the final library.
  • During the PCR cycles, different parent sequences may be amplified a different number of times. However, empirically, the end result is a library with a Poisson distribution of mutations (e.g. Cadwell R.C., Joyce G.F., 1992, Randomization of genes by PCR mutagenesis, PCR Methods Appl., 2, 28-33). But see also this note .
  • Any biases in library construction will decrease the actual number of distinct variants represented in the library. In such cases, PEDEL provides the user with a useful upper bound on the diversity present in the library.
Please refer to Patrick W. M., Firth A. E., Blackburn J.M., 2003, User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries, Protein Eng., 16, 451-457 for further discussion of PEDEL.

A good review of the sources of bias in epPCR (and other directed evolution protocols) can be found in Neylon C., 2004, Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution, Nucleic Acids Res., 32, 1448-1459.

Input

Library size

Seq. length
nt

Load
Mutations/seq


PCR cycles
PCR efficiency


PedelAA

PedelAA

Description

Short paragraph goes here.

Overview ( show)

PEDEL-AA is an extension to amino acid sequences of the original nucleotide version of PEDEL (see links to publications and mathematics notes from the statistics home page). Due to the more complex problem of estimating diversity and completeness at the amino acid level (as opposed to nucleotides and codons), there are some major differences in the algorithms and a few extra approximations. A brief description of the procedure follows.

First, to recapitulate the nucleotide version of PEDEL: As discussed in the mathematics notes, for the nucleotide version, if the input library is conceptually divided into sub-libraries Lx (x=0,1,2,...) where the sub-library Lx comprises all variants in the library with exactly x nucleotide substitutions, then the PEDEL scenario divides into two regions:

  1. where Lx is large compared with Vx (the number of possible variants with exactly x substitutions) then Cx (the expected number of distinct sequences in Lx) is approximately equal to Vx, and
  2. where Lx is small compared with Vx, then Cx is approximately equal to Lx.

In the transition region (where Lx ~ Vx) we can calculate Cx with the formula Cx ~ Vx(1-exp(-Lx/Vx)). This is based on the assumption that all variants in Vx are equiprobable, so the mean number of occurrences in the sub-library Lx of each variant in Vx is Lx/Vx and, assuming Poisson statistics, the probability that any given variant is present in the sub-library is 1-exp(-Lx/Vx), so the expected number of distinct variants present in the sub-library is Cx ~ Vx(1-exp(-Lx/Vx)).

In the more complex scenario presented in PEDEL-AA, the assumption of equiprobable variants breaks down for two reasons: (i) we have introduced a full 4 x 4 nucleotide substitution matrix (in particular the transition:transversion ratio is not assumed to be unity), and (ii) even if nucleotide substitutions were equiprobable, the corresponding amino acid substitutions are not. However we may still borrow some concepts from the equiprobable nucleotide version of PEDEL - namely, (1) when Lx is small compared with Vx, then Cx is approximately equal to Lx, and (2) when Lx is large compared with Vx then Cx is approximately equal to Vx. However these concepts need some refining, as follows.

  1. Instead of x counting the number of nucleotide substitutions, we now let x count the number of amino acid substitutions (i.e. non-synonymous, non-stop codon, substitutions). Note that the substitution rate on the PEDEL-AA input form is the mean number of nucleotide substitutions, nsubst, per variant but, using this, the programme calculates the expected mean number of nonsynonymous amino acid substitutions per variant. The probability distribution, P(x_nt), of the number of nucleotide substitutions, x_nt, per variant may be assumed to follow the PCR distribution (with mean nsubst, and the extra parameters ncycles and eff, respectively the number of PCR cycles and the PCR efficiency). Separate statistics are also calculated assuming that the number of nucleotide substitutions per variant follows a Poisson distribution (with the single parameter nsubst), and these results may be used if the PCR ncycles and eff parameters are unknown.
  2. The probabilities of variants being truncated (i.e. containing introduced stop codons) are then subtracted from the P(x_nt) distributions. Clearly this is an increasing function of x_nt and, by x_nt = 100, typically less than 0.6% of variants are stop-codon free. Note that the P(x_nt) will no longer sum up to unity; instead (after discarding indel-containing variants) they sum up to L_eff / L_tot, where L_eff is the 'effective' library size (i.e. the number of variants with no indels or stop codons) and L_tot is the total library size (albeit again excluding variants with indels).

    Next, the Poisson and PCR P(x_nt) distributions are redistributed into amino acid P(x_aa) distributions. First the mean number, frac, of nonsynonymous amino acid substitutions per nucleotide substitution (given that the nucleotide substitution doesn't produce a stop codon) is calculated. Typically frac ~ 2/3. For each x_nt, the number of nonsynonymous amino acid substitutions resulting from exactly x_nt nucleotide substitutions is assumed to follow a binomial distribution, B(x_nt,frac) (i.e. x_nt 'trials'; probability of 'success' per 'trial' = frac). Summing up the binomial distributions, each weighted by P(x_nt), for x_nt = 0,1,2,...,100 gives the amino acid Poisson and PCR P(x_aa) distributions. Of course, Sum_{x_nt} P(x_nt) = Sum_{x_aa} P(x_aa) = L_eff / L_tot. The Poisson and PCR amino acid sub-library sizes, Lx, are given by P(x_aa) x L_tot.

    All these estimates rely on the mean number of nucleotide substitutions per variant, nsubst, being relatively small compared with the number of codons in the sequence, so that multiple substitutions in the same codon are not very common. In practice, we limit nsubst <= 0.1 x input sequence length (in nucleotides). In fact, for the Poisson case, we can calculate L0, L1 and L2 exactly (a sum over all possible variants with exactly 0, 1 or 2 amino acid substitutions, multiplied by their probabilities given by the input nucleotide substitution matrix and nsubst, multiplied by L_tot). These calculations agree very well with the 'sum of binomial distributions' method. For example, for the library presented in Volles & Lansbury (2005), we have

                    'exact'       'binomial sum'
                    L0     3.763e+05         3.861e+05
                    L1     8.174e+05         8.205e+05
                    L2     8.795e+05         8.717e+05
                    
  3. Two estimates of Vx are calculated. The first, Vx_1, is an estimate of the number of 'common' variants with exactly x amino acid substitutions - namely those variants where each substituted amino acid is accessible by just a single nucleotide substitution in the respective codon. The second, Vx_2, is the total number of possible amino acid variants with exactly x amino acid substitutions. Although most variants in a sub-library will be of type Vx_1, variants of type Vx_2 may contribute significantly to the total number of distinct variants Cx when the sub-library size Lx is large compared with the number of variants Vx_1. When the sub-library size Lx is small compared with the number of variants Vx_1, nearly every variant in Lx is distinct (i.e. Cx ~ Lx) and it doesn't matter whether these variants are of type Vx_1 or Vx_2.
    • Vx_2 is given by C(N,x) 19^x, where C(N,x) is the combinatorial function, i.e. C(N,x) = N!/[x!(N-x)!], and N is the length of the input sequence in codons. (Cf equation (5) of the mathematical notes for the nucleotide version of PEDEL.)
    • We estimate Vx_1 with the formula C(N,x) A^x, where A is the mean number of non-synonymous amino acid substitutions accessible by a single nucleotide substitution, for the input sequence (the value of A is given on the results page).
    Calculation of A: For each codon, i, in the parent sequence, the number of non-synonymous, non-stop codon, amino acid substitutions Ai, i=1,...,N (N = parent sequence length in codons) accessible by a single nucleotide substitution is calculated. Such substitutions are accessible with similar probability (albeit varying by a factor of ~3 for a typical transition:transversion ratio) and much higher probability than amino acid substitutions only accessible via 2 or 3 nucleotide substitutions in the same codon (except in the case of a short parent sequence and/or high mutation rate). The Ai are averaged (geometric mean) over the sequence to give the parameter A = (A1 x A2 x ... x An)^(1/N). A is typically around 5.8.
  4. When Lx << Vx then Cx ~ Lx. For equiprobable variants this approximation is good to 5% for Lx < 0.1 Vx (see the mathematical notes on the nucleotide version of PEDEL). For PEDEL-AA, we use this approximation when Lx < 0.1 Vx_1 (the number of 'easy-to-reach' variants) or more rigorously, and optionally, when Rx < 0.1, where Rx is the mean frequency of the most common variant in the sub-library Lx (see note on Rx statistic for details). Lx < 0.1 Vx_1 is usually true for x >= 3 and almost always true for x >= 4. For example for N >= 40 codons, V3_1 >= ~5.8^3 x C(40,3) = 1.9 x 10^6 and V4_1 >= ~5.8^4 x C(40,4) = 1.0 x 10^8, while for N >= 100 codons, V3_1 >= ~5.8^3 x C(100,3) = 3.1 x 10^7 and V4_1 >= ~5.8^4 x C(100,4) = 4.4 x 10^9; and remember these Vx_1 sizes only need to be compared with the relevant sub-library size L3 or L4, not the whole library size. In the Cx ~ Lx region, whether a particular variant is of type Vx_1 or type Vx_2 is irrelevant - either way it will (almost) always be a distinct variant in the library.
  5. For x = 0, 1 and 2, we calculate the expected number of distinct variants, Cx, precisely. This calculation includes variants with multiple nucleotide substitutions in the same codon (i.e. both Vx_1 and Vx_2). The total number of each of the 64 codon types in the input sequence is calculated. The 64 x 20 matrix of probabilities for each codon type mutating to each amino acid type is calculated using the input nucleotide substitution matrix and the input substitution rate. For x = 0, 1 and 2 there are, respectively Vx_2 = 1, 19N and 361N(N-1)/2 total possible variants (i.e. N!/[x!(N-x)!] 19^x, where N is the length of the input sequence in codons). The probability of the input sequence mutating to each of these possible variants is calculated and renormalized by the respective probability sum P0, P1 or P2 (where Px = Sum_{v_i in Vx_2} P(v_i)) to give the normalized probabilities Pn(v_i) of the different variants within the respective sub-libraries Lx, rather than within the whole library. The probability of a particular variant v_i being present in the relevant sub-library Lx is given by 1 - exp(-Pn(v_i) x Lx). These probabilities are quickly summed over all possible variants using the codon counts. Computationally, this is very fast for x = 0, 1 and 2, but can take a few minutes for x = 3; hence the 'exact' calculation is not used on the webserver for x >= 3. The sizes of the sub-libraries Lx are determined separately for the Poisson and PCR distributions as described above, thus resulting in separate Cx estimates for the different distributions.
  6. Ideally, for x >= 3, we will enter the Cx ~ Lx region. In this case all the individual Cx estimates, and the estimated total number of distinct variants in the library C = C0 + C1 + C2 + ..., will be fairly good. A warning is printed in the 'notes' column of the output table of sub-library statistics if there are any x >= 3 values for which the Cx ~ Lx approximation may not apply, in which case Cx is estimated with the formula Cx ~ Vx_1(1-exp(-Lx/Vx_1)) (i.e. ignoring, in these particular sub-libraries, any variants of type Vx_2). This formula is not very accurate and may result in an overestimate (because the Vx_1 are not equiprobable - the higher probability amino acid substitutions [e.g. those accessible by transitions, or via more than one possible nucleotide substitution] will be over-represented at the expense of other amino acid substitutions) or an underestimate (since the Vx_2 variants are ignored). These effects can be quite a large (e.g. if some Vx_1 substitutions are 3 times more likely than others, and the Lx ~ Vx_1 turnover occurs at x ~ 3, then the most common three-amino acid substitutions will be 3^3 = 27 times more likely than the rarest three-amino acid substitutions).

Note that the introduced stop codon and indel statistics and graphs are exact calculations (based on the input substitution, indel and nucleotide matrix parameters) and do not use any of the above approximations (except Poisson statistics). The above approximations are only used for the library completeness statistics.

Comparison with Volles & Lansbury (2005) Monte Carlo simulation.

Property Volles & Lansbury Firth & Patrick
- - Poisson PCR
Truncations (%) 15 15.6
# Full-length clones 3.1 x 10^6 3.18 x 10^6
Protein mutation freq. per aa 0.016 0.0160
Mean # mutations per protein 2.1 2.12
Unmutated sequences (%)* 14 10.1 14.0
# of unique proteins 1.3 x 10^6 1.32 x 10^6 1.29 x 10^6
# of unique point mutations 1990 1989
# of unique single point mutations 1566 1618 1618

* Relative to the total library size (i.e. including truncated variants).

References

  • Volles M.J., Lansbury P.T. Jr. (2005). A computer program for the estimation of protein and nucleic acid sequence diversity in random point mutagenesis libraries, Nucleic Acids Res. 33, 3667-3677.
  • Note on Rx statistics ( show)

    The 'Lx < 0.1 Vx_1' criterion for deciding when to use the 'Cx ~ Lx' approximation is sometimes inaccurate, and can be refined as follows.

    First consider a single nucleotide substitution in a single codon. There are 9 possible mutated codons. An amino acid mutation that can only be coded by a single codon out of the 9 and that requires a transversion, has only a 1 in 15 probability (assuming a transition:transversion ratio of 3), since if p is the probability of a transversion, then 3p is the probability of a transition, and the total probability of the 9 mutated codons is 6(p) + 3(3p) = 15p.

    For example, if the parent codon is GGG (Gly), then the 9 single-nucleotide-substitution codons are

    codon amino acid relative probability
    AGG Arg 3p
    CGG Arg p
    TGG Trp p
    GAG Glu 3p
    GCG Ala p
    GCG Ala p
    GTG Val p
    GGA Gly 3p
    GGC Gly p
    GGT Gly p

    AA Probabilities given the codon mutates Probabilities given the amino acid mutates
    Gly 5/15 (wild-type)
    Arg 4/15 4/10
    Glu 3/15 3/10
    Trp 1/15 1/10
    Ala 1/15 1/10
    Val 1/15 1/10

    The 'Lx < 0.1 Vx_1' criterion assumes that all of the single-nucleotide-substitution non-synonymous amino acid substitutions are equiprobable - i.e. 1 in 5 in the above example, but in general represented by the reciprocal of the 'A' factor described in the above section, where typically A ~ 5.8; whereas, in fact, the most common single-nucleotide-substitution amino acid substitution (GGG -> Arg) is 4 x as likely as the rarest (GGG -> Trp or Ala or Val). In cases where some nucleotide substitutions (as defined by the 4 x 4 nucleotide substitution matrix) are particularly rare, the probability difference between the rarest and the most common single-nucleotide-substitution amino acid substitutions at a given site can be much greater.

    The 'Lx < 0.1 Vx_1' criterion for being in the 'Cx ~ Lx' region is basically to make sure that there are enough variants in Vx to 'absorb' all Lx sub-library members so that (within a small error) at most one sub-library member is equal to any given variant in Vx. In practice, it doesn't matter what the probability of the rarest variants is. What matters for the 'Cx ~ Lx' approximation is that the mean frequency in Lx of the most common variant is < 0.1. In fact the mean frequency of the most common variant in Lx, which we denote by Rx, is easy to calculate for x = 0, 1, 2, ..., 20, ..., and is shown in the PEDEL-AA output table of sub-library statistics.

    Using these Rx values, the 'Lx < 0.1 Vx_1' criterion would be replaced with the criterion 'Rx < 0.1'. In practice this means that if, in the table of sub-library statistics, there are Rx values > 0.1, for which the 'Cx ~ Lx' approximation has been used (i.e. x >= 3 and Lx < 0.1 Vx_1), then the particular corresponding Cx values may be overestimates. A warning and html link are given in the table of sub-library statistics whenever this occurs.

    Input

    Sequence

    In frame sequence that was mutagenised. Note that all symbols that aren't uppecase ATUGC, will be discarded along with a Fasta header (e.g. '>T. maritima Cystathionine β-lyase'), therefore for masked sequences use lowercase.

    Sequence


    Library size
    nt

    Nucleotide mutation matrix

    (non-negative numbers. Overall scaling is unimportant as this is taken from the 'mean number of substitutions per daughter sequence' parameter.)
    To
    From  
    A T G C
    A
    T
    G
    C



    values
    Load
    mutations/seq

    PCR cycles

    PCR efficiency