Diversity Resulting from In Vitro Recombination


Given a library of L sequences generated by random recombination of two near-identical genes that differ at only a small number of known nucleotide (or amino acid) positions, we wish to calculate the expected number of distinct sequences in the library. The calculation typically assumes that the mean number of crossovers per sequence, m, is less than 0.1 × the sequence length N.

Example

Raillard et al. (2001, Chem. Biol., 8, 891-898) used DNA shuffling to recombine two bacterial triazine hydrolase genes (atzA and triA, GenBank accession numbers U55933 and AF312304, respectively). The N = 1425 nt genes differ at nine nucleotide positions: 250, 274, 375, 650, 655, 757, 763, 982 and 991. They screened a library of L = 1600 shuffled variants. They state that 'every variant sequenced had undergone at least one and as many as four recombination events'. Thus we estimate that the mean number of observable crossovers per daughter sequence is around m = 2. The underlying true number of crossovers per daughter sequence is unknown, for the reasons discussed below.

Note that, experimentally, crossovers are only observable if they produce a distinct daughter sequence. One crossover between a consecutive pair of variable positions produces the same daughter sequence as 3, 5, 7, ... crossovers between that pair. Similarly, 2, 4, 6, ... crossovers produce the same daughter sequence as no crossovers at all. In addition, crossovers occurring between either end of the sequence and the nearest variable position cannot be detected by analysis of the daughter sequence.

Typically you will find the mean number of observable crossovers by sequence analysis of a sample of daughter sequences. Nonetheless, the underlying true number of crossovers is also an important statistic, especially when you want to vary the crossover rate by adjusting your recombination protocol.

In DRIVeR, you can choose to enter either the mean number of observable crossovers or the true mean number of all crossovers. Either way, DRIVeR will also calculate and tell you the other statistic.
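The link between the two statistics can be sketched under a simplified Poisson model. The sketch below assumes that crossovers fall uniformly along the sequence at a rate of (true mean)/N per nucleotide, so that a gap of d nucleotides between consecutive variable positions holds an odd (and hence observable) number of crossovers with probability (1 - exp(-2 × rate × d))/2. The function names are illustrative, and DRIVeR's own implementation may differ in detail, for example in how it treats the position immediately following a variable site.

```python
import math

# Variable positions and gene length from the Raillard et al. example.
POSITIONS = [250, 274, 375, 650, 655, 757, 763, 982, 991]
SEQ_LEN = 1425


def observable_mean(true_mean, positions=POSITIONS, seq_len=SEQ_LEN):
    """Mean number of observable crossovers for a given true mean.

    Assumes a Poisson number of crossovers in each gap with a uniform
    per-nucleotide rate (an assumption of this sketch).
    """
    rate = true_mean / seq_len
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    # A gap contributes one observable crossover when it holds an odd number.
    return sum(0.5 * (1.0 - math.exp(-2.0 * rate * gap)) for gap in gaps)


def true_mean_from_observable(target, positions=POSITIONS, seq_len=SEQ_LEN):
    """Invert observable_mean by bisection (it increases with true_mean)."""
    ceiling = (len(positions) - 1) / 2.0
    if not 0.0 <= target < ceiling:
        raise ValueError("observable mean must lie in [0, (M-1)/2)")
    lo, hi = 0.0, 1.0
    # Expand the bracket until it contains the target.
    while observable_mean(hi, positions, seq_len) < target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if observable_mean(mid, positions, seq_len) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(observable_mean(10.0))
    print(true_mean_from_observable(2.0))
```

Under these assumptions, a true mean of about 10 crossovers corresponds to an observable mean close to 2 for the example positions, in line with the figures quoted for this example.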

Note also the following:

  • The maximum observable crossover rate is (M-1)/2, where M is the number of variable positions. If the true crossover rate is very high, the variable positions are assigned essentially at random in each daughter sequence and all possible daughters are essentially equally likely (in which case you can use GLUE instead of DRIVeR).
  • Since we assume that crossovers cannot occur immediately following a variable position (due to the nature of the reassembly reaction), two variable positions that are adjacent in the parent sequences remain linked in all daughter sequences, and the number of possible daughter sequences and the maximum observable crossover rate are reduced accordingly.
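These two limits can be illustrated with a short sketch. It assumes that "adjacent" means positions differing by exactly 1, and that each run of adjacent positions behaves as a single linked unit, so that G linked units give 2^G possible daughters and a maximum observable crossover rate of (G-1)/2. The helper names are hypothetical and are not part of DRIVeR.

```python
def linked_groups(positions):
    """Group variable positions that are directly adjacent (differ by 1)."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] == 1:
            groups[-1].append(pos)  # no crossover can separate these two
        else:
            groups.append([pos])
    return groups


def diversity_limits(positions):
    """Return (possible daughter sequences, maximum observable crossover mean)."""
    if not positions:
        raise ValueError("at least one variable position is required")
    n_groups = len(linked_groups(positions))
    return 2 ** n_groups, (n_groups - 1) / 2.0


if __name__ == "__main__":
    example = [250, 274, 375, 650, 655, 757, 763, 982, 991]
    print(diversity_limits(example))       # nine unlinked positions
    print(diversity_limits([10, 11, 50]))  # positions 10 and 11 are linked
```

For the nine unlinked positions of the example, this gives 512 possible daughters and a maximum observable rate of 4.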

Enter L = 1600, N = 1425, m = 2, and the variable nucleotide positions '250 274 375 650 655 757 763 982 991' into the DRIVeR form, and click the 'observable crossovers' check-box. Select the 'Calculate for the above parameters' option and click 'Calculate'. You should get the answer that the expected number of distinct sequences in the library is ~164, out of a total of 512 (= 2^9) possible daughter sequences, and that the true number of crossovers per daughter sequence is ~10 (i.e. about one crossover per 140 nt on average). Following the link to 'More statistics' displays the probabilities of a crossover between each pair of consecutive variable positions.
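A calculation of this kind can be sketched as follows. The sketch assumes a simplified model: the first variable position comes from either parent with probability 1/2, each gap between consecutive variable positions independently holds an odd number of crossovers with probability (1 - exp(-2 × rate × gap))/2, and the L library members are independent draws. All names are illustrative. Because it ignores details such as the exclusion of crossovers immediately after a variable position, it should land in the same region as the DRIVeR answer without necessarily reproducing it exactly.

```python
import math
from itertools import product

POSITIONS = [250, 274, 375, 650, 655, 757, 763, 982, 991]
SEQ_LEN = 1425


def expected_distinct(library_size, true_mean,
                      positions=POSITIONS, seq_len=SEQ_LEN):
    """Expected number of distinct daughters in a library (simplified model)."""
    rate = true_mean / seq_len
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    # Probability that the parent of origin switches across each gap.
    switch = [0.5 * (1.0 - math.exp(-2.0 * rate * gap)) for gap in gaps]

    total = 0.0
    for pattern in product((False, True), repeat=len(switch)):
        # Probability of one daughter: starting parent (1/2) times the
        # switch / no-switch probability for every gap.
        p = 0.5
        for switched, q in zip(pattern, switch):
            p *= q if switched else 1.0 - q
        # Chance that this daughter appears at least once in the library;
        # the factor 2 covers the two possible starting parents.
        total += 2.0 * (1.0 - (1.0 - p) ** library_size)
    return total


if __name__ == "__main__":
    print(expected_distinct(1600, 10.0))
```

Enumerating all switch patterns is practical here because nine variable positions give only 512 daughters; the loop grows as 2^M and would need a different approach for many variable positions.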

Alternatively, you could have chosen the 'Calculate and plot for a range of values' option on the base DRIVeR server page. This time you get back three plots plus a link to the 'Statistics' used to draw the plots (useful if you want to find more accurate values than you can read off the plots).

The plots can be useful for determining what library size or crossover rate you should aim for in order to sample a given fraction of the potential diversity. In the above example, if you wanted to sample all, or nearly all, of the 512 possible daughter sequences, then you could use the third plot (expected number of distinct sequences versus crossover rate and library size). In order to sample all 512 possible daughter sequences, you could either try to increase the crossover rate m or increase the library size L. However, the plot shows that for L = 1600, m would have to be unrealistically large (a crossover every 5 nt or more often) in order to sample all 512 possible daughter sequences. With a five-fold increase in crossover rate (a crossover every 30 nt or so), there should be around 320 distinct daughter sequences. Alternatively, maintaining the current crossover rate and increasing the library size to 25600 should also result in around 320 distinct daughter sequences.
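The same trade-off can be explored numerically with a small scan over library size and crossover rate. This is a sketch under the same simplified Poisson assumptions (uniform per-nucleotide crossover rate, independent gaps, independent library members); the function names and the grid of values are illustrative and the numbers it prints are not DRIVeR output.

```python
import math
from itertools import product

POSITIONS = [250, 274, 375, 650, 655, 757, 763, 982, 991]
SEQ_LEN = 1425


def expected_distinct(library_size, true_mean):
    """Expected distinct daughters under a simplified Poisson model."""
    rate = true_mean / SEQ_LEN
    switch = [0.5 * (1.0 - math.exp(-2.0 * rate * (b - a)))
              for a, b in zip(POSITIONS, POSITIONS[1:])]
    total = 0.0
    for pattern in product((False, True), repeat=len(switch)):
        p = 0.5  # probability of the starting parent
        for switched, q in zip(pattern, switch):
            p *= q if switched else 1.0 - q
        total += 2.0 * (1.0 - (1.0 - p) ** library_size)
    return total


def scan(library_sizes, true_means):
    """Tabulate expected diversity for every (library size, true mean) pair."""
    return {(size, mean): expected_distinct(size, mean)
            for size in library_sizes for mean in true_means}


if __name__ == "__main__":
    table = scan([1600, 6400, 25600], [10.0, 50.0])
    for (size, mean), value in sorted(table.items()):
        print(f"L = {size:6d}  true m = {mean:5.1f}  distinct = {value:6.1f}")
```

Both a larger library and a higher crossover rate raise the expected diversity, but neither reaches all 512 daughters quickly, because the daughters that require crossovers in the shortest gaps remain rare.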

Caveats

DRIVeR uses a generic Poisson model of crossover probabilities and positions. There are a few caveats that you should be aware of:

  • Remember to use GLUE, not DRIVeR, when all daughter sequences are equally likely (e.g. synthetic shuffling and SISDC).
  • In the current implementation of DRIVeR, you cannot have more than two different parent sequences.
  • As with PEDEL, there is the potential for amplification bias (see PEDEL caveats).
  • The two parent sequences are assumed to be highly homologous. For parent sequences that are homologous at the amino acid level but divergent at the nucleotide level, crossovers preferentially occur in regions with greater nucleotide sequence similarity. This bias is not reflected in the DRIVeR model, which nevertheless provides a useful upper bound on library diversity. It has been suggested (Moore G.L., Maranas C.D., 2002, Nucleic Acids Res., 30, 2407) that one way to reduce this bias is to synthesize new parent sequences that maintain the amino acid sequences of the original parents but have greater similarity at the nucleotide level. In the case of shuffling two epPCR-generated clones, large-scale sequence dissimilarity will not be an issue.
  • Any biases in library construction will decrease the actual number of distinct variants represented in the library. In such cases, DRIVeR provides the user with a useful upper bound on the diversity present in the library.
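For the equally likely case mentioned in the first caveat, the expected number of distinct variants follows from a standard occupancy argument: with V equally likely variants and L independent library members, each variant is missed with probability (1 - 1/V)^L. The sketch below is that textbook calculation only; the function name is hypothetical and it is not taken from the GLUE source.

```python
def expected_distinct_uniform(library_size, n_variants):
    """Expected distinct variants when all n_variants are equally likely."""
    if n_variants < 1 or library_size < 0:
        raise ValueError("need n_variants >= 1 and library_size >= 0")
    # Probability that a given variant never appears in the library.
    p_missed = (1.0 - 1.0 / n_variants) ** library_size
    return n_variants * (1.0 - p_missed)


if __name__ == "__main__":
    print(expected_distinct_uniform(1600, 512))
```

With V = 512 and L = 1600 this gives roughly 490 distinct variants, far more than the ~164 expected when crossovers are rare, which shows how strongly unequal daughter probabilities reduce the diversity sampled.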

Please refer to Patrick W.M., Firth A.E., Blackburn J.M., 2003, User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries, Protein Eng., 16, 451-457 for further discussion of DRIVeR.

A good review of the sources of bias in recombination and other directed evolution protocols can be found in Neylon C., 2004, Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution, Nucleic Acids Res., 32, 1448-1459.

