Given a library of *L* sequences generated by random recombination
of two near-identical genes that differ at only a small number of known nucleotide
(or amino acid) positions, we wish to calculate the expected number of
distinct sequences in the library. The model typically assumes that the mean number
of crossovers per sequence, *m*, is less than 0.1 × the sequence length *N*.

>Raillard et al. (2001, *Chem. Biol.*, **8**, 891-898) used
DNA shuffling to recombine two bacterial triazine hydrolase genes (*atzA* and *triA*,
GenBank accession numbers U55933 and AF312304, respectively). The *N* =
1425 nt genes differ at nine nucleotide positions: 250, 274, 375, 650,
655, 757, 763, 982 and 991. They screened a library of *L* = 1600 shuffled
variants. They state that 'every variant sequenced had undergone at
least one and as many as four recombination events'. Thus we estimate
that the mean number of observable crossovers per daughter sequence is
around *m* = 2. The underlying true number of crossovers per daughter
sequence is unknown (see the discussion below).

Note that, experimentally, crossovers are only observable if they occur in a region where they produce a distinct daughter sequence. One crossover between a consecutive pair of variable positions produces the same daughter sequence as 3, 5, 7, ... crossovers in that interval. Similarly, 2, 4, 6, ... crossovers produce the same daughter sequence as no crossovers at all. In addition, crossovers occurring between either end of the sequence and the nearest variable position cannot be detected by analysis of the daughter sequence.

Typically you will find the mean number of observable crossovers by sequence analysis of a sample of daughter sequences. Nonetheless, the underlying true number of crossovers is also an important statistic to know, especially when you want to vary the crossover rate by adjusting your recombination protocol.

In DRIVeR, you can choose to enter either the mean number of observable crossovers or the true mean number of all crossovers. Either way, DRIVeR will also calculate and tell you the other statistic.
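
As a rough illustration of how the two statistics relate, the sketch below assumes the simplest reading of a Poisson model: crossovers fall uniformly along the sequence at a rate of (true mean)/*N* per nt, so the number of crossovers in a gap of *d* nt is Poisson distributed and the chance that it is odd (and hence observable) is (1 - exp(-2 × rate × *d*))/2. The function names are hypothetical, and the sketch ignores any further details of DRIVeR's own implementation, so treat its output as an approximation only.

```python
import math

def p_odd(rate_per_nt, gap):
    """Probability of an odd number of Poisson crossovers in a gap of `gap` nt."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate_per_nt * gap))

def observable_crossovers(true_mean, seq_len, positions):
    """Mean observable crossovers, assuming a uniform rate of true_mean/seq_len per nt."""
    rate = true_mean / seq_len
    pos = sorted(positions)
    # Only gaps between consecutive variable positions can reveal a crossover.
    return sum(p_odd(rate, b - a) for a, b in zip(pos, pos[1:]))

def true_crossovers(observable_mean, seq_len, positions):
    """Invert observable_crossovers by bisection (it increases with the true mean)."""
    max_observable = (len(positions) - 1) / 2
    if not 0 <= observable_mean < max_observable:
        raise ValueError("observable mean must lie in [0, (M - 1)/2)")
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until it overshoots the target.
    while observable_crossovers(hi, seq_len, positions) < observable_mean:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if observable_crossovers(mid, seq_len, positions) < observable_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Variable positions from the Raillard et al. example (N = 1425 nt).
POSITIONS = [250, 274, 375, 650, 655, 757, 763, 982, 991]
print(observable_crossovers(10, 1425, POSITIONS))
print(true_crossovers(2.0, 1425, POSITIONS))
```

With the Raillard et al. positions, this simplified model maps a true mean of about 10 crossovers to roughly 2 observable crossovers.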

Note also the following:

The maximum observable crossover rate is (*M* - 1)/2, where *M* is the number
of variable positions. If the true crossover rate is very high, then the
variable positions will be essentially randomly assigned in each daughter
sequence, so each of the *M* - 1 intervals shows an observable crossover
with probability 1/2 and all possible daughters will be essentially equally likely
(in which case you can use GLUE instead of DRIVeR).

We assume that crossovers cannot occur immediately following a variable position (due to the nature of the reassembly reaction). Consequently, if two variable positions are adjacent in the parent sequences, they will remain linked in all daughter sequences, and the number of possible daughter sequences and the maximum observable crossover rate will be reduced accordingly.
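
The two limits described above can be sketched in a few lines. This is a minimal illustration, not DRIVeR's code: the function names are hypothetical, and it assumes that a run of adjacent variable positions behaves as a single linked block, which is one reading of the note on adjacent positions.

```python
def linkage_groups(positions):
    """Count blocks of variable positions that can be separated by a crossover.

    Assumption: positions that are directly adjacent (differ by 1 nt) stay
    linked, so they count as one block.
    """
    pos = sorted(positions)
    if not pos:
        raise ValueError("need at least one variable position")
    groups = 1
    for a, b in zip(pos, pos[1:]):
        if b - a > 1:
            groups += 1
    return groups

def diversity_limits(positions):
    """Return (number of possible daughters, maximum observable crossover rate)."""
    g = linkage_groups(positions)
    return 2 ** g, (g - 1) / 2

# Raillard et al. example: nine well-separated positions.
print(diversity_limits([250, 274, 375, 650, 655, 757, 763, 982, 991]))  # (512, 4.0)
# Hypothetical case with two adjacent positions (10 and 11) that stay linked.
print(diversity_limits([10, 11, 50]))  # (4, 0.5)
```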

Enter *L* = 1600, *N* = 1425, *m* = 2, and the variable nucleotide
positions '250 274 375 650 655 757 763 982 991' into the DRIVeR
form, and click the 'observable crossovers' check-box. Select
the 'Calculate for the above parameters' option and click 'Calculate'.
You should get the answer that the expected number of distinct sequences
in the library is ~164, out of a total of 512 possible daughter sequences,
and that the true number of crossovers per daughter sequence is ~10 (*i.e.* about
one crossover per 140 nt on average). Following the link to 'More
statistics' displays the probabilities of a crossover between each
pair of consecutive variable positions.
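
To make the calculation more concrete, here is a minimal sketch of how an expected number of distinct sequences could be computed under a simplified Poisson model. It assumes a uniform crossover rate of (true mean)/*N* per nt, an equal chance of either parent at the first variable position, and independent sampling of *L* daughters. The function name and these modelling choices are assumptions made for illustration. DRIVeR's implementation may differ in detail, so this sketch is not expected to reproduce its reported figures exactly.

```python
import math
from itertools import product

def expected_distinct(library_size, true_mean, seq_len, positions):
    """Expected number of distinct daughters in a library of `library_size`."""
    rate = true_mean / seq_len
    pos = sorted(positions)
    # Probability that the parent of origin switches across each gap
    # (odd number of Poisson crossovers in the gap).
    p_switch = [0.5 * (1.0 - math.exp(-2.0 * rate * (b - a)))
                for a, b in zip(pos, pos[1:])]
    total = 0.0
    for pattern in product((False, True), repeat=len(p_switch)):
        prob = 0.5  # either parent is equally likely at the first position
        for switched, p in zip(pattern, p_switch):
            prob *= p if switched else 1.0 - p
        # Chance this daughter appears at least once in the library; the
        # factor 2 covers the two daughters (one per starting parent) that
        # share this switch pattern and therefore this probability.
        total += 2.0 * (1.0 - (1.0 - prob) ** library_size)
    return total

POSITIONS = [250, 274, 375, 650, 655, 757, 763, 982, 991]
print(expected_distinct(1600, 10, 1425, POSITIONS))
```

The loop enumerates all 2^(*M* - 1) switch patterns, which is practical for a handful of variable positions (256 patterns here) but grows exponentially with *M*.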

Alternatively, you could have chosen the 'Calculate and plot for a range of values' option on the base DRIVeR server page. This time you get back three plots plus a link to the 'Statistics' used to draw the plots (useful if you want to find more accurate values than you can read off the plots).

The plots can be useful for determining what library size or crossover
rate you should aim for in order to sample a given fraction of the potential
diversity. In the above example, if you wanted to sample all, or nearly
all, of the 512 possible daughter sequences, then you could use the third
plot (expected number of distinct sequences versus crossover rate and library
size). In order to sample all 512 possible daughter sequences, you could
either try to increase the crossover rate *m* or increase the library
size *L*. However, the plot shows that for *L* = 1600, *m* would
have to be unrealistically large (crossovers every 5 nt, or more frequently) in order
to sample all 512 possible daughter sequences. With a five-fold increase
in crossover rate (crossovers every 30 nt or so), there should be
~320 distinct daughter sequences. Alternatively, maintaining the current
crossover rate and increasing the library size to 25600 should also result
in ~320 distinct daughter sequences.

DRIVeR uses a generic Poisson model of crossover probabilities and positions. There are a few caveats that you should be aware of:

- Remember to use GLUE, not DRIVeR, when all daughter sequences are equally likely (e.g. synthetic shuffling and SISDC).
- In the current implementation of DRIVeR, you cannot have more than two different parent sequences.
- As with PEDEL, there is the potential for amplification bias (see PEDEL caveats).
- The two parent sequences are assumed to be highly homologous. For parent
sequences that are homologous at the amino acid level but divergent at
the nucleotide level, crossovers preferentially occur in regions with greater
nucleotide sequence similarity. This bias is not reflected in the DRIVeR
model which, nevertheless, provides a useful upper bound on library diversity.
It has been suggested (Moore G.L., Maranas C.D., 2002,
*Nucleic Acids Res.*, **30**, 2407) that one way to reduce this bias is to synthesize new parent sequences that maintain the amino acid sequences of the original parents, but have greater similarity at the level of nucleotides. In the case of shuffling two epPCR-generated clones, large-scale sequence dissimilarity will not be an issue.
- Any biases in library construction will decrease the actual number of distinct variants represented in the library. In such cases, DRIVeR provides the user with a useful upper bound on the diversity present in the library.

Please refer to Patrick W.M., Firth A.E., Blackburn J.M., 2003, User-friendly
algorithms for estimating completeness and diversity in randomized protein-encoding
libraries, *Protein Eng.*, **16**, 451-457 for further discussion
of DRIVeR.

A good review of the sources of bias in recombination and other directed
evolution protocols can be found in Neylon C., 2004, Chemical and biochemical
strategies for the randomization of protein encoding DNA sequences: library
construction methods for directed evolution, *Nucleic Acids Res.*, **32**,
1448-1459.