Prediction of Coreceptor Usage for HIV-1

Notes on dataset

All V3 loop sequence entries in the LANL (Los Alamos National Laboratory) HIV Sequence Database that documented experimentally determined coreceptor usage were downloaded in FASTA format. Duplicate sequences were removed from the training set to minimize the effects of sampling bias in the database. V3 loops shorter than 34 or longer than 36 amino acids were removed from the set in the interest of producing a relatively gap-free alignment. ClustalW (Thompson et al., 1994) was launched from within the BioEdit Sequence Alignment Editor (Hall, 1999) to generate an automated multiple sequence alignment of the remaining 271 sequences, using the default parameter settings.
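The duplicate removal and length filter described above can be sketched as follows. This is a minimal illustration, not the scripts used for the study: the function names are hypothetical, duplicates are assumed to mean exact amino acid string matches (the first occurrence is kept), and the length limits are the 34 to 36 residues stated above.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_v3_loops(records, min_len=34, max_len=36):
    """Drop exact duplicate sequences, then drop loops outside the length range.

    Assumption: two entries are duplicates when their ungapped, upper-cased
    amino acid strings are identical; the first occurrence is kept.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.replace("-", "").upper()
        if seq in seen:
            continue
        seen.add(seq)
        if min_len <= len(seq) <= max_len:
            kept.append((name, seq))
    return kept
```

The filtered records would then be written back to FASTA and passed to the alignment program; that step is not shown here.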

Composition of training set

CCR5    CXCR4 (Capable)
168     103

Although the resulting alignment may have been suboptimal, we did not hand-edit it, so that the experiment would remain reproducible.

In addition to the amino acid sequence, we ran experiments in which net charge, sequence length, pairwise genetic distance to an R5 consensus, and the charge rule prediction were included as attributes for each sample in the dataset. For more explanation of these attributes, see Derived.
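As a rough illustration of how such attributes might be computed, the sketch below makes several assumptions that are not stated in these notes: net charge is taken as the count of K and R minus the count of D and E, the charge rule is taken to be the commonly used 11/25 rule (a basic residue at V3 position 11 or 25 predicts CXCR4 use), and genetic distance is taken as the simple proportion of differing sites between two aligned sequences. All function names are hypothetical.

```python
BASIC = set("KR")   # positively charged residues (assumption: H is ignored)
ACIDIC = set("DE")  # negatively charged residues


def net_charge(seq):
    """Net charge as (number of K, R) minus (number of D, E)."""
    return sum(aa in BASIC for aa in seq) - sum(aa in ACIDIC for aa in seq)


def charge_rule_11_25(seq):
    """Return True (predict CXCR4 use) if position 11 or 25 is K or R.

    Positions are 1-based on the ungapped loop; insertions or deletions
    relative to a 35-residue loop shift these positions.
    """
    return seq[10] in BASIC or seq[24] in BASIC


def p_distance(aligned_a, aligned_b):
    """Proportion of differing sites, skipping columns with a gap in either."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def derived_attributes(aligned_seq, aligned_r5_consensus):
    """Collect the four derived attributes for one aligned V3 loop."""
    ungapped = aligned_seq.replace("-", "")
    return {
        "net_charge": net_charge(ungapped),
        "length": len(ungapped),
        "distance_to_r5": p_distance(aligned_seq, aligned_r5_consensus),
        "charge_rule_x4": charge_rule_11_25(ungapped),
    }
```

The consensus passed to `derived_attributes` must be taken from the same alignment as the query sequence, so that columns correspond.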

Classifiers were trained to distinguish viruses capable of using CXCR4 as a coreceptor from those that were not. Dual-tropic (X4R5) viruses were therefore pooled into the X4 class.
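The pooling of dual-tropic viruses into the X4 class amounts to a binary relabeling, sketched below. The phenotype strings ("R5", "X4", "X4R5", "R5X4") are illustrative assumptions; the annotation format in the database export may differ.

```python
def cxcr4_capable(phenotype):
    """Map a coreceptor phenotype label to the binary training class.

    Returns True for any virus able to use CXCR4 (X4 or dual-tropic)
    and False for viruses restricted to CCR5.
    """
    label = phenotype.upper().replace("-", "").replace("/", "").strip()
    if label in {"X4", "X4R5", "R5X4"}:
        return True
    if label == "R5":
        return False
    raise ValueError(f"unrecognized phenotype label: {phenotype!r}")
```

Unrecognized labels raise an error instead of defaulting to either class, so that annotation problems are caught before training.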