In addition to the amino acid sequences themselves, we built classifiers that incorporated the following derived attributes:
Genetic distance refers to the pairwise distance between a sequence and an R5 consensus sequence. We used the program Protdist in the PHYLIP package (Felsenstein, 1989) to build a matrix of pairwise distances between every sequence included in our analysis and an artificial R5 consensus strain (generated using the Se-Al v2.0 Carbon Sequence Alignment Editor). We included this attribute because of the evolutionary pattern underlying the distinction between the two phenotypic classes: X4 strains often evolve from R5 ancestral variants during the course of infection (Kreisberg et al., 2001). A positive correlation may therefore exist between distance to the R5 consensus and the likelihood of CXCR4 usage.
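As a rough illustration of what a distance-to-consensus attribute measures, the sketch below computes a simple p-distance (the fraction of differing residues at aligned, non-gap positions). This is a deliberately simplified stand-in, not the model-based distance that Protdist computes. The function name `p_distance` and the gap character `-` are assumptions made for this example.

```python
def p_distance(seq: str, consensus: str, gap: str = "-") -> float:
    """Fraction of mismatched residues between two aligned sequences.

    Simplified stand-in for a model-based protein distance: positions where
    either sequence has a gap are skipped (an assumption of this sketch).
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            mismatches += 1

    # With no comparable positions, report zero distance by convention.
    return mismatches / compared if compared else 0.0
```

Under this simplification, a sequence identical to the consensus has distance 0, and the value grows as more aligned residues differ.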
Gap number was included as a training attribute to explore the possibility that overall V3 loop length contributes to viral phenotype. Because each gap in an aligned sequence marks a position where a residue is absent, gap number is inversely related to sequence length.
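A minimal sketch of the gap count, assuming each sequence is available as an aligned string in which gaps are written as `-` (the gap symbol and the function names are assumptions of this example, not details given in the text):

```python
def gap_number(aligned_seq: str, gap_chars: str = "-") -> int:
    """Count gap characters in one aligned sequence."""
    return sum(aligned_seq.count(ch) for ch in gap_chars)


def ungapped_length(aligned_seq: str, gap_chars: str = "-") -> int:
    """Number of residues left once gaps are excluded."""
    return len(aligned_seq) - gap_number(aligned_seq, gap_chars)
```

For a fixed alignment width, the two quantities always sum to the same total, which is why gap number can serve as a proxy for loop length.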
The net charge of each sequence was calculated by assigning a charge of +1 to basic residues (arginine, lysine, and histidine), -1 to acidic residues (aspartic acid and glutamic acid), and 0 to all other amino acids, and then summing over the sequence. Net charge of the V3 loop sequence has previously been reported as a reasonable predictor of HIV coreceptor usage (Fouchier et al., 1992).
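The scoring scheme above can be sketched as follows, assuming sequences are given in one-letter amino acid code. The function name `net_charge` is hypothetical, and the example sequence is illustrative only, not taken from the data set.

```python
BASIC = set("RKH")   # arginine, lysine, histidine: +1 each
ACIDIC = set("DE")   # aspartic acid, glutamic acid: -1 each


def net_charge(seq: str) -> int:
    """Sum per-residue charges; all other symbols (including gaps) count as 0."""
    charge = 0
    for residue in seq.upper():
        if residue in BASIC:
            charge += 1
        elif residue in ACIDIC:
            charge -= 1
    return charge


# Illustrative V3-like sequence (not from the data set described here).
example = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(net_charge(example))
```

Note that histidine is counted as fully positive here, following the scheme described in the text.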
We predicted and recorded the coreceptor usage of each isolate in our data set using the conventional charge rule: a sequence with a positively charged residue at position 11 and/or position 25 of the V3 loop is classified as X4, and all other sequences are classified as R5. The charge rule designation was included as an attribute to indirectly evaluate the predictive power of the conventional classification scheme and to see whether a simple addition to the rule could be learned that would improve its performance.
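A minimal sketch of the charge rule follows. It makes two assumptions: string indices correspond directly to V3 loop positions (position 1 is the first character), and "positively charged" means arginine or lysine. The `positive` parameter can be set to `"RKH"` to also count histidine, as in the net charge calculation. The function name `charge_rule` is hypothetical.

```python
def charge_rule(v3_seq: str, positive: str = "RK", positions=(11, 25)) -> str:
    """Return "X4" if any listed 1-based position holds a positive residue, else "R5"."""
    for pos in positions:
        idx = pos - 1  # convert 1-based V3 position to 0-based index
        if idx < len(v3_seq) and v3_seq[idx].upper() in positive:
            return "X4"
    return "R5"


# Illustrative sequences (not from the data set): serine at 11, glutamic acid at 25.
r5_like = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
x4_like = r5_like[:10] + "R" + r5_like[11:]  # arginine substituted at position 11

print(charge_rule(r5_like), charge_rule(x4_like))
```

The resulting label can then be stored alongside the other attributes, so that a learned classifier can build on the rule rather than replace it.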