Results

Prediction of Coreceptor Usage for HIV-1 Results

Classifiers were trained to distinguish viruses capable of using CXCR4 as a coreceptor from those that were not. Dual-tropic (X4R5) viruses were therefore pooled into the X4 class.
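The pooling step described above can be sketched as a simple label mapping. This is a minimal illustration, not the pipeline used in this analysis: the label strings ("R5", "X4", "X4R5", "R5X4") and the function name pool_phenotype are assumptions made for the example.

```python
# Hypothetical label strings; the encoding of phenotypes in the
# original data set is not specified in the text.
CXCR4_CAPABLE = {"X4", "X4R5", "R5X4"}
CCR5_ONLY = {"R5"}


def pool_phenotype(phenotype: str) -> str:
    """Map a three-way phenotype annotation onto the two-way task.

    Any virus able to use CXCR4 (including dual-tropic isolates)
    is assigned to the X4 class; CCR5-only viruses stay in R5.
    """
    label = phenotype.strip().upper()
    if label in CXCR4_CAPABLE:
        return "X4"
    if label in CCR5_ONLY:
        return "R5"
    raise ValueError(f"unrecognized phenotype: {phenotype!r}")
```

Raising an error on unknown labels, rather than silently assigning a class, keeps annotation problems visible before training.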

Below we compare different classifiers and different combinations of attributes based on their performance in cross-validation experiments. Measurements are in terms of percent correct on cross-validation test sets.
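For readers unfamiliar with the measure, percent correct on cross-validation test sets can be computed roughly as follows. This is a sketch under stated assumptions: the number of folds (k), the interleaved fold assignment, and the toy majority-class learner are all illustrative choices, not details taken from this analysis.

```python
from collections import Counter
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k interleaved folds (deterministic)."""
    return [list(range(fold, n, k)) for fold in range(k)]


def cross_validated_percent_correct(
    features: Sequence,
    labels: Sequence[str],
    train: Callable[[list, list], Callable[[object], str]],
    k: int = 10,  # assumed fold count; not stated in the text
) -> float:
    """Percent of held-out cases predicted correctly, pooled over folds."""
    n = len(labels)
    correct = 0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        predict = train(train_x, train_y)  # fit on the training folds only
        correct += sum(predict(features[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / n


def train_majority(train_x: list, train_y: list) -> Callable[[object], str]:
    """Toy learner (illustration only): always predict the majority class."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda _x: majority
```

Any learner that returns a prediction function can be passed in place of train_majority. The figure returned here pools correct predictions over all folds; averaging per-fold accuracies gives the same value when folds are of equal size.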

Classifier	All attributes	Sequence only	Sequence only without p12, p29*	Derived Only
Charge	87.45	87.45	0	0
SVM	90.22	90.86	88.79	87.47
C4.5	88.98	89.51	84.54	89.38
PART	88.17	89.37	85.95	89.36

* Positions refer to our alignment: p12 corresponds to position 11 and p29 to position 25 in the public consensus.
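As a point of reference, a charge-based rule keyed to these two positions can be sketched as below. This assumes that the "Charge" row refers to the widely cited 11/25 rule (a basic residue at V3 position 11 or 25 predicts CXCR4 usage); the text does not spell out the rule, so the function name, the choice of K and R as basic residues, and the position defaults are assumptions.

```python
# Assumption: "basic" means lysine (K) or arginine (R).
BASIC_RESIDUES = {"K", "R"}


def charge_rule_predicts_x4(v3: str, pos_a: int = 11, pos_b: int = 25) -> bool:
    """Return True if either key position carries a basic residue.

    `v3` is a V3 loop amino-acid sequence and positions are 1-based.
    The defaults follow public-consensus numbering; pass pos_a=12,
    pos_b=29 for a sequence indexed by the alignment used here.
    """
    seq = v3.upper()
    if len(seq) < max(pos_a, pos_b):
        raise ValueError("sequence is shorter than the positions examined")
    return seq[pos_a - 1] in BASIC_RESIDUES or seq[pos_b - 1] in BASIC_RESIDUES
```

The positions are parameters because the numbering depends on the alignment, as the footnote above indicates.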

Classifier	p8	p12	p29	p8, p29	p12, p29	p8, p12
SVM	76.38	88.91	72.15	76.19	87.34	89.89
C4.5	75.98	88.55	70.64	74.6	88.53	89.97
PART	74.16	88.93	74.16	74.83	86.14	89.75

Classifier	Genetic Distance and Net Charge	Genetic Distance	Net Charge
SVM	88.61	78.17	81.12
C4.5	89.73	84.44	83.02
PART	89.94	84.44	83.02
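The net charge attribute in the table above can be illustrated with a short sketch. The exact definition used in this analysis is not given in the text; the version below assumes the common convention of counting lysine and arginine as +1 and aspartate and glutamate as -1, and ignores histidine.

```python
def net_charge(seq: str) -> int:
    """Net charge of an amino-acid sequence under a simple convention.

    Assumption: K and R count +1, D and E count -1, and every other
    character (including alignment gaps) counts 0.
    """
    residues = seq.upper()
    positive = sum(residues.count(aa) for aa in "KR")
    negative = sum(residues.count(aa) for aa in "DE")
    return positive - negative
```

Because gap characters contribute nothing, the function can be applied to aligned or unaligned sequences alike.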

All of the classifiers involved in this analysis reached an apparent maximum of approximately 90% during cross-validation trials. One limitation to classifier performance comes from the minor subset of sequences that violate the sequence-phenotype relationships exhibited by the vast majority of training cases. We determined that a large proportion of these sequences represent a third, biologically distinct phenotypic class of dual-tropic isolates that can utilize either chemokine receptor to enter a target cell. It could be argued, therefore, that these errors were reflections of a conflict between the two-way classification task and the tripartite structure of the phenotypic data, rather than a shortcoming in the classifiers themselves. The classifiers were incapable of reliably segregating the cases into the three classes, however, due to the low number of dual-tropic isolates in the training data set. This issue may be revisited using similar techniques in the future, when additional dual-tropic viruses have been characterized and sequenced.

The performance of all of the methods, including the charge rules, is impressive given:

errors in the experimental determination (and database annotation) of viral phenotype, i.e. documentation of false sequence-phenotype relationships,
difficulties in aligning the hypervariable V3 region of the HIV genome,
the strong influence of positions outside of the V3 loop on coreceptor usage (Rizzuto et al., 1998), and
the biological reality that coreceptor usage is a continuum between CCR5 and CXCR4 usage, rather than a discrete, binary phenotype.

Based on the inherent noise and lack of complete information in the data as it exists at the time of this analysis, it follows that there is an upper bound to classifier performance that is below 100%. We would like to apply the same techniques to systematically determine how sequence positions within the HIV-1 envelope outside of the V3 loop subregion modulate coreceptor usage. This will depend on the large-scale generation of full-length envelope sequences with corresponding phenotypic data. In addition, as in vitro assays become more sophisticated, it should be possible to describe coreceptor usage on a continuous scale, rather than categorizing the data into discrete, arbitrary classes. This information will allow for a high-resolution map of sequence against phenotype, whereby subtle changes in sequence could be predictive of minor effects on coreceptor preference.
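The upper bound mentioned above can be made concrete with a small worked calculation. This is a simplified model, not an estimate from the data: it assumes a two-class task and that annotation errors occur independently of classifier errors, and the error rates used with it are hypothetical.

```python
def observed_accuracy(true_accuracy: float, label_error_rate: float) -> float:
    """Accuracy measured against noisy labels in a two-class task.

    A prediction agrees with the recorded label either when the
    classifier is right and the label is right, or when both are
    wrong. Assumes label errors are independent of classifier errors.
    """
    for value in (true_accuracy, label_error_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    agree_both_right = true_accuracy * (1.0 - label_error_rate)
    agree_both_wrong = (1.0 - true_accuracy) * label_error_rate
    return agree_both_right + agree_both_wrong
```

Under this model, a classifier that is always right about the true phenotype would still score only 1 minus the label error rate, so a hypothetical 5% annotation error rate would cap measured performance at 95%.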