Classifiers were trained to distinguish viruses capable of using CXCR4 as a coreceptor from those that were not. Dual-tropic (X4R5) viruses were therefore pooled into the X4 class.
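The pooling step can be illustrated with a minimal sketch. The label strings (`R5`, `X4`, `X4R5`) and the helper name `pool_phenotype` are assumptions made for illustration; the encoding used in the actual data set is not stated here.

```python
# Hypothetical phenotype labels; the exact strings are assumptions.
POOLED_CLASS = {
    "R5": "R5",    # cannot use CXCR4
    "X4": "X4",    # uses CXCR4
    "X4R5": "X4",  # dual-tropic: can use CXCR4, so pooled with X4
}


def pool_phenotype(phenotype: str) -> str:
    """Map a three-way phenotype onto the two-way classification target."""
    return POOLED_CLASS[phenotype]


# Toy example: four isolates with made-up phenotype labels.
labels = ["R5", "X4R5", "X4", "R5"]
binary_labels = [pool_phenotype(p) for p in labels]
```

With this mapping, the classification target answers a single question: can the virus use CXCR4 or not.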
Below we compare classifiers and attribute combinations by their performance in cross-validation experiments. All values are percent correct on cross-validation test sets.
Classifier | All attributes | Sequence only | Sequence only without p12, p29* | Derived only |
Charge | 87.45 | 87.45 | 0 | 0 |
SVM | 90.22 | 90.86 | 88.79 | 87.47 |
C4.5 | 88.98 | 89.51 | 84.54 | 89.38 |
PART | 88.17 | 89.37 | 85.95 | 89.36 |
* Positions refer to our alignment; p12 corresponds to position 11 and p29 to position 25 in the public consensus.
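The "percent correct on cross-validation test sets" measure reported in these tables could be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the analysis: the number of folds (`k=10`), the shuffling seed, and the pooling of correct predictions across folds are all assumptions, and `train_majority` is a hypothetical stand-in for a real learner such as an SVM or C4.5.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def percent_correct_cv(
    X: Sequence,
    y: Sequence[str],
    train_fn: Callable[[List, List[str]], Callable],
    k: int = 10,   # assumed fold count; not stated in the text
    seed: int = 0,
) -> float:
    """Percent of held-out cases predicted correctly, pooled over k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k disjoint test folds.
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Train only on cases outside the current test fold.
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return 100.0 * correct / len(X)


def train_majority(X_train: List, y_train: List[str]) -> Callable:
    """Hypothetical placeholder learner: always predicts the majority class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda x: majority


# Toy data: ten made-up cases, eight R5 and two X4.
toy_X = list(range(10))
toy_y = ["R5"] * 8 + ["X4"] * 2
score = percent_correct_cv(toy_X, toy_y, train_majority, k=5)
```

Because every case is held out exactly once, the result is directly comparable across classifiers trained on the same data with the same fold assignment.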
Classifier | p8 | p12 | p29 | p8, p29 | p12, p29 | p8, p12 |
SVM | 76.38 | 88.91 | 72.15 | 76.19 | 87.34 | 89.89 |
C4.5 | 75.98 | 88.55 | 70.64 | 74.6 | 88.53 | 89.97 |
PART | 74.16 | 88.93 | 74.16 | 74.83 | 86.14 | 89.75 |
Classifier | Genetic Distance and Net Charge | Genetic Distance | Net Charge |
SVM | 88.61 | 78.17 | 81.12 |
C4.5 | 89.73 | 84.44 | 83.02 |
PART | 89.94 | 84.44 | 83.02 |
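For readers unfamiliar with the Net Charge attribute, one common way to derive it from an amino acid sequence is sketched below. The exact definition used in this analysis is not given here, so the residue sets are an assumption: lysine and arginine count as +1, aspartate and glutamate as -1, and histidine is treated as neutral. The function name `net_charge` and the example peptides are hypothetical.

```python
# Assumed charge convention: K and R are +1, D and E are -1.
# Histidine is treated as neutral here; that is an assumption.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge(sequence: str) -> int:
    """Net charge of an amino acid sequence; gaps and other symbols count as 0."""
    seq = sequence.upper()
    positives = sum(aa in POSITIVE for aa in seq)
    negatives = sum(aa in NEGATIVE for aa in seq)
    return positives - negatives


# Made-up peptides, not real V3 loop sequences.
example_basic = net_charge("CTRPKRG")   # R, K, R and no acidic residues
example_acidic = net_charge("ADE-K")    # D, E and one K; the gap is ignored
```

A derived attribute of this kind collapses the whole sequence into a single integer, which is why it can be combined with, or compared against, the position-specific sequence attributes.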
All of the classifiers in this analysis reached an apparent maximum of approximately 90% correct during cross-validation trials. One limit on classifier performance comes from the minor subset of sequences that violate the sequence-phenotype relationships exhibited by the vast majority of training cases. We determined that a large proportion of these sequences represent a third, biologically distinct phenotypic class: dual-tropic isolates that can use either chemokine receptor to enter a target cell. It could therefore be argued that these errors reflect a conflict between the two-way classification task and the tripartite structure of the phenotypic data, rather than a shortcoming of the classifiers themselves. However, because the training data set contains few dual-tropic isolates, the classifiers could not reliably segregate the cases into three classes. This issue may be revisited with similar techniques once additional dual-tropic viruses have been characterized and sequenced.
The performance of all of the methods, including the charge rules, is impressive given:
Given the inherent noise and the incomplete information in the data available at the time of this analysis, classifier performance has an upper bound below 100%. We would like to apply the same techniques to determine systematically how sequence positions in the HIV-1 envelope outside the V3 loop subregion modulate coreceptor usage. This will depend on the large-scale generation of full-length envelope sequences with corresponding phenotypic data. In addition, as in vitro assays become more sophisticated, it should become possible to describe coreceptor usage on a continuous scale rather than sorting the data into discrete, arbitrary classes. Such data would allow a high-resolution map of sequence against phenotype, in which subtle changes in sequence could predict minor effects on coreceptor preference.