example

Determination of HIV-1 Coreceptor Usage via Machine Learning

(paper submitted to Bioinformatics)

The following project procedure is an example of how we used the WEKA machine learning toolkit, and one way you might want to use WetCat.

Objective: Create an interpretable classifier that can discriminate between HIV-1 sequences that use the CCR5 coreceptor exclusively and those that are capable of using the CXCR4 coreceptor. Essentially, given two sets of sequences from the same region of the HIV genome (V3) but with different (experimentally determined) class labels, see if there is a consistent difference between them.

Basic Procedure:

1. Collect labeled sequences from the Los Alamos HIV database
2. Align using ClustalW
3. Remove duplicate sequences
4. Format into tables suitable for WEKA (not necessary now with WetCat)
5. Run cross-validation experiments with various classifiers from WEKA

We settled on Support Vector Machines as the most reliable classifier in this case, but found the trees generated by the C4.5 algorithm to be much more informative and interesting to biologists.
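As a rough illustration of the "remove duplicates" and "format into tables" steps, here is a minimal Python sketch. It is not the script we used: the function names, the relation name, the one-attribute-per-alignment-column layout (p1, p2, ...) and the toy sequences are all assumptions made for the example. It only relies on the fact that WEKA reads ARFF files with nominal attributes.

```python
def deduplicate(records):
    """Keep the first occurrence of each distinct aligned sequence.

    records is a list of (label, aligned_sequence) pairs. Assumption: two
    records are duplicates when their aligned sequences are identical.
    """
    seen = set()
    unique = []
    for label, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((label, seq))
    return unique


def to_arff(records, relation="v3_coreceptor"):
    """Format aligned sequences as ARFF text, one nominal attribute per column.

    Attribute names p1..pN are 1-based alignment columns (an assumption that
    matches names such as p12 and p8 in J48 output).
    """
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to the same length")

    lines = ["@relation " + relation, ""]
    for i in range(length):
        # Quote every symbol so that the gap character '-' is a safe value.
        symbols = sorted({seq[i] for _, seq in records})
        values = ",".join("'" + s + "'" for s in symbols)
        lines.append("@attribute p%d {%s}" % (i + 1, values))
    labels = sorted({label for label, _ in records})
    lines.append("@attribute class {%s}" % ",".join(labels))

    lines += ["", "@data"]
    for label, seq in records:
        row = ["'" + c + "'" for c in seq] + [label]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy, made-up aligned fragments; not real V3 sequences.
    toy = [("ccr5", "CTS-"), ("cxcr4", "CTR-"), ("ccr5", "CTS-")]
    print(to_arff(deduplicate(toy)))
```

The sketch keeps the first label it sees for a repeated sequence; if the same sequence appeared with conflicting labels you would want to handle that case explicitly.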

Results: SVM classifies 91% of test data correctly. C4.5 classifies 89.5% of test data correctly and (with some attribute selection to limit the number of positions used) produces the following decision tree.

At the moment, this tree would look like this in the WetCat window:

J48 pruned tree
------------------

p12 = R: cxcr4 (61.0/2.0)
p12 = G
|   p8 = I: cxcr4 (4.0)
|   p8 = T: ccr5 (26.0/6.0)
|   p8 = A: ccr5 (0.0)
|   p8 = E: ccr5 (0.0)
|   p8 = V: ccr5 (0.0)
|   p8 = K: ccr5 (0.0)
|   p8 = Q: ccr5 (0.0)
|   p8 = -: ccr5 (0.0)
|   p8 = S: ccr5 (0.0)
p12 = S: ccr5 (145.0/17.0)
p12 = I: cxcr4 (2.0)
p12 = K: cxcr4 (3.0)
p12 = X: cxcr4 (1.0)
p12 = N: cxcr4 (1.0)
p12 = Q: ccr5 (1.0)

Number of Leaves  : 	16

Size of the tree : 	18
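A note on reading the numbers in parentheses: in J48 output, a leaf written as (N/E) means that N training instances reached that leaf and E of them were misclassified, while (N) alone means no errors. The short Python sketch below tallies these counts from the tree text above. The regular expression and the function name are our own for this example, and the totals describe the data the tree was built on, not the cross-validation results.

```python
import re

# The tree exactly as printed above (header and summary lines omitted).
TREE_TEXT = """\
p12 = R: cxcr4 (61.0/2.0)
p12 = G
|   p8 = I: cxcr4 (4.0)
|   p8 = T: ccr5 (26.0/6.0)
|   p8 = A: ccr5 (0.0)
|   p8 = E: ccr5 (0.0)
|   p8 = V: ccr5 (0.0)
|   p8 = K: ccr5 (0.0)
|   p8 = Q: ccr5 (0.0)
|   p8 = -: ccr5 (0.0)
|   p8 = S: ccr5 (0.0)
p12 = S: ccr5 (145.0/17.0)
p12 = I: cxcr4 (2.0)
p12 = K: cxcr4 (3.0)
p12 = X: cxcr4 (1.0)
p12 = N: cxcr4 (1.0)
p12 = Q: ccr5 (1.0)
"""

# A leaf line ends in ": <class> (<reached>[/<misclassified>])".
LEAF = re.compile(r":\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)")


def leaf_counts(text):
    """Return (class, reached, misclassified) for every leaf line in text."""
    leaves = []
    for line in text.splitlines():
        match = LEAF.search(line)
        if match:
            reached = float(match.group(2))
            wrong = float(match.group(3)) if match.group(3) else 0.0
            leaves.append((match.group(1), reached, wrong))
    return leaves


if __name__ == "__main__":
    leaves = leaf_counts(TREE_TEXT)
    total = sum(reached for _, reached, _ in leaves)
    wrong = sum(errors for _, _, errors in leaves)
    print("leaves:", len(leaves))
    print("instances reaching a leaf:", total)
    print("misclassified at the leaves:", wrong)
```

The leaf count found this way agrees with the "Number of Leaves" line, and the many (0.0) leaves under p8 are residues that no instance with a G at position 12 actually carried.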

Translated into English, both of these tree representations would read "If there is an R at position 12, then the class for the sequence is CXCR4; otherwise, if there is a G at position 12 and an I at position 8, then CXCR4; otherwise, if there is a G at position 12 and a T at position 8, then CCR5 ..."
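The same rules can also be written as a small lookup, which is one way you might apply the tree outside of WEKA. This is only a sketch: the function name is hypothetical, positions are assumed to be 1-based columns of the same alignment the tree was built from, and residues the tree has no branch for return None rather than a guess.

```python
# Branches on position 12 that lead directly to a leaf.
P12_LEAVES = {
    "R": "cxcr4", "S": "ccr5", "I": "cxcr4", "K": "cxcr4",
    "X": "cxcr4", "N": "cxcr4", "Q": "ccr5",
}

# Branches on position 8, consulted only when position 12 is G.
P8_GIVEN_G = {
    "I": "cxcr4", "T": "ccr5", "A": "ccr5", "E": "ccr5", "V": "ccr5",
    "K": "ccr5", "Q": "ccr5", "-": "ccr5", "S": "ccr5",
}


def classify_v3(aligned_seq):
    """Apply the J48 tree above to one aligned sequence.

    Returns "cxcr4", "ccr5", or None when the residue has no branch in
    the tree. Assumes 1-based positions, so p12 is index 11 and p8 is index 7.
    """
    if len(aligned_seq) < 12:
        raise ValueError("sequence must cover at least alignment position 12")
    p12 = aligned_seq[11]
    p8 = aligned_seq[7]
    if p12 == "G":
        return P8_GIVEN_G.get(p8)
    return P12_LEAVES.get(p12)


if __name__ == "__main__":
    # Made-up placeholder strings, not real V3 loops; only p8 and p12 matter.
    for seq in ("AAAAAAAIAAAG", "AAAAAAATAAAG", "AAAAAAAAAAAR"):
        print(seq, classify_v3(seq))
```

Keeping the two branch tables separate mirrors the shape of the tree: position 8 is only ever looked at after position 12 has been found to be G.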