example

Determination of HIV-1 Coreceptor Usage via the C4.5 decision tree generator

The condensed decision tree shown below the table generates the coreceptor usage predictions when C4.5 is selected. Based on cross-validation experiments, this tree is expected to classify unseen test cases with 89% accuracy. It classifies 254 of our 271 training cases correctly (93.7%). The following table breaks down C4.5's performance in cross-validation.

TP Rate	FP Rate	Precision	Recall	Class
0.757	0.024	0.951	0.757	cxcr4
0.976	0.243	0.868	0.976	ccr5

p12 = G
|   p6 = E: cxcr4 (1.0)
|   p6 = K: cxcr4 (5.0)
|   p6 = N
|   |   p32 = I: ccr5 (17.0)
|   |   p32 = V
|   |   |   p22 = A: ccr5 (6.0/1.0)
|   |   |   p22 = H: cxcr4 (1.0)
|   |   |   p22 = V: cxcr4 (2.0)
p12 = I: cxcr4 (3.0)
p12 = K: cxcr4 (4.0)
p12 = N: cxcr4 (1.0)
p12 = Q: ccr5 (1.0)
p12 = R: cxcr4 (69.0/2.0)
p12 = S
|   p35 = -: ccr5 (1.0)
|   p35 = I: ccr5 (156.0/14.0)
|   p35 = M: cxcr4 (1.0)
|   p35 = P: cxcr4 (2.0)
p12 = X: cxcr4 (1.0)

Translated into English, this reads: "If there is a G at position 12 and an E or a K at position 6, then CXCR4; otherwise, if there is a G at position 12, an N at position 6, and an I at position 32, then CCR5 ...", and so on down the tree. Cases not covered by any of these rules default to ccr5.
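The rule reading above can be sketched in Python as a nested lookup. This is only an illustration of how the printed tree is walked, not the code that produces the predictions here. The names TREE, DEFAULT_CLASS, and classify are hypothetical, and the input format (a dict mapping position numbers, as used in the listing, to residues) is an assumption.

```python
# Condensed tree from the listing above.
# Inner nodes are (position, branches); leaves are class labels.
TREE = (12, {
    "G": (6, {
        "E": "cxcr4",
        "K": "cxcr4",
        "N": (32, {
            "I": "ccr5",
            "V": (22, {"A": "ccr5", "H": "cxcr4", "V": "cxcr4"}),
        }),
    }),
    "I": "cxcr4",
    "K": "cxcr4",
    "N": "cxcr4",
    "Q": "ccr5",
    "R": "cxcr4",
    "S": (35, {"-": "ccr5", "I": "ccr5", "M": "cxcr4", "P": "cxcr4"}),
    "X": "cxcr4",
})

# Cases not covered by any rule default to ccr5, as stated in the text.
DEFAULT_CLASS = "ccr5"


def classify(residues, node=TREE):
    """Walk the tree for a {position: residue} mapping (assumed input format)."""
    while not isinstance(node, str):
        position, branches = node
        residue = residues.get(position)
        if residue not in branches:
            return DEFAULT_CLASS  # no matching branch: fall back to the default
        node = branches[residue]
    return node


print(classify({12: "G", 6: "N", 32: "I"}))  # G at 12, N at 6, I at 32
print(classify({12: "R"}))                   # R at 12
```

A sequence whose residue at a tested position has no branch, for example an A at position 12, falls through to the default class.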

The numbers in parentheses to the right of each leaf give the number of training samples that reached that leaf, followed by the number of those that were misclassified: (# reached / # misclassified). A leaf showing a single number had no misclassifications.
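As a consistency check, the leaf counts can be summed to recover the training accuracy quoted above. The following is a minimal sketch; the regular expression and the names TREE_TEXT, LEAF, and leaf_totals are illustrative assumptions, not part of C4.5.

```python
import re

# The tree exactly as printed above.
TREE_TEXT = """\
p12 = G
|   p6 = E: cxcr4 (1.0)
|   p6 = K: cxcr4 (5.0)
|   p6 = N
|   |   p32 = I: ccr5 (17.0)
|   |   p32 = V
|   |   |   p22 = A: ccr5 (6.0/1.0)
|   |   |   p22 = H: cxcr4 (1.0)
|   |   |   p22 = V: cxcr4 (2.0)
p12 = I: cxcr4 (3.0)
p12 = K: cxcr4 (4.0)
p12 = N: cxcr4 (1.0)
p12 = Q: ccr5 (1.0)
p12 = R: cxcr4 (69.0/2.0)
p12 = S
|   p35 = -: ccr5 (1.0)
|   p35 = I: ccr5 (156.0/14.0)
|   p35 = M: cxcr4 (1.0)
|   p35 = P: cxcr4 (2.0)
p12 = X: cxcr4 (1.0)
"""

# Matches ": <class> (<reached>)" or ": <class> (<reached>/<misclassified>)".
LEAF = re.compile(r":\s*(\w+)\s*\(([\d.]+)(?:/([\d.]+))?\)")


def leaf_totals(text):
    """Return (samples reached, samples misclassified) summed over all leaves."""
    reached = errors = 0.0
    for match in LEAF.finditer(text):
        reached += float(match.group(2))
        errors += float(match.group(3) or 0)  # missing second number means 0
    return reached, errors


reached, errors = leaf_totals(TREE_TEXT)
correct = reached - errors
print(f"{correct:.0f}/{reached:.0f} = {correct / reached:.1%}")  # prints "254/271 = 93.7%"
```

The leaves account for all 271 training cases, of which 17 are misclassified, which matches the 254 correct cases reported.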