Description of Performance Metrics

METRICS FOR THE PARTICLE PHYSICS PROBLEM:

*** Accuracy (maximize): 

We use the usual definition of accuracy -- the number of cases
predicted correctly, divided by the total number of cases.  The
predictions submitted for accuracy can be boolean or continuous.
When submitting predictions, you will be asked for a prediction
threshold: predictions at or above this threshold are treated as
class 1, and predictions below the threshold are treated as
class 0.  An accuracy of 1.00 is perfect prediction.  Accuracy
near 0.00 is poor.
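
For concreteness, here is a small Python sketch of thresholded
accuracy as described above (the function name and the example
threshold are illustrative only, not part of perf):

  def accuracy(targets, preds, threshold=0.5):
      # Predictions at or above the threshold count as class 1,
      # predictions below it count as class 0.
      correct = sum(1 for t, p in zip(targets, preds)
                    if (1 if p >= threshold else 0) == t)
      return correct / len(targets)

  # accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7], 0.5) returns 0.5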

*** Area Under the ROC Curve (maximize):

We use the usual definition for ROC Area (often abbreviated AUC).
An ROC plot is a plot of true positive rate vs. false positive
rate as the prediction threshold sweeps through all the possible
values.  If you are familiar with sensitivity and specificity, 
this is the same as plotting sensitivity vs. 1-specificity as 
you sweep the threshold.  AUC is the area under this curve. AUC
of 1 is perfect prediction -- all positive cases sorted above
all negative cases.  AUC of 0.5 is random prediction -- there is
no relationship between the predicted values and truth.  AUC
below 0.5 indicates there is a relationship between predicted
values and truth, but the model is backwards, i.e., predicts
smaller values for positive cases!  Another way to think of
ROC Area is to imagine sorting the data by predicted values.
Suppose this sort is not perfect, i.e., some positive cases
sort below some negative cases.  AUC effectively measures how
many times you would have to swap cases with their neighbors to
repair the sort.  

                   # of swaps to repair sort
  AUC =  1.0 - ----------------------------------
               (# of positives) x (# of negatives)

There's a little more info on AUC in the comments at the top of 
the perf source code, but you probably don't need to look at it.
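
As an illustration of the swap-counting view above (perf computes
AUC for you), the formula can be coded directly.  This O(n^2)
Python sketch counts, for every positive/negative pair, whether
the pair is out of order; counting ties as half a swap is one
common convention and is our assumption here, not something the
formula above specifies:

  def auc(targets, preds):
      pos = [p for t, p in zip(targets, preds) if t == 1]
      neg = [p for t, p in zip(targets, preds) if t == 0]
      # A pair is out of order when the negative case gets the
      # higher predicted value; such pairs need a swap to repair
      # the sort.  Ties count as half a swap here.
      swaps = sum(1.0 if n > p else 0.5 if n == p else 0.0
                  for p in pos for n in neg)
      return 1.0 - swaps / (len(pos) * len(neg))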

*** Cross-Entropy (minimize):

We use the usual definition for cross-entropy.  Cross-entropy,
like squared error, measures how close predicted values are
to target values.  Unlike squared error, cross-entropy assumes 
the predicted values are probabilities on the interval [0,1]
that indicate the probability that the case is class 1.   

  Cross-Entropy = - SUM [(Targ)*log(Pred) + (1-Targ)*log(1-Pred)]

where Targ is the Target Class (0 or 1) and Pred is the predicted
probability that the case is class 1.  Note that the terms (Targ)
and (1-Targ) are alternately 0 or 1 so log(Pred) is added to the
sum for positive cases and log(1-Pred) is added for negative cases.
Note that cross-entropy is infinite if Targ is 0 and Pred is 1, or
if Targ is 1 and Pred is 0.  In the code we protect against this
by just returning a very, very large number (that may be platform
dependent).  If you get cross-entropies that are very large, you
probably have at least one prediction that is like this.

To make cross-entropy independent of data set size, we use the 
mean cross-entropy, i.e., the sum of the cross-entropy for each
case divided by the total number of cases.
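
A minimal Python sketch of the mean cross-entropy; the small
clipping constant is only an illustrative guard against the
infinite case described above and may differ from the protection
built into perf:

  import math

  def mean_cross_entropy(targets, preds, eps=1e-12):
      total = 0.0
      for t, p in zip(targets, preds):
          # Clip predictions away from 0 and 1 to avoid log(0).
          p = min(max(p, eps), 1.0 - eps)
          total += t * math.log(p) + (1 - t) * math.log(1 - p)
      return -total / len(targets)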

*** SLAC Q-Score (maximize):

SLQ is a domain-specific performance metric devised by researchers at
the Stanford Linear Accelerator Center (SLAC) to measure the
performance of predictions made for certain kinds of particle physics
problems. SLQ works with models that make continuous predictions on
the interval [0,1]. It breaks this interval into a series of bins.
For the KDD-CUP we are using 100 equally sized bins. The 100 bins
are: 0.00-0.01, 0.01-0.02, ..., 0.98-0.99, 0.99-1.00.

PERF's SLQ option allows you to specify the number of bins. You should
use 100 bins for the KDD-CUP competition. For example under unix:

perf -slq 100 < testdata 

SLQ places predictions into the bins based on the predicted values. If
your model predicts a value of 0.025 for a case, that case will be
placed in the third bin, 0.02-0.03. In each bin SLQ keeps track of the
number of positive and negative cases. SLQ is maximized if the bins
have high purity, i.e., if every bin contains only 0's or only 1's.
This is unlikely, so SLQ computes a score based on how pure the bins are.

The score is computed as follows: suppose a bin has 350 positives and
150 negatives.  The error of this bin if we predict positives (the
predominant class) for all cases is 150/(350+150) = 0.30 = err. The
contribution of this bin to the total SLQ score is (1-2*err)^2, times
the total number of points in this bin divided by the total number of
points in the data set. Dividing by the size of the total data set
normalizes the score so that it is more independent of the size of the
data set. Multiplying by the number of points in the bin gives more
weight to bins that have more points, and less weight to bins with
fewer points. The term (1-2*err)^2 is maximized (= 1) when all points
in that bin are the same class, i.e., when the error due to predicting
the predominant class is minimized.  This is somewhat similar to
measures of node purity in decision trees. The sum over the 100 bins
is maximized when the contribution of each bin (weighted by the number
of points in each bin) is maximized.
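
Here is a Python sketch of the SLQ computation under the same
assumptions (100 equal-width bins, majority-class error per bin);
the function name and the decision to put a prediction of exactly
1.0 in the last bin are our choices, not taken from perf:

  def slq(targets, preds, nbins=100):
      pos = [0] * nbins
      neg = [0] * nbins
      for t, p in zip(targets, preds):
          b = min(int(p * nbins), nbins - 1)  # 1.0 goes in the last bin
          if t == 1:
              pos[b] += 1
          else:
              neg[b] += 1
      n = len(targets)
      score = 0.0
      for npos, nneg in zip(pos, neg):
          total = npos + nneg
          if total == 0:
              continue
          # Error of predicting the predominant class in this bin.
          err = min(npos, nneg) / total
          score += (1 - 2 * err) ** 2 * total / n
      return score

For the example above, a bin with 350 positives and 150 negatives
contributes (1 - 2*0.30)^2 * 500/n = 0.16 * 500/n to the score.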

Collaborators at SLAC say that this is an important quantity for their
work. They want to find models that maximize it. Other than that, we
do not know exactly why the SLQ metric is defined this way. SLQ is a
domain-specific performance metric custom designed for this particular
class of problems.

Note that SLQ only cares about the purity of each bin.  A model would
have poor accuracy and AUC below 0.5 if you switched the labels used
for positives and negatives after training, but SLQ is insensitive to
this.


METRICS FOR THE BIOLOGY/PROTEIN PROBLEM:

Unlike the particle physics problem, the biology problem comes in
blocks.  Performance is measured on each block independently, and 
then the average performance across the blocks is used as the final
performance metric.

*** TOP1 (maximize): 

The fraction of blocks with a true homolog (class 1) predicted as the
best match.  In each block the predictions are sorted by predicted
value.  If the case that sorts to the top (highest predicted value) is
class 1, you score a 1 for that block.  If the top case is class 0,
you score 0.  (If there are ties for the top case, you score 0 unless
all of the tied cases are class 1.  This means ties do not help.)
The best TOP1 that can be achieved is to predict the highest value for
a true homolog in each block.  This would yield TOP1 = 1.0, perfect
TOP1 prediction.  TOP1 = 0.0 is poor.
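
A Python sketch of TOP1, following the tie rule above (the names
are ours, and blocks are assumed to be given as (targets,
predictions) pairs):

  def top1_block(targets, preds):
      # Score 1 only if every case tied at the highest predicted
      # value is class 1; otherwise score 0.
      best = max(preds)
      tied = [t for t, p in zip(targets, preds) if p == best]
      return 1 if all(t == 1 for t in tied) else 0

  def top1(blocks):
      # blocks: list of (targets, preds) pairs; TOP1 is the mean
      # of the per-block scores.
      return sum(top1_block(t, p) for t, p in blocks) / len(blocks)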

*** Average Rank of Last Homolog (minimize):

Like TOP1, cases are sorted by predicted value.  This metric measures
how far down the sorted cases the last true homolog falls.  A robust
model would predict high values for all true homologs, so the last
homolog would still sort high in the ordered cases.  A rank of 1 means
that the last homolog sorted to the top position.  This is excellent,
but can only happen if there is only 1 true homolog in the block. As
with TOP1, ties do not help -- if cases are tied perf assumes the true
homologs sort to the low end of the ties.  It is not possible to
achieve an average rank for the last homolog of 1 on this data because
most blocks have more than one homolog.  Average ranks near 1000
indicate poor performance.
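
One way to sketch the per-block rank in Python, using the
pessimistic tie rule described above (tied homologs are pushed to
the low end of the tie); the names and data layout are our
assumptions:

  def rank_of_last_homolog(targets, preds):
      # Sort by prediction, highest first; within a tie, class 0
      # cases come first, which pushes tied homologs toward the
      # low end (the pessimistic assumption).
      order = sorted(zip(preds, targets), key=lambda pt: (-pt[0], pt[1]))
      return max(i + 1 for i, (_, t) in enumerate(order) if t == 1)

The final metric is the mean of this rank over all blocks.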

*** Root-Mean-Squared-Error (minimize):

We use the usual definition for squared error, and then divide it by
the number of cases to make it independent of data set size, and then
take the square root to convert it back to the natural scale of the
targets and predictions.  

  RMSE = SQRT( 1/N * SUM (Targ-Pred)^2 )

For the KDD-CUP, the targets should be 0 and 1 and the predictions
should be on the interval [0,1].
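
The same formula as a short Python sketch (names illustrative):

  import math

  def rmse(targets, preds):
      return math.sqrt(sum((t - p) ** 2
                           for t, p in zip(targets, preds)) / len(targets))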

*** Mean Average Precision (maximize):

This is the metric that we got wrong in the first release of PERF.
Hopefully you are using the newer release of PERF.  One problem with
average precision is that there are a variety of definitions in the
literature.  To further the confusion, we are using a relatively
non-standard definition sometimes called "expected precision".  If you
are familiar with precision/recall plots, precision is the percentage
of points predicted to be positive (above some threshold) that are
positives.  High precision is good.  We sweep the threshold so that
each case becomes predicted positive one at a time, calculate the
precision at each of these thresholds, and then take the average of
all of these precisions.  This is kind of like the area under a
precision-recall plot, but not exactly.  One problem with average
precision is how to handle ties between positive and negative cases.
For the KDD-CUP competition we developed a new method for handling
ties that calculates the average precision of all possible orderings
of the cases that are tied.  Fortunately, we are able to do this
without enumerating all possible orderings!  Unfortunately, it is
still expensive when there are many ties (order n^2 in the number of
ties), so the code backs off to a simpler, pessimistic calculation when
there are thousands of ties.  (You shouldn't run into this in the
KDD-CUP since each block has only about 1000 cases.)

If you are not familiar with precision, there is a description of
it in the comments at the top of the source code for PERF.  But you
probably don't need to look at it.

Average precision is measured on each block, and then
the mean of each block's average precision is used as the final
metric.  A mean average precision of 1 is perfect prediction. 
The lowest mean average precision possible is 0.0.
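
A simplified Python sketch of this calculation: it computes the
precision at each of the cut-offs described above and averages
them, then averages over blocks.  It ignores the special handling
of ties, so it is only an approximation of what perf reports, and
the names are ours:

  def average_precision_block(targets, preds):
      # Sort by predicted value, highest first (ties broken
      # arbitrarily here, unlike perf).
      order = sorted(zip(preds, targets), key=lambda pt: -pt[0])
      positives_so_far = 0
      precisions = []
      for k, (_, t) in enumerate(order, start=1):
          positives_so_far += t
          precisions.append(positives_so_far / k)  # precision at this cut-off
      return sum(precisions) / len(precisions)

  def mean_average_precision(blocks):
      # blocks: list of (targets, preds) pairs; the final score is
      # the mean of the per-block average precisions.
      return sum(average_precision_block(t, p)
                 for t, p in blocks) / len(blocks)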