What is an SLQ score?
SLQ is a domain-specific performance metric devised by researchers at
the Stanford Linear Accelerator Center (SLAC) to measure the performance
of predictions made for certain kinds of particle physics problems. SLQ
works with models that make continuous predictions on the interval
[0,1]. It breaks this interval into a series of bins. For the KDD-CUP
we are using 100 equally sized bins: 0.00-0.01, 0.01-0.02, ...,
0.98-0.99, 0.99-1.00.
PERF's SLQ option allows you to specify the number of bins. You
should use 100 bins for the KDD-CUP competition. For example, under
Unix:
perf -slq 100 < testdata
SLQ places predictions into the bins based on the predicted values.
If your model predicts a value of 0.025 for a case, that case will be
placed in the third bin 0.02-0.03. In each bin SLQ keeps track of the
number of true positives and true negatives. SLQ is maximized when the
bins have high purity, i.e., when every bin contains only 0's or only
1's. Since this is unlikely, SLQ computes a score based on how pure the
bins are.
The score is computed as follows: suppose a bin has 350 positives and
150 negatives. The error of this bin if we predict positives (the
predominant class) for all cases is 150/(350+150) = 0.30 = err. The
contribution of this bin to the total SLQ score is (1-2*err)^2, times
the total number of points in this bin divided by the total number of
points in the data set. In this example the purity term is
(1 - 2*0.30)^2 = 0.16, so the bin contributes 0.16 * 500/N, where N is
the total number of cases. Dividing by the size of the total data set
normalizes the score so that it is more independent of the size of the
data set. Multiplying by the number of points in the bin gives more
weight to bins that have more points, and less weight to bins with
fewer points. The term (1-2*err)^2 is maximized (= 1) when all points
in that bin are the same class, i.e., when the error due to predicting
the predominant class is minimized. This is somewhat similar to
measures of node purity in decision trees. The sum over the 100 bins
is maximized when the contribution of each bin (weighted by the number
of points in each bin) is maximized.
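To make the computation concrete, here is a small Python sketch of SLQ as described above. This is our own illustration, not PERF's code; the function name and arguments (slq_score, targets, predictions, num_bins) are ours:

def slq_score(targets, predictions, num_bins=100):
    # targets: 0/1 class labels; predictions: values on [0, 1]
    n = len(targets)
    pos = [0] * num_bins
    neg = [0] * num_bins
    for y, p in zip(targets, predictions):
        # place the prediction in its bin; a prediction of exactly 1.0
        # falls into the last bin
        b = min(int(p * num_bins), num_bins - 1)
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    score = 0.0
    for b in range(num_bins):
        total = pos[b] + neg[b]
        if total == 0:
            continue  # empty bins contribute nothing
        # error of predicting the predominant class for every case in the bin
        err = min(pos[b], neg[b]) / total
        score += (1 - 2 * err) ** 2 * total / n
    return score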
Collaborators at SLAC say that this is an important quantity for their
work. They want to find models that maximize it. Other than that, we
do not know exactly why the SLQ metric is defined this way. SLQ is a
domain-specific performance metric custom designed for this particular
class of problems.
Note that SLQ only cares about the purity of each bin. A model would
have poor accuracy and an ROC area below 0.5 if you switched the labels
used for positives and negatives after training, but SLQ is insensitive
to this.
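For example, because the sketch above uses err = min(pos, neg) / total, which is symmetric in the two classes, calling slq_score with the labels flipped (each target t replaced by 1 - t) returns exactly the same score.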
Hope this helps!
June 8: Is there a bug in how PERF calculates APR (average precision)?
Yes! A bug was found in how perf calculates precision and average precision (APR). The bug was that perf returned APR = 0 when there was only one true positive case in the data. We fixed perf (release 5.10 now) and made a few small improvements to perf's error handling. You only need to worry about this if you are working on the protein problem, which uses average precision as one of the metrics. The problem does not affect any of the metrics used for the physics problem. Many thanks to the team who found this problem and pointed it out to us.
We have implemented an improved algorithm for average precision. The fix is the same as the experimental code we released two weeks ago, so if you are using that code everything should be fine.
Note that the way we define average precision for this competition is somewhat different from the way average precision is usually defined in information retrieval. In IR, average precision usually is the average of the precision at recalls of 0.0, 0.1, 0.2, ..., 1.0, with a special rule for interpolating precisions to these recall values. For the KDD-CUP we are using a similar, but more precise, method. Instead of calculating the precision only at the 11 recalls mentioned above, we calculate the precision at every recall where it is defined. For each of these recall values, we find the threshold that produces the maximum precision, and take the average over all of the recall values greater than 0. This yields a lower-variance metric than the IR definition and thus should be easier to optimize. We believe this new way of calculating APR is a significant improvement over the method typically used in IR.
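As an unofficial illustration of this definition (our own sketch, not PERF's implementation, and without the fractional tie handling described later in this FAQ), the calculation can be written in Python as follows. Each recall level k/P is first reached at the k-th positive case in the sorted order, and among all thresholds attaining that recall, precision is maximal at exactly that threshold:

def average_precision(targets, predictions):
    # sort cases from the highest predicted value to the lowest
    order = sorted(range(len(targets)),
                   key=lambda i: predictions[i], reverse=True)
    num_pos = sum(targets)
    tp = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if targets[i] == 1:
            tp += 1
            # recall tp/num_pos is first attained here, and precision
            # tp/rank is the maximum precision at this recall level
            precisions.append(tp / rank)
    # average the best precision at every recall level greater than 0
    return sum(precisions) / num_pos if num_pos > 0 else 0.0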
What format will be used for submitting predictions?
Submissions are not due until July 14, but there have been questions about the submission format. The format we will use for submissions is exactly the same as the input format used by PERF, except that the targets should be replaced with the unique example ID numbers. We will then replace the ID numbers with the true targets, and run your predictions through PERF to calculate performances. Because there are four criteria for each problem (eight criteria total if you make submissions for both problems), you will submit four sets of predictions for each problem, one for each criterion. If you do not want to optimize learning for each criterion, you can submit the same predictions for multiple criteria. For example, you can submit the same predictions for Accuracy and ROC Area, and submit different predictions for Cross-entropy.
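For illustration only (the ID numbers and predicted values below are made up, and we assume PERF's usual layout of one case per line with the target in the first column and the prediction in the second), a submission file would look like the PERF input with the target column replaced by the example IDs:

573 0.8914
574 0.0237
575 0.6402
...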
Are there missing values in the data?
The protein/bio data does not contain any missing values. The physics data does contain "missing" values. In the phy data set, columns 22,23,24 and 46,47,48 use a value of "999" to denote "not available", and columns 31 and 57 use "9999" to denote "not available". These are the column numbers in the data tables starting with 1 for the first column (the case ID numbers). If you remove the first two columns (the case ID numbers and the targets), and start numbering the columns at the first attribute, these are attributes 20,21,22, and 44,45,46, and 29 and 55, respectively. You may treat missing values any way you want, including coding them as a unique value, imputing missing values, using learning methods that can handle missing values, ignoring these attributes, etc.
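As one possible way to handle this (a sketch only; the file name is hypothetical and the snippet assumes the first two columns, the case ID and the target, have already been removed so that the 1-based attribute numbers above apply), you could recode the sentinel values as NaN with numpy:

import numpy as np

# hypothetical file containing only the attribute columns of the phy data
X = np.loadtxt("phy_attributes.txt")

# attributes 20, 21, 22 and 44, 45, 46 use 999 for "not available";
# attributes 29 and 55 use 9999 (1-based attribute numbers as above)
for col in (20, 21, 22, 44, 45, 46):
    X[X[:, col - 1] == 999, col - 1] = np.nan
for col in (29, 55):
    X[X[:, col - 1] == 9999, col - 1] = np.nan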
How do I run PERF under windows?
I'm not very familiar with Windows, but I'll do my best. One way to use perf on Windows is to open a cmd window (try typing "cmd" in "Run" under the "Start" menu). In the cmd window, change to the directory perf is in using "cd" commands. Then execute perf.exe, specifying the options you want on the command line, and read from a file containing the targets and predictions. Here's a simple example from my Windows machine using the directories you get if you unzip the Windows download for perf:
C:\>cd c:\perf.windows
C:\perf.windows>dir
Directory of C:\perf.windows
05/04/2004 01:18p <DIR> .
05/04/2004 01:18p <DIR> ..
04/28/2004 06:04p 141,956 perf.exe
04/28/2004 07:10p <DIR> perf_sample_test_data
04/29/2004 01:15p 5,966 perf.1
2 File(s) 147,922 bytes
3 Dir(s) 15,593,144,320 bytes free
C:\perf.windows>perf.exe -acc -cxe -roc -slq 0.01 < perf_sample_test_data/testperfdata
ACC 0.83292 pred_thresh 0.500000
ROC 0.88380
SLQ 0.80851 Bin_Width 0.010000
CXE 0.57335
On the protein match problem, isn't it possible to always attain perfect TOP1 by assigning every test case to class "1"? Is there a restriction to prevent this?
To calculate TOP1, we sort cases by the values you predict for them. (If you prefer to sort the cases for us, just give us "predictions" that are the ranks you assign to each case so we get your ordering when we sort -- be sure to rank the case most likely to be class 1 with the *highest* rank (a number near 1000, not near 1).) If you predict all cases to be class 1, they all sort to the same position, so the homologs do not sort by themselves to the top 1 position. We will split ties in the ordering pessimistically. That is, if three cases tie for top 1 position, and only 1 (or 2) of these 3 cases are homologs (true class 1), then the TOP1 score will be 0 because 2 (or 1) of the cases predicted to be top 1 were not homologs. In other words, we are calculating TOP1 so that ties will hurt instead of help. This is appropriate for a score like TOP1 where you really have to predict a single case that is most likely to be a match.
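A rough Python sketch of TOP1 with pessimistic tie breaking, as described above (this is our illustration, not PERF's code; the function name is ours):

def top1_pessimistic(targets, predictions):
    # find the highest predicted value and every case tied at that value
    best = max(predictions)
    tied = [t for t, p in zip(targets, predictions) if p == best]
    # pessimistic tie breaking: score 1 only if every case tied for the
    # top position is a true positive (a homolog); otherwise score 0
    return 1 if all(t == 1 for t in tied) else 0

In particular, predicting the same value for every case puts every case into the tie for the top position, so the score is 0 unless every case in the set is a homolog.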
Are ties always broken pessimistically?
No. The performance measures that depend on ordering the data are TOP1, Average Precision, ROC Area, Rank of the Last Positive, and the SLAC Q-Score (SLQ). Of these, only TOP1 and Rank of the Last Positive (RKL) break ties pessimistically. RKL, which returns the rank of the last positive case in the sorted ordering, pessimistically assumes that the positive cases rank to the end of any group of cases they tie with. This is appropriate for a measure like RKL that measures how far down the list the model places a positive case. The other ordering measures, Average Precision, ROC Area, and SLQ, are neither pessimistic nor optimistic. They break ties by assigning fractional values to all cases that are tied. For example, if 10 cases are tied, and 3 of those cases are true class 1, these measures accumulate 3/10 = 0.3 points of "positiveness" as they include each point from the group of ties. There are other ways to break ties for these measures. This approach has the advantages that it is easy to understand, is neutral to ties, and is applicable to most performance measures that integrate performance over the full range of an ordering.
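To illustrate the fractional treatment for one of these measures (our own sketch, not PERF's code), here is ROC area computed with the pairwise formulation, in which each tied positive-negative pair contributes half a point; this is equivalent to spreading the "positiveness" of a tie group evenly across the group:

def roc_area_fractional_ties(targets, predictions):
    pos = [p for p, t in zip(predictions, targets) if t == 1]
    neg = [p for p, t in zip(predictions, targets) if t == 0]
    if not pos or not neg:
        return 0.0  # undefined when one class is absent; return 0 in this sketch
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0   # positive ranked above negative
            elif pp == pn:
                wins += 0.5   # tied pair gets fractional credit
    # fraction of positive-negative pairs ordered correctly
    return wins / (len(pos) * len(neg))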
When will the tasks and the data be available?
The datasets and tasks will be made available at this WWW-site in the week of April 26, 2004.