UPDATE 5th of July: One of the participants pointed out that on some platforms, PERF might underestimate APR by a small amount when there are certain numbers of ties. Most likely, this will not affect your results. However, we have put a new, more robust version on the KDD-Cup Web page. The new version is 5.11. Just in case you want to be absolutely certain, you might want to use the new code. Note again that this only affects APR on the protein problem. However, if you are still using a version of PERF prior to 5.10, you should download and use the new version. See the FAQ, or the README file in the download, for more detail.
We will use the program perf to measure the performance of the predictions you submit on the eight performance metrics. You do not need to use perf, but using perf will ensure that the metrics you optimize are defined the same way we will measure them.
Perf calculates a variety of performance metrics suitable for boolean classification problems. Metrics include: accuracy, root-mean-squared-error, cross-entropy, precision, recall, precision/recall break-even point and F-score, area under the ROC curve, lift, weighted cost, top 1, top 10, rank of lowest positive case, q-score, several measures of probability calibration, etc.
Perf is available for download already compiled for a number of platforms:
Each directory contains a subdirectory with sample test data and sample output. You can use this to test that perf works on your platform, and to see what the input format to perf is.
We recently made a number of changes to perf for the KDD-CUP, so please let us know if you find bugs.
Perf is a stand-alone C program that is easy to compile on different platforms. Here are the source code and makefile:
Perf can read from the standard input, or from files containing target values and predicted values. When reading from the standard input, the input to perf is a series of lines, each of which contains a target value and a predicted value separated by whitespace. Perf reads the entire input corresponding to the targets and predictions for a train or test set, and then calculates the performance measures you request. Here is a short example of the input file. The first column is the target class (0 or 1). The second column is the probability the model predicts for the case being in class 1. The input format allows any kind of whitespace to separate the two columns (e.g. spaces, tabs, commas).
1 0.80962
0 0.48458
1 0.65812
0 0.16117
0 0.47375
0 0.26587
1 0.71517
1 0.63866
0 0.36296
1 0.89639
0 0.35936
0 0.22413
0 0.36402
1 0.41459
1 0.83148
0 0.23271
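If your model is driven from a script, writing its predictions in this two-column format is straightforward. Below is a minimal Python sketch; the file name predictions.txt and the example lists of targets and predictions are just placeholders for your own data:

    # Write target/prediction pairs in the two-column format perf expects.
    targets = [1, 0, 1, 0]              # placeholder 0/1 class labels
    preds = [0.81, 0.48, 0.66, 0.16]    # placeholder predicted probabilities of class 1

    with open("predictions.txt", "w") as f:   # placeholder output file name
        for t, p in zip(targets, preds):
            f.write(f"{t} {p:.5f}\n")

The resulting file can then be scored with, for example, perf -roc < predictions.txt.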
The software can calculate a variety of performance measures, but you won't need most of them for the competition. For the KDD-CUP 2004 competition, the performance measures we are interested in are listed below (a small cross-check sketch follows the list):
- For the Particle Physics Problem:
- ACC: accuracy
- ROC: area under the ROC curve (aka AUC)
- CXE: cross-entropy
- SLQ 0.01: Stanford Linear Accelerator Q-score (more on this later)
- For the Protein Matching Problem:
- TOP1: how often is a correct match (a homolog) ranked first
- RMS: root-mean-squared-error (similar to optimizing squared error)
- RKL: rank of the last matching case (rank of the last positive case)
- APR: average precision
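If you want a rough sanity check of your own scoring code, the simpler metrics above are standard and easy to recompute. Here is a minimal Python sketch that reads a file in perf's two-column format and computes accuracy at a 0.5 threshold, RMS, and cross-entropy. It is only a sketch, not a re-implementation of perf: the file name is a placeholder, and the log base used here for cross-entropy (natural log) may differ from perf's, so the numbers can disagree by a constant factor; rank-based metrics such as APR are best left to perf itself.

    import math

    targets, preds = [], []
    with open("testperfdata") as f:            # any file in perf's two-column format
        for line in f:
            t, p = line.split()
            targets.append(int(t))
            preds.append(float(p))

    n = len(targets)
    acc = sum((p >= 0.5) == (t == 1) for t, p in zip(targets, preds)) / n
    rms = math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / n)
    eps = 1e-12                                # avoid log(0)
    cxe = -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
               for t, p in zip(targets, preds)) / n
    print(f"ACC {acc:.5f}  RMS {rms:.5f}  CXE {cxe:.5f}")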
If you specify no options, perf prints a variety of performance measures. Typically you will specify options so that perf only calculates the performance metric(s) you are interested in, but here is sample output of perf when run on one of the test data sets included in the perf_sample_test_data directory with no options specified:
[caruana] perf < testperfdata
ACC 0.83292 pred_thresh 0.500000
PPV 0.35294 pred_thresh 0.500000
NPV 0.96203 pred_thresh 0.500000
SEN 0.71429 pred_thresh 0.500000
SPC 0.84680 pred_thresh 0.500000
PRE 0.35294 pred_thresh 0.500000
REC 0.71429 pred_thresh 0.500000
PRF 0.47244 pred_thresh 0.500000
LFT 3.36975 pred_thresh 0.500000
SAR 0.78902 pred_thresh 0.500000 wacc 1.000000 wroc 1.000000 wrms 1.000000
ACC 0.90524 freq_thresh 0.617802
PPV 0.54762 freq_thresh 0.617802
NPV 0.94708 freq_thresh 0.617802
SEN 0.54762 freq_thresh 0.617802
SPC 0.94708 freq_thresh 0.617802
PRE 0.54762 freq_thresh 0.617802
REC 0.54762 freq_thresh 0.617802
PRF 0.54762 freq_thresh 0.617802
LFT 5.22846 freq_thresh 0.617802
SAR 0.81313 freq_thresh 0.617802 wacc 1.000000 wroc 1.000000 wrms 1.000000
ACC 0.91521 max_acc_thresh 0.712250
PPV 0.68182 max_acc_thresh 0.712250
NPV 0.92876 max_acc_thresh 0.712250
SEN 0.35714 max_acc_thresh 0.712250
SPC 0.98050 max_acc_thresh 0.712250
PRE 0.68182 max_acc_thresh 0.712250
REC 0.35714 max_acc_thresh 0.712250
PRF 0.46875 max_acc_thresh 0.712250
LFT 6.50974 max_acc_thresh 0.712250
SAR 0.81645 max_acc_thresh 0.712250 wacc 1.000000 wroc 1.000000 wrms 1.000000
PRB 0.54762
APR 0.51425
ROC 0.88380
R50 0.49954
RKL 273
TOP1 1.00000
TOP10 1.00000
SLQ 0.80851 Bin_Width 0.010000
RMS 0.34966
CXE 0.57335
CA1 0.22115 19_0.05_bins
CA2 0.22962 Bin_Size 100
To make the output simpler, you can specify only the measure(s) you are interested in. For example, to compute just the ROC Area or just the average precision:
[caruana] perf -roc < testperfdata
ROC 0.88380
[caruana] perf -apr < testperfdata
APR 0.51425
To compute the accuracy, cross-entropy, and root-mean-squared-error:
[caruana] perf -acc -cxe -rms < testperfdata
ACC 0.83292 pred_thresh 0.500000
RMS 0.34966
CXE 0.57335
Note that accuracy needed a threshold and perf used a default threshold of 0.5. If you want to use a different threshold (e.g. a threshold of 0 when using SVMs), the threshold can be specified with a "-threshold" option:
[caruana] perf -acc -threshold 0.0 -cxe -rms < testperfdata
ACC 0.10474 pred_thresh 0.000000
RMS 0.34966
CXE 0.57335
Note that the threshold changed only the accuracy, not the RMS or CXE. Predictions below the threshold are treated as class 0 and predictions above the threshold are treated as class 1. When submitting predictions for the KDD-CUP for accuracy (the only performance measure we are using in the cup that depends on a threshold), you will be asked to submit a threshold as well.
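To make the thresholding behaviour concrete, here is a tiny Python sketch of how a threshold turns real-valued predictions (for example raw SVM outputs) into class labels for accuracy, while metrics like RMS and CXE use the raw values directly; the threshold and prediction values are placeholders:

    threshold = 0.0                      # e.g. 0 for raw SVM margins
    preds = [1.3, -0.7, 0.2, -2.1]       # placeholder real-valued predictions
    labels = [1 if p > threshold else 0 for p in preds]
    print(labels)                        # -> [1, 0, 1, 0]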
Perf can read from files instead of from the standard input:
[caruana] perf -acc -threshold 0.0 -cxe -rms -file testperfdata
ACC 0.10474 pred_thresh 0.000000
RMS 0.34966
CXE 0.57335
Note that the -file option must be the last option specified.
Perf has a variety of other options not described here. It can plot ROC curves and precision-recall plots, display confusion matrices, calculate cost when unequal costs apply to false positives and false negatives, and automatically select thresholds that either maximize accuracy or make the frequency of cases predicted positive match the number of positive cases in the data set. Both of these threshold-selection modes should be used on train or validation sets; you should then specify that threshold with the "-threshold" option when scoring test sets, since finding optimal thresholds directly on test sets is usually a no-no. A tutorial on perf is currently being prepared, but you really won't need it for the KDD-CUP.
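As a rough illustration of that workflow, here is a minimal Python sketch that scans candidate thresholds on validation predictions and keeps the one maximizing accuracy; the labels, predictions, and file name are placeholders, and this simply mirrors the idea behind perf's max-accuracy threshold rather than reproducing it exactly:

    val_targets = [1, 0, 1, 0, 0, 1]               # placeholder validation labels
    val_preds = [0.9, 0.4, 0.7, 0.2, 0.6, 0.8]     # placeholder validation predictions

    def accuracy(th):
        return sum((p > th) == (t == 1)
                   for t, p in zip(val_targets, val_preds)) / len(val_targets)

    candidates = sorted(set(val_preds))            # simple candidate thresholds
    best = max(candidates, key=accuracy)
    print("best threshold:", best, "validation accuracy:", accuracy(best))
    # then score the test set with something like:
    #   perf -acc -threshold <best> -file test_predictions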
To see a list of perf's options, run perf with an illegal option such as "-help":
[caruana] perf -help
Error: Unrecognized program option -help
Version 5.00 [KDDCUP-2004]
Usage:
./perf [options] < input
OR ./perf [options] -file <input file>
OR ./perf [options] -files <targets file> <predictions file>
Allowed options:
PERFORMANCE MEASURES
-ACC Accuracy
-RMS Root Mean Squared Error
-CXE Mean Cross-Entropy
-ROC ROC area [default, if nothing else selected]
-R50 ROC area up to 50 negative examples
-SEN Sensitivity
-SPC Specificity
-NPV Negative Predictive Value
-PPV Positive Predictive Value
-PRE Precision
-REC Recall
-PRF F1 score
-PRB Precision/Recall Break Even Point
-APR Mean Average Precision
-LFT Lift (at threshold)
-TOP1 Top 1: is the top ranked case positive
-TOP10 Top 10: is there a positive in the top 10 ranked cases
-NTOP <N> How many positives in the top N ranked cases
-RKL Rank of *last* (poorest ranked) positive case
-NRM <arg> Norm error using L<arg> metric
-CST <tp> <fn> <fp> <tn>
Total cost using these cost values, plus min-cost results
-SAR <wACC> <wROC> <wRMS>
SAR = (wACC*ACC + wROC*ROC + wRMS(1-RMS))/(wACC+wROC+wRMS)
typically wACC = wROC = wRMS = 1.0
-CAL <bin_size> CA1/CA2 scores
-SLQ <bin_width> Slac Q-score
METRIC GROUPS
-all display most metrics (the default if no options are specified)
-easy ROC, ACC and RMS
-stats Accuracy, confusion table metrics, lift
-confusion Confusion table plus all metrics in stat
PLOTS (Only one plot is drawn at a time)
-plot roc Draw ROC plot
-plot pr Draw Precision/Recall plot
-plot lift Draw Lift versus threshold plot
-plot cost Draw Cost versus threshold plot
-plot acc Draw Accuracy versus threshold plot
PARAMETERS
-t <arg> Set threshold [default threshold is 0.5 if not set]
-percent <arg> Set threshold so <arg> percent of data falls above threshold (predicted positive)
INPUT
-blocks Input has BLOCK ID numbers in first column. Calculate performance for each block and
report the mean performance across the blocks. Only works with APR, TOP1, RKL, and RMS.
If using separate files for target and predictions input (-file option), the BLOCK ID numbers
must be the first column of the target file, with no block numbers in the predictions file.
-file <file> Read input from one file (1st col targets, 2nd col predictions)
-files <targets file> <predictions file> Read input from two separate files
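For the protein problem, the -blocks option averages a metric over blocks (one block per protein). As a rough illustration of the averaging it performs, here is a minimal Python sketch that computes mean per-block RMS from three-column input (block id, target, prediction); the file name is a placeholder and this is only a sketch of the idea, not a re-implementation of perf:

    import math
    from collections import defaultdict

    blocks = defaultdict(list)
    with open("blocked_predictions") as f:     # placeholder file: "block target prediction" per line
        for line in f:
            b, t, p = line.split()
            blocks[b].append((int(t), float(p)))

    def rms(pairs):
        return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

    mean_rms = sum(rms(pairs) for pairs in blocks.values()) / len(blocks)
    print(f"mean per-block RMS {mean_rms:.5f}")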