Description of Performance Metrics

 ```METRICS FOR THE PARTICLE PHYSICS PROBLEM: *** Accuracy (maximize): We use the usual definition of accuracy -- the number of cases predicted correctly, divided by the total number of cases. The predictions submitted for accuracy can be boolean or continuous. When submitting predictions, you will be asked for a prediction threshold: predictions above or equal to this threshold will be treated as class 1, predictions below threshold are treated as class 0. An accuracy of 1.00 is perfect prediction. Accuracy near 0.00 is poor. *** Area Under the ROC Curve (maximize): We use the usual definition for ROC Area (often abbreviated AUC). An ROC plot is a plot of true positive rate vs. false positive rate as the prediction threshold sweeps through all the possible values. If you are familiar with sensitivity and specificity, this is the same as plotting sensitivity vs. 1-specificity as you sweep the threshold. AUC is the area under this curve. AUC of 1 is perfect prediction -- all positive cases sorted above all negative cases. AUC of 0.5 is random prediction -- there is no relationship between the predicted values and truth. AUC below 0.5 indicates there is a relationship between predicted values and truth, but the model is backwards, i.e., predicts smaller values for positive cases! Another way to think of ROC Area is to imagine sorting the data by predicted values. Suppose this sort is not perfect, i.e., some positive cases sort below some negative cases. AUC effectively measures how many times you would have to swap cases with their neighbors to repair the sort. # of swaps to repair sort AUC = 1.0 - ---------------------------------- (# of positives) x (# of negatives) There's a little more info on AUC in the comments at the top of the perf source code, but you probably don't need to look at it. *** Cross-Entropy (minimize): We use the usual definition for cross-entropy. Cross-entropy, like squared error, measures how close predicted values are to target values. Unlike squared error, cross-entropy assumes the predicted values are probabilities on the interval [0,1] that indicate the probability that the case is class 1. Cross-Entropy = SUM [(Targ)*log(Pred) + (1-Targ)*log(1-Pred)] where Targ is the Target Class (0 or 1) and Pred is the predicted probability that the case is class 1. Note that the terms (Targ) and (1-Targ) are alternately 0 or 1 so log(Pred) is added to the sum for positive cases and log(1-Pred) is added for negative cases. Note that cross entropy is infinity of the Targ is 0 and Pred is 1 or Targ is 1 and Pred is 0. In the code we protect against this by just returning a very, very large number (that may be platform dependent). If you get cross-entropies that are very large, you probably have at least one prediction that is like this. To make cross-entropy independent of data set size, we use the mean cross-entropy, i.e., the sum of the cross-entropy for each case divided by the total number of cases. *** SLAC Q-Score (maximize): SLQ is a domain-specific performance metric devised by researchers at the Stanford Linear Accellerator (SLAC) to measure the performance of predictions made for certain kinds of particle physics problems. SLQ works with models that make continuous predictions on the interval [0-1]. It breaks this interval into a series of bins. For the KDD-CUP we are using 100 equally sized bins. The 100 bins are: 0.00-0.01, 0.01-0.02, ..., 0.98-0.99, 0.99-1.00. PERF's SLQ option allows you to specify the number of bins. You should use 100 bins for the KDD-CUP competition. For example under unix: perf -slq 100 < testdata SLQ places predictions into the bins based on the predicted values. If your model predicts a value of 0.025 for a case, that case will be placed in the third bin 0.02-0.03. In each bin SLQ keeps track of the number of true positives and true negatives. SLQ is maximized if bins have high purity, e.g., if all bins contain all 0's or all 1's. This is unlikely, so SLQ computes a score based on how pure the bins are. The score is computed as follows: suppose a bin has 350 positives and 150 negatives. The error of this bin if we predict positives (the predominant class) for all cases is 150/(350+150) = 0.30 = err. The contribution of this bin to the total SLQ score is (1-2*err)^2, times the total number of points in this bin divided by the total number of points in the data set. Dividing by the size of the total data set normalizes the score so that it is more independent of the size of the data set. Multiplying by the number of points in the bin gives more weight to bins that have more points, and less weight to bins with fewer points. The term (1-2*err)^2 is maximized (= 1) when all points in that bin are the same class, i.e., when the error due to predicting the predominant class is minimized. This is somewhat similar to measures of node purity in decision trees. The sum over the 100 bins is maximized when the contribution of each bin (weighted by the number of points in each bin) is maximized. Collaborators at SLAC say that this is an important quantity for their work. They want to find models that maximize it. Other than that, we do not know exactly why the SLQ metric is defined this way. SLQ is a domain-specific performance metric custom designed for this particular class of problems. Note that SLQ only cares about the purity in each node. A model would have poor accuracy and ROC below 0.5 if you switch the labels used for positive and negatives after training, but SLQ is insensitive to this. METRICS FOR THE BIOLOGY/PROTEIN PROBLEM: Unlike the particle physics problem, the biology problem comes in blocks. Performance is measured on each block independently, and then the average performance across the blocks is used as the final performance metric. *** TOP1 (maximize): The fraction of blocks with a true homolog (class 1) predicted as the best match. In each block the predictions are sorted by predicted value. If the case that sorts to the top (highest predicted value) is class 1, you score a 1 for that block. If the top case is class 0, you score 0. (If there are ties for the top case, you score 0 unless all of the tied cases are class 1. This means ties do not help.) The best TOP1 that can be achieved is to predict the highest value for a true homolog in each block. This would yield TOP1 = 1.0, perfect TOP1 prediction. TOP1 = 0.0 is poor. *** Average Rank of Last Homolog (minimize): Like TOP1, cases are sorted by predicted value. This metric measures how far down the sorted cases the last true homolog falls. A robust model would predict high values for all true homologs, so the last homolog would still sort high in the ordered cases. A rank of 1 means that the last homolog sorted to the top position. This is excellent, but can only happen if there is only 1 true homolog in the block. As with TOP1, ties do not help -- if cases are tied perf assumes the true homologs sort to the low end of the ties. It is not possible to achieve an average rank for the last homolog of 1 on this data because most blocks have more than one homolog. Average ranks near 1000 indicate poor performance. *** Root-Mean-Squared-Error (minimze): We use the usual defintion for squared error, and then divide it by the number of cases to make it independent of data set size, and then take the square root to convert it back to the natural scale of the targets and predictions. RMSE = SQRT( 1/N * SUM (Targ-Pred)^2 ) For the KDD-CUP, the targets should be 0 and 1 and the predictions should be on the interval [0,1]. *** Mean Average Precision (maximize): This is the metric that we got wrong in the first release of PERF. Hopefully you are using the newer release of PERF. One problem with average precision is that there are a variety of definitions in the literature. To further the confusion, we are using a relatively non-standard definition sometimes called "expected precision". If you are familiar with precision/recall plots, precision is the percentage of points predicted to be positive (above some threshold) that are positives. High precision is good. We sweep the threshold so that each case becomes predicted positive one at a time, calculate the precision at each of these thresholds, and then take the average of all of these precisions. This is kind of like the area under a precision recall plot, but not exactly. One problem with average precision is how to handle ties between positive and negative cases. For the KDD-CUP competition we developed a new method for handling ties that calculates the average precision of all possible orderings of the cases that are tied. Fortunately, we are able to do this without enumerating all possible orderings! Unfortuantely, it is still expensive when there are many ties (order n^2 in the number of ties), so the code backs off to simpler pessimistic calculation when there are thousands of ties. (You shouldn't run into this in the KDD-CUP since each block has only about 1000 cases.) If you are not familiar with precision, there is a description of it in the comments at the top of the source code for PERF. But you probably don't need to look at it. Average precision is measured is measured on each block, and then the mean of each block's average precision is used as the final metric. A mean average precision of 1 is perfect prediction. The lowest mean average precision possible is 0.0. ```