KDD Cup 2004

Introduction

This year's competition focuses on various performance measures for supervised classification. Real-world applications of data mining typically require optimizing non-standard performance measures that depend on the application. For example, in direct marketing, error rate is a poor indicator of performance, since there is a strong imbalance in cost between missing a customer and making the advertisement effort slightly too broad. Even for the same dataset, we often want different classification rules that optimize different criteria.

In this year's competition, we ask you to design classification rules that optimize a particular performance measure (e.g. accuracy, ROC area, lift). For each of the two datasets described below, we provide a supervised training set and a test set whose correct labels are withheld. For each of the two test sets, you will submit multiple sets of predictions. Each set should optimize performance according to one particular measure. We will provide software for computing the performance measures in the software area. The same software will be used to determine the winners of the competition. The particular performance measures are described in the documentation of the software.

Particle Physics Task

The goal in this task is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments. It is a binary classification problem with 78 attributes. The training set has 50,000 examples, and the test set has 100,000 examples. Your task is to provide 4 sets of predictions for the test set that optimize

accuracy (maximize)
ROC area (maximize)
cross entropy (minimize)
q-score: a domain-specific performance measure sometimes used in particle physics to measure the effectiveness of prediction models. (maximize: larger q-scores indicate better performance.)
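To make the first three measures concrete, here is a minimal Python sketch of how they might be computed from 0/1 labels and predicted probabilities. It is not the official evaluation software, and several details are assumptions made for illustration: the function names, the 0.5 threshold for accuracy, averaging cross entropy over examples, and the clipping constant eps. The official software may normalize or break ties differently. The q-score is left out because it is not defined in this description.

```python
import math

def accuracy(labels, preds, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the 0/1 label.
    The 0.5 threshold is an assumption, not taken from the official software."""
    correct = sum((p >= threshold) == (y == 1) for y, p in zip(labels, preds))
    return correct / len(labels)

def roc_area(labels, preds):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic.
    Tied predictions share the average of their ranks.
    Assumes at least one positive and one negative example."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied predictions.
        while j + 1 < len(order) and preds[order[j + 1]] == preds[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cross_entropy(labels, preds, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predicted
    probabilities. Predictions are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)

# Toy usage with made-up labels and predictions.
toy_labels = [1, 0, 1, 0]
toy_preds = [0.9, 0.2, 0.6, 0.7]
print(accuracy(toy_labels, toy_preds))
print(roc_area(toy_labels, toy_preds))
print(cross_entropy(toy_labels, toy_preds))
```

Note that accuracy depends on where the predictions fall relative to a threshold, ROC area depends only on their ordering, and cross entropy depends on the predicted values themselves. That is one reason a single set of predictions is unlikely to be best for all measures.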

We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.

Software that calculates q-score (and the other seven performance measures) is available from the Software download web page. We'll add a description of q-score to the FAQ web page in the next few days. Until then, just think of it as a black-box performance metric: you are supposed to train models that maximize this score.

Protein Homology Prediction Task

Additional information about the Protein Homology problem is now available from Ron Elber.

For this task, the goal is to predict which proteins are homologous to a native sequence. The data is grouped into blocks around each native sequence. We provide 153 native sequences as the training set, and 150 native sequences as the test set. For each native sequence, there is a set of approximately 1000 protein sequences for which homology prediction is needed. Homologous sequences are marked as positive examples; non-homologous sequences (also called "decoys") are negative examples.

If you are not familiar with protein matching and structure prediction, it might be helpful to think of this as a WWW search engine problem. There are 153 queries in the training set, and 150 queries in the test set. For each query there are about 1000 returned documents, and only a few of them are correct matches for that query. The goal is to predict which of the 1000 documents best match the query, based on 74 attributes that measure the match. A good set of match predictions will rank the "homologous" documents near the top.

Evaluation measures are applied to each block corresponding to a native sequence, and then averaged over all blocks. Most of the measures are rank based and assume that your predictions provide a ranking within each block. Your task is to provide 4 sets of predictions for the test set that optimize

fraction of blocks with a homologous sequence ranked top 1 (maximize)
average rank of the lowest ranked homologous sequence (minimize)
root mean squared error (minimize)
average precision (average of the average precision in each block) (maximize)

Note that three of the measures (TOP 1, Average Last Rank, and average precision) depend only on the relative ordering of the matches within each block, not on the predicted values themselves.
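The block-wise evaluation described above can be sketched in Python as follows. This is an illustration, not the official evaluation software. The function names and the (block_id, label, prediction) row layout are hypothetical. The sketch assumes that larger predictions mean "more likely homologous", that every block contains at least one homologous sequence, and that ties are broken by input order; the official software may treat ties differently.

```python
import math
from collections import defaultdict

def rank_labels(labels, preds):
    """Labels reordered by descending prediction (best match first).
    Ties keep their input order, which is an assumption of this sketch."""
    order = sorted(range(len(preds)), key=lambda i: -preds[i])
    return [labels[i] for i in order]

def top1(ranked):
    """1.0 if the top-ranked sequence in the block is homologous, else 0.0."""
    return 1.0 if ranked[0] == 1 else 0.0

def last_rank(ranked):
    """1-based rank of the lowest-ranked homologous sequence in the block."""
    return max(i + 1 for i, y in enumerate(ranked) if y == 1)

def average_precision(ranked):
    """Mean of precision-at-k taken at the rank k of each homologous sequence."""
    hits, total = 0, 0.0
    for k, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

def rmse(labels, preds):
    """Root mean squared error between 0/1 labels and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def evaluate(rows):
    """Compute each measure per block, then average over all blocks.
    rows: iterable of (block_id, label, prediction) tuples (hypothetical layout)."""
    blocks = defaultdict(lambda: ([], []))
    for block_id, label, pred in rows:
        blocks[block_id][0].append(label)
        blocks[block_id][1].append(pred)
    sums = {"top1": 0.0, "last_rank": 0.0, "rmse": 0.0, "avg_precision": 0.0}
    for labels, preds in blocks.values():
        ranked = rank_labels(labels, preds)
        sums["top1"] += top1(ranked)
        sums["last_rank"] += last_rank(ranked)
        sums["rmse"] += rmse(labels, preds)
        sums["avg_precision"] += average_precision(ranked)
    return {name: total / len(blocks) for name, total in sums.items()}

# Toy usage: two made-up blocks with four and three candidate sequences.
toy_rows = [
    (1, 1, 0.9), (1, 0, 0.8), (1, 0, 0.1), (1, 1, 0.3),
    (2, 0, 0.7), (2, 1, 0.6), (2, 0, 0.2),
]
print(evaluate(toy_rows))
```

Applying any strictly increasing transformation to the predictions (for example, cubing positive scores) leaves the three rank-based results unchanged but generally changes the root mean squared error, which is why that measure may call for a separately calibrated set of predictions.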

We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.
