Abstract

We participated in the Learning Task of the TREC 2010 Legal Track, focusing solely on estimating probabilities of relevance. We submitted three automated runs based on the same tf.idf ranking, produced by the topic narratives and positive-only feedback of the training data in equal contributions. The runs differ in the way the probabilities of relevance are estimated: (1) DUTHsdtA employed the Truncated Normal-Exponential model to turn scores to probabilities. (2) DUTHsdeA did not assume any specific component score distributions but estimated those on the scores of training data via Kernel Density Estimation (KDE) methods. (3) DUTHlrgA used Logistic Regression with the co-efficients estimated on the scores of training data. We found that DUTHsdeA and DUTHlrgA are greatly affected by biases in the training set, since they assume that input score data are uniformly sampled. Also, KDE was found to be very sensitive to its parameters, influencing greatly the probability estimates. In these respects, DUTHsdtA was proven to be the most robust method.