Information Retrieval Evaluation Using Test Collections
Important Dates
Initial submissions due: 30 April 2015
Initial reviewer feedback: 18 June 2015
Revised submissions due: 23 July 2015
Final decisions: 27 August 2015
Information retrieval has a strong history of experimental evaluation. Test collections -- consisting of a set of queries, a collection of documents to be searched, and relevance judgments indicating which documents are relevant for which queries -- are perhaps the most widely used tool for evaluating the effectiveness of search systems. Based on pioneering work carried out by Cyril Cleverdon and colleagues at Cranfield University in the 1960s, the popularity of test collections has flourished in large part thanks to evaluation campaigns such as the Text Retrieval Conference (TREC), the Cross-Language Evaluation Forum (CLEF), the NII Testbeds and Community for Information Access Research project (NTCIR), and the Forum for Information Retrieval Evaluation (FIRE).
Test collections have played a vital role in providing a basis for the measurement and comparison of the effectiveness of different information retrieval algorithms and techniques. However, test collections also present a number of issues, from being expensive and complex to construct, to instantiating a particular abstraction of the retrieval process.
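To make that abstraction concrete, the sketch below (with invented topic and document identifiers) represents a toy test collection as relevance judgments plus a ranked run from a hypothetical system, and computes two common effectiveness measures, precision at rank k and average precision. It is a minimal illustration of how such collections support measurement, not a reference implementation.

```python
# Toy test collection: qrels (topic -> relevant documents) and a ranked run.
# All identifiers below are invented for illustration.

qrels = {
    "T1": {"d3", "d7", "d9"},
    "T2": {"d2"},
}

run = {
    "T1": ["d7", "d1", "d3", "d4", "d9", "d5"],
    "T2": ["d8", "d2", "d6"],
}

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

for topic, relevant in qrels.items():
    ranked = run.get(topic, [])
    print(topic,
          "P@5 =", round(precision_at_k(ranked, relevant, 5), 3),
          "AP =", round(average_precision(ranked, relevant), 3))
```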
Topics of interest for this special issue include but are not limited to:
Approaches for constructing new test collections
- choosing representative topics and documents
- minimizing effort
- number of topics and documents required
- completeness of judgments (pooling, stratified sampling, ...; a pooling sketch follows this list)

Test collection stability

Evaluation measures
- choosing measures
- relationship with higher-level search tasks and user goals
- relationship with collection features (assumptions regarding incomplete judgments, ...)

Relevance judgments
- approaches for gathering judgments (crowd-sourcing, dedicated judges, interfaces and support systems for relevance assessments, ...)
- types of judgments (single or multiple assessments, binary or multi-level, ...)
- human factors (topic creators versus assigned topics, assessor consistency, instructions to assessors, expertise, potential biases, ...; an assessor-agreement sketch follows this list)

Test collections as representations of the search process
- assumptions about search behaviour
- user simulation

Relationship between evaluation using test collections and other approaches
- test collections in comparison with user studies and log analyses

Reflections on existing and past initiatives
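On the completeness of judgments mentioned above, one standard approach is depth-k pooling: the top-k documents from each contributing system are merged, and only the resulting pool is judged. A minimal sketch, using invented run data for a single topic:

```python
# Depth-k pooling: merge the top-k documents from each contributing run
# into the set of documents sent to assessors; the rest remain unjudged.

def depth_k_pool(runs, k=10):
    """Union of the top-k documents across all contributing runs."""
    pool = set()
    for ranked in runs:
        pool.update(ranked[:k])
    return pool

runs_for_topic = [
    ["d1", "d4", "d2", "d9"],   # system A's ranking for the topic
    ["d4", "d7", "d1", "d5"],   # system B's ranking
    ["d8", "d1", "d3", "d6"],   # system C's ranking
]

print(sorted(depth_k_pool(runs_for_topic, k=3)))
```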
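Assessor consistency, listed under human factors, is commonly summarised as chance-corrected agreement between assessors. The sketch below computes Cohen's kappa over a pair of invented binary judgment lists; multi-level judgments would need a weighted variant.

```python
# Cohen's kappa as one common way to quantify assessor consistency.
# The paired binary judgments below are invented for illustration.

def cohens_kappa(a, b):
    """Chance-corrected agreement between two lists of binary judgments."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n          # proportion judged relevant by assessor A
    p_b = sum(b) / n          # proportion judged relevant by assessor B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

assessor_a = [1, 1, 0, 1, 0, 0, 1, 0]
assessor_b = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(assessor_a, assessor_b), 3))
```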
Special Issue Editors
Ben Carterette, University of Delaware
Diane Kelly, University of North Carolina
Falk Scholer, RMIT University