WEBSRC401 Dataset

Description

WEBSRC401 is a Search Results Clustering (SRC) dataset based on the ClueWeb09 Category B text collection (CCB) and the TREC Web Track 2012.
Instead of retrieving relevant Web pages, we are interested in obtaining relevant clusters. We therefore transformed the data available in the TREC Web Track 2012 into a typical SRC format in two steps:

A Web snippet is retrieved for each Web page judged query-relevant, using the SnippetGenerator function of ChatNoir and allowing a maximum of 500 characters around the query words.
For each query, its subtopics are defined as in the TREC Web Track 2012, and each qrel is encoded in a new format that contains the Web page ID, the subtopic ID and the query.

Content

WEBSRC401 is released as a package that contains five files and one folder:

topics.txt: contains topic ID and description
subTopics.txt: contains subtopic ID (formed by topic ID and subtopic number) and description
docs.txt: contains result ID (formed by topic ID and search engine ranking of each result), URL, title, and snippet
STRel.txt: contains subtopic ID (formed by topic ID and subtopic number) and result ID (formed by topic ID and order in TREC Web Track 2012)
docid-trecID.txt: contains trec document ID and result ID
querylogs/: folder that contains one file per query. Each file contains the list of query logs used for the intent evaluation.
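
As an illustration of how the subtopic-to-result file could be turned into gold-standard clusters, here is a minimal Python sketch. It assumes, without confirmation from the package itself, that each line of STRel.txt holds a subtopic ID and a result ID separated by a tab; the function name load_gold_clusters, the sep parameter and the sample IDs are hypothetical and should be adapted after inspecting the actual files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_gold_clusters(lines: Iterable[str], sep: str = "\t") -> Dict[str, Set[str]]:
    """Group result IDs by subtopic ID; each subtopic forms one gold cluster.

    Assumption: every line is "<subtopic ID><sep><result ID>".
    The real separator in STRel.txt may differ.
    """
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        if len(fields) < 2:
            continue  # skip lines that do not match the assumed layout
        subtopic_id, result_id = fields[0], fields[1]
        clusters[subtopic_id].add(result_id)
    return dict(clusters)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = ["151.1\t151.3", "151.1\t151.7", "151.2\t151.3"]
gold = load_gold_clusters(sample)

# With the real file (path assumed):
# with open("STRel.txt", encoding="utf-8") as handle:
#     gold = load_gold_clusters(handle)
```

Because a result may be relevant to several subtopics, the clusters are kept as sets that are allowed to overlap rather than as a one-to-one assignment.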

Download

WEBSRC401.zip

Evaluation Tools

To replicate the experiments, we suggest the following evaluation tools:

Reliability and Sensitivity from NLP and IR group at UNED
SRCEvaluator (configured for F_1 and F_b^3 metrics)

Paper

José G. Moreno, Gaël Dias and Guillaume Cleuziou. Query Log Driven Web Search Results Clustering. SIGIR 2014. [PDF]