WEBSRC401 Dataset
Description
WEBSRC401 is a search results clustering (SRC) dataset based on the ClueWeb09 Category B text collection (CCB) and TREC Web Track 2012.
Instead of retrieving relevant Web pages, we are interested in obtaining relevant clusters. We therefore transformed the data available in TREC Web Track 2012 into a typical SRC format in two steps:
- A Web snippet is retrieved for each Web page judged relevant to the query, using the SnippetGenerator function of ChatNoir and allowing a maximum of 500 characters around the query words.
- The subtopics of each query are defined as in TREC Web Track 2012, and each qrel is encoded in a new format that contains the Web page ID, the subtopic ID, and the query.
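The re-encoding of the qrels can be sketched as follows. This is a minimal illustration under stated assumptions, not the conversion script used to build the dataset: it assumes the input follows the TREC diversity qrels layout of whitespace-separated `topic subtopic docno judgment` lines and that only positively judged pages are kept. The names `SrcQrel` and `convert_qrels`, the `topic.subtopic` ID format, and the sample values are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class SrcQrel(NamedTuple):
    """One judgment in the SRC-style encoding (hypothetical record type)."""
    doc_id: str       # Web page ID
    subtopic_id: str  # assumed format: "<topic>.<subtopic>"
    query: str


def convert_qrels(lines: Iterable[str], queries: Dict[str, str]) -> List[SrcQrel]:
    """Turn 'topic subtopic docno judgment' lines into SrcQrel records."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, subtopic, doc_id, judgment = parts
        if int(judgment) <= 0:
            continue  # assumption: keep only pages judged relevant
        records.append(SrcQrel(doc_id, f"{topic}.{subtopic}", queries[topic]))
    return records


# Made-up sample input; real document IDs and queries come from the TREC data.
sample_lines = ["151 1 doc-A 1", "151 2 doc-B 0", "152 1 doc-C 2"]
sample_queries = {"151": "first example query", "152": "second example query"}
records = convert_qrels(sample_lines, sample_queries)
```

With the sample input, the second line is dropped because its judgment is 0, leaving one record for each of the other two pages.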
Content
WEBSRC401 is released as a package that contains five files and one folder:
- topics.txt: contains topic ID and description
- subTopics.txt: contains subtopic ID (formed by topic ID and subtopic number) and description
- docs.txt: contains result ID (formed by topic ID and search engine ranking of each result), URL, title, and snippet
- STRel.txt: contains subtopic ID (formed by topic ID and subtopic number) and result ID (formed by topic ID and order in TREC Web Track 2012)
- docid-trecID.txt: contains TREC document ID and result ID
- querylogs/: folder that contains one file per query. Each file lists the query logs used for the intent evaluation.
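As an illustration of how these files can be used, the subtopic-to-result pairs in STRel.txt can be grouped into one reference cluster per subtopic. The sketch below assumes one pair per line with the subtopic ID first and tab-separated fields; the delimiter is not documented here, so check the files in the package and adjust `sep` if needed. The function name `load_reference_clusters` and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_reference_clusters(lines: Iterable[str], sep: str = "\t") -> Dict[str, List[str]]:
    """Group result IDs by subtopic ID, preserving the order of the input."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        subtopic_id, result_id = fields[0].strip(), fields[1].strip()
        clusters[subtopic_id].append(result_id)
    return dict(clusters)


# Made-up sample lines; the real ID format should be read from the files.
sample = ["151.1\t151.3\n", "151.1\t151.7\n", "151.2\t151.3\n", "\n"]
clusters = load_reference_clusters(sample)
```

Because the function accepts any iterable of lines, an open file handle for STRel.txt can be passed in place of the sample list. A result may appear in more than one cluster, since a page can be relevant to several subtopics.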
Download
WEBSRC401.zip
Evaluation Tools
To replicate the experiments, we suggest the following evaluation tools:
Paper
- José G. Moreno, Gaël Dias and Guillaume Cleuziou. Query Log Driven Web Search Results Clustering. SIGIR 2014. [PDF]