WebCrowd25K Dataset

About the Dataset

The dataset includes three related parts:

Crowd Relevance Judgments. 25,099 information retrieval relevance judgments collected on Amazon's Mechanical Turk platform. For each of the 50 search topics from the 2014 NIST TREC WebTrack, we selected 100 ClueWeb12 documents to be re-judged (without reference to the original TREC assessor judgment) by 5 MTurk workers each (50 topics x 100 documents x 5 workers = 25K crowd judgments). Individual worker IDs from the platform are hashed to new identifiers. We collect relevance judgments on a 4-point graded scale. (See SIGIR'18 & HCOMP'18 papers).
Behavioral Data. For a subset of the judgments, we also collected behavioral data charactering worker behavior in performing the relevance judging. Behavioral data was recorded using MmmTurkey, which captures a variety of worker interaction behaviors while completing MTurk Human Intelligence Tasks. (See HCOMP'18 paper)
Disagreement Analysis. We inspected 1000 crowd judgments for 200 documents (5 judgments per document, where the aggregated crowd judgment differs from the original TREC assessor judgment), and we classified each disagreement according to our disagreement taxonomy. (See SIGIR'18 paper.)

Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed and Matthew Lease. Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations. In Proceedings of the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 41-49, 2018.

Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed and Matthew Lease. What Can Rationales behind Relevance Judgments Tell Us About Assessor Disagreement? In Proceedings of the 41st international ACM SIGIR conference on Research and development in Information Retrieval, pages 805-814, 2018.

Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 139-148, 2016. Best Paper Award. [ pdf | blog-post | data | slides ]

Brandon Dang, Miles Hutson, and Matthew Lease. MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk. In 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP): Works-in-Progress Track, 2016. 3 pages. arXiv:1609.00945. [ pdf | sourcecode ]

You can download the entire dataset here. Please refer to the included README files and associated publications for further details.

For questions and comments email