Results -- SQUARE: A Benchmark for Research on Computing Crowd Consensus

Fig. 1. Unsupervised performance of consensus methods across nine binary-labeled real datasets. Accuracy is plotted relative to a Majority Vote (MV) baseline, and the average performance of each method across all datasets is shown at the right. For the multiple-choice WSD dataset and the multi-class AC2 and HC datasets, results are reported only for DS and ZC.
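The MV-relative accuracy described in this caption can be illustrated with a minimal sketch. The data layout (a list of `(worker, item, label)` triples), the function names, the tie-breaking rule, and the toy labels are all assumptions made for illustration and are not taken from the benchmark. "Relative" is read here as a simple difference from the MV accuracy, which is only one plausible interpretation.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """labels: iterable of (worker, item, label). Returns {item: consensus label}."""
    votes = defaultdict(Counter)
    for _worker, item, label in labels:
        votes[item][label] += 1
    # Assumption: ties are broken deterministically by taking the smallest label.
    return {item: min(c, key=lambda lab: (-c[lab], lab)) for item, c in votes.items()}

def accuracy(pred, gold):
    """Fraction of gold-labeled items whose predicted label matches the gold label."""
    scored = [item for item in gold if item in pred]
    return sum(pred[item] == gold[item] for item in scored) / len(scored)

def relative_accuracy(method_pred, mv_pred, gold):
    """Accuracy of a method minus the accuracy of the MV baseline (assumed definition)."""
    return accuracy(method_pred, gold) - accuracy(mv_pred, gold)

# Hypothetical toy data: three workers, three binary items.
labels = [
    ("w1", "a", 1), ("w2", "a", 1), ("w3", "a", 0),
    ("w1", "b", 0), ("w2", "b", 1), ("w3", "b", 0),
    ("w1", "c", 1), ("w2", "c", 0),
]
gold = {"a": 1, "b": 0, "c": 1}

mv_pred = majority_vote(labels)
other_pred = {"a": 1, "b": 0, "c": 1}  # stand-in for another method's output
print(relative_accuracy(other_pred, mv_pred, gold))
```

A positive value means the method beats the MV baseline on that dataset, and a negative value means it does worse.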

Fig. 2. Light supervision: results across the original datasets with increasing training set size (10% to 90%).

Fig. 3. Full supervision: results across the original datasets with increasing training set size (10% to 90%).
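A sweep over training set sizes from 10% to 90% might be set up as in the sketch below. The splitting procedure (a seeded random shuffle of item identifiers), the 10-point step, and the function name are assumptions for illustration. The captions do not say how the training sets were actually drawn.

```python
import random

def split_by_percent(items, train_percent, seed=0):
    """Shuffle items reproducibly and return (train, test) for a given training percentage."""
    shuffled = sorted(items)           # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical item identifiers; a real run would use the items that have gold labels.
items = list(range(20))

# Assumed step of 10 percentage points between 10% and 90%.
splits = {pct: split_by_percent(items, pct) for pct in range(10, 100, 10)}
for pct, (train, test) in splits.items():
    print(pct, len(train), len(test))
```

Fixing the seed keeps each training set a prefix of the same shuffled order, so larger training sets contain the smaller ones.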

Fig. 4. Histogram of worker accuracies across nine of the datasets.
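A histogram of this kind could be computed roughly as follows. The `(worker, item, label)` layout, the function names, the equal-width binning over [0, 1], and the toy data are assumptions for illustration, not the benchmark's own code.

```python
from collections import defaultdict

def worker_accuracies(labels, gold):
    """Per-worker fraction of labels that agree with gold, over gold-labeled items only."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker, item, label in labels:
        if item in gold:
            total[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: correct[w] / total[w] for w in total}

def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts

# Hypothetical toy data: three workers, three binary items.
labels = [
    ("w1", "a", 1), ("w1", "b", 0), ("w1", "c", 1),
    ("w2", "a", 1), ("w2", "b", 1), ("w2", "c", 0),
    ("w3", "a", 0), ("w3", "b", 0),
]
gold = {"a": 1, "b": 0, "c": 1}

acc = worker_accuracies(labels, gold)
print(histogram(acc.values(), n_bins=4))
```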

Fig. 5. Histogram of the number of examples labeled per worker.
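The per-worker counts behind such a histogram can be sketched as below. The `(worker, item, label)` layout, the choice to count each distinct item once per worker, and the toy data are assumptions for illustration.

```python
from collections import Counter

def examples_per_worker(labels):
    """Number of distinct items each worker labeled (repeat labels on an item count once)."""
    seen = {(worker, item) for worker, item, _label in labels}
    return Counter(worker for worker, _item in seen)

# Hypothetical toy data; w3 labels item "a" twice.
labels = [
    ("w1", "a", 1), ("w1", "b", 0), ("w1", "c", 1),
    ("w2", "a", 1), ("w2", "b", 1), ("w2", "c", 0),
    ("w3", "a", 0), ("w3", "a", 1), ("w3", "b", 0),
]

per_worker = examples_per_worker(labels)
# Histogram: how many workers labeled exactly k examples.
workers_by_count = Counter(per_worker.values())
print(dict(per_worker), dict(workers_by_count))
```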