To facilitate future research and comparison, we set up an online photo triage benchmark hosted on CodaLab.

We divide the whole dataset into training, validation, and (held-out) test sets. Of the 5,953 series, 4,560 are randomly sampled for training, 195 for validation, and the remaining 967 for testing. One purpose of the validation set is to let researchers check for themselves how the evaluation works before submitting a solution to the online challenge. New results on the test set can be submitted through the CodaLab server and are evaluated in two ways: series-level and pair-level.

• Series level: we evaluate the likelihood that the best image beats all other images in the series, given human preference. For a series k, given the human preferences on each image, we identify the image chosen by humans as the best, denoted winner(k). For a method M, we apply it to a set of image pairs, each consisting of winner(k) and another image from the series, and compute the sum of the logarithms of the prediction probabilities of the decisions M makes on each pair:

$\log L_k(M) = \sum_{i \neq \mathrm{winner}(k)} \log P(M, \mathrm{winner}(k), i)$, where $P(M, j, i)$ is the probability that method $M$ assigns to image $j$ being preferred over image $i$.
• Pair level: since some pairs have a clear human preference while others do not, we keep only the pairs where the majority agreement exceeds 70%, and compute each method's accuracy on these high-consistency pairs.
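The two metrics above can be sketched as short scoring functions. This is a minimal illustration under assumed interfaces: `method_prob(a, b)` is a hypothetical callable returning the probability the method assigns to image `a` beating image `b`, and `human_agreement` is the fraction of annotators preferring the first image of a pair; neither name is from the paper.

```python
import math

def series_log_likelihood(method_prob, winner, images):
    """Series-level score: sum of log P(M, winner(k), i) over the
    non-winner images i of one series."""
    return sum(math.log(method_prob(winner, i))
               for i in images if i != winner)

def pair_accuracy(method_prob, pairs):
    """Pair-level score: accuracy restricted to pairs whose majority
    agreement exceeds 70%.

    pairs: iterable of (img_a, img_b, human_agreement), where
    human_agreement is the fraction of annotators preferring img_a.
    """
    strong = [(a, b, agree) for a, b, agree in pairs
              if max(agree, 1.0 - agree) > 0.70]
    correct = sum((method_prob(a, b) > 0.5) == (agree > 0.5)
                  for a, b, agree in strong)
    return correct / len(strong)
```

A perfect method drives every `P(M, winner(k), i)` toward 1, so the series-level log-likelihood approaches 0 from below; the pair-level metric ignores ambiguous pairs entirely rather than penalizing disagreement on them.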

See the paper for more details on the evaluation protocol.