Good to read Sakai's reply to Fuhr’s Guidelines for Information Retrieval Evaluation:

@djoerd Sakai made some good points. But MAP's user model was reverse-engineered decades after its inception. And what user model would actually consider the difference between ranks 1 and 2 and the difference between rank 2 and infinity the same?
Maybe we as an IR community should identify classes of user models (e.g. "adhoc", "automated") and the best-known measure for each class. Otherwise papers will be tempted to cherry-pick the measure that best shows the advantage of the paper's contribution.

@tfidf @djoerd well, most researchers do! We report NDCG@20 as a model of first-result-page quality, MAP as a model averaging over all users, and P@5 as a model of early precision. Each has its pros and cons, which is why you should not report just one. And it should match the use case too!
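For readers who want the measures mentioned here made concrete: a minimal sketch of P@k, (per-query) Average Precision, and binary-gain nDCG@k, assuming a ranking is given as a 0/1 relevance list by rank (the `ranking` below is a made-up toy example, not from any of the cited papers):

```python
from math import log2

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0

def ndcg_at_k(rels, k):
    """Binary-gain nDCG: DCG of this ranking divided by DCG of the ideal one."""
    dcg = sum(r / log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # toy binary relevance judgments by rank
print(precision_at_k(ranking, 5))    # 0.4
print(average_precision(ranking, 3))
print(ndcg_at_k(ranking, 10))
```

MAP is then just `average_precision` averaged over all queries of a test collection.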

@arjen @tfidf I personally like MAP a lot as a measure (assuming binary relevance). According to Buckley & Voorhees, "Average Precision seems to be a reasonably stable and discriminating choice."

@arjen @tfidf I think for MAP, the difference between ranks 1 and 2 is usually smaller than between rank 2 and infinity (assuming there are multiple relevant documents). If there is only one relevant document, MAP equals MRR, and yes, then the difference between ranks 1 and 2 is big.
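The single-relevant-document claim is easy to check in code. A small sketch (my own toy example, not from the thread): with exactly one relevant document, Average Precision reduces to the reciprocal rank.

```python
def average_precision(rels, num_relevant):
    """AP over a 0/1 relevance list by rank."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, 0 if none found."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1 / i
    return 0.0

# One relevant document at rank 3: AP and RR both equal 1/3.
single = [0, 0, 1, 0, 0]
print(average_precision(single, 1))  # 0.3333...
print(reciprocal_rank(single))       # 0.3333...
```

Averaged over queries, these become MAP and MRR, hence the equality in the single-relevant case.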

@djoerd @arjen I'm a bit behind on reading papers, but there has been at least one paper showing that ad-hoc users can adapt well when retrieval effectiveness deteriorates. So ranks 1 and 2 might not make a big difference if the snippets are good, while ranks 2 and infinity certainly do.

@tfidf @arjen I see your point. What measure would capture that better?

@djoerd @tfidf Maybe RBP captures it better? P@20 together with MAP is pretty good too, no?

@arjen @tfidf Rank Biased Precision by Moffat & Zobel, right? (Now I also need to read up on my papers)
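For reference, Moffat & Zobel's Rank-Biased Precision models a user who inspects the next result with a fixed persistence probability p, so rank i contributes with weight (1-p)·p^(i-1). A minimal sketch over a toy binary ranking (the example list and p values are my own illustration):

```python
def rbp(rels, p=0.8):
    """Rank-Biased Precision: (1-p) * sum of p^(i-1) * rel_i over ranks i."""
    return (1 - p) * sum(r * p ** (i - 1) for i, r in enumerate(rels, start=1))

ranking = [1, 0, 1, 0, 0, 1]   # toy binary relevance by rank
print(rbp(ranking, p=0.5))     # impatient user: top ranks dominate
print(rbp(ranking, p=0.95))    # persistent user: deeper ranks still count
```

Tuning p is how RBP encodes the "ranks 1 vs 2" versus "2 vs infinity" trade-off discussed above.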

@djoerd @arjen how about nDCG@6? We can make it 10. But 20 is way too much in my opinion, if we want to measure just the first result page from a user's perspective.

@arjen @djoerd anything@20 seems like a strange result-page measure, since that eye-tracking paper by Joachims showed most web users can't get themselves to scroll to the 7th result item. But it's good to know there are standard measures. Maybe PCs could adopt a list of dos and don'ts that incorporates the widely accepted Fuhr criteria and standard measures like these.

The "unofficial" Information Retrieval Mastodon Instance.