Norbert Fuhr's recommendations for gaining scientific knowledge from experiments:

1. Do not use MRR or MAP;
2. Instead of relative improvements, report the effect size!
3. For multiple significance tests, use a correction, such as Bonferroni or Tukey's HSD (NB: comparing only to the 2nd best method does not help!)
4. There are no significant improvements for re-usable test collections! (hypotheses have to be formulated before the work)


5. Ignore results for collections when there is no baseline from independent research;
6. Test collections wear out! The expected maximum result increases with the number of runs (as seen on leaderboards; see Carterette's SIGIR paper)
7. Conferences and journals need to accept papers with "null" results (to discourage the busy beavers and the p-hackers). Reproducibility is important: publish your code and data.
8. Evaluation initiatives are important (but they should only run proper measures and methods)
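
To make recommendations 2 and 3 concrete, here is a minimal Python sketch. It is an illustration under assumptions, not something taken from the talk: the effect size shown is Cohen's d on paired per-topic differences (one common choice; the recommendations do not name a specific measure), the per-topic scores are made up, and the function names `paired_effect_size` and `bonferroni` are hypothetical.

```python
from statistics import mean, stdev


def paired_effect_size(baseline, system):
    """Cohen's d for paired samples (an assumed choice of effect size):
    mean per-topic difference divided by the standard deviation of the differences."""
    if len(baseline) != len(system):
        raise ValueError("both systems must be scored on the same topics")
    diffs = [s - b for b, s in zip(baseline, system)]
    sd = stdev(diffs)  # sample standard deviation, needs at least 2 topics
    if sd == 0:
        raise ValueError("all differences are identical; effect size is undefined")
    return mean(diffs) / sd


def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: with m tests, reject a null hypothesis
    only if its p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Made-up per-topic scores, for illustration only.
baseline_scores = [0.30, 0.40, 0.50, 0.60]
system_scores = [0.35, 0.42, 0.58, 0.61]
print("effect size:", paired_effect_size(baseline_scores, system_scores))

# Made-up p-values from four comparisons against the same baseline.
print("significant after correction:", bonferroni([0.001, 0.02, 0.04, 0.3]))
```

Note that two of the made-up p-values fall below 0.05 on their own but no longer count as significant once the threshold is divided by the number of tests.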

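Recommendation 6 can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is mine, not the talk's: every run has the same true effectiveness and differs only by Gaussian noise. The function name `expected_best_score` and all parameter values are hypothetical.

```python
import random


def expected_best_score(n_runs, true_score=0.5, noise_sd=0.02, trials=2000, seed=0):
    """Average, over many simulated leaderboards, of the best observed score
    among n_runs runs that all share the same true effectiveness."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(true_score, noise_sd) for _ in range(n_runs))
    return total / trials


for n in (1, 10, 100):
    print(n, "runs -> expected best score", round(expected_best_score(n), 4))
```

Under this assumption no run is actually better than any other, yet the top of the simulated leaderboard keeps rising as more runs are submitted.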
