Evaluation of IR

Why do we need evaluation?

  • To build IR systems that satisfy users’ information needs

System Efficiency
• Speed
• Storage
• Memory
• Cost

System Effectiveness
• Quality of search results
• Does it find what I’m looking for?
• Does it return lots of junk?

To improve system effectiveness

IR system design:
– Which tokenizer? Which stemmer?
– Which scoring method?
– tf-idf or wf-idf?
– Length normalization or not?
– Remove stop words?
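
To make the "tf-idf or wf-idf?" choice concrete, here is a minimal sketch of the two term weights. It assumes that wf denotes the common sublinear scaling of term frequency (1 + log tf when tf > 0, otherwise 0) and that logarithms are base 10. The function names and the example counts are illustrative only and do not come from these notes.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency; base-10 log is an assumption of this sketch."""
    return math.log10(n_docs / df) if df > 0 else 0.0


def tf_idf(tf, df, n_docs):
    """Weight that grows linearly with the raw term frequency."""
    return tf * idf(df, n_docs)


def wf_idf(tf, df, n_docs):
    """Weight that uses sublinear scaling: wf = 1 + log10(tf) if tf > 0, else 0."""
    wf = 1 + math.log10(tf) if tf > 0 else 0.0
    return wf * idf(df, n_docs)


# Hypothetical counts: a term occurs 100 times in a document and appears
# in 10 of the 1000 documents in the collection.
raw_weight = tf_idf(100, 10, 1000)
damped_weight = wf_idf(100, 10, 1000)
```

With these counts the raw weight is far larger than the damped one, which is the effect the wf variant is meant to control. Which variant works better for a given collection is exactly the kind of question that evaluation is meant to answer.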

Example

A test collection is a set of relevance judgments on (query, document) pairs.
• Query 1
– Doc 1: relevant
– Doc 2: irrelevant
– Doc 3: irrelevant
– Doc 4: relevant
– Doc 5: irrelevant
• Query 2
– Doc 1: irrelevant
– Doc 2: irrelevant
– Doc 3: relevant
– Doc 4: irrelevant
– Doc 5: relevant

This relevancy information is known as the ground truth. It is typically constructed by trained human annotators.
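
The example test collection above could be stored as a nested mapping, as in the sketch below. The names `qrels`, `relevant_docs`, and `count_relevant_retrieved` are hypothetical, and the retrieved list is made up for illustration; only the relevance labels are taken from the example.

```python
# Ground truth from the example: qrels[query][doc] is True when the
# document was judged relevant to the query.
qrels = {
    "Query 1": {"Doc 1": True, "Doc 2": False, "Doc 3": False,
                "Doc 4": True, "Doc 5": False},
    "Query 2": {"Doc 1": False, "Doc 2": False, "Doc 3": True,
                "Doc 4": False, "Doc 5": True},
}


def relevant_docs(qrels, query):
    """Return the set of documents judged relevant for a query."""
    return {doc for doc, is_relevant in qrels[query].items() if is_relevant}


def count_relevant_retrieved(qrels, query, retrieved):
    """Count how many of the retrieved documents are judged relevant."""
    return len(relevant_docs(qrels, query) & set(retrieved))


# Hypothetical system output for Query 1 (not from the notes).
retrieved = ["Doc 1", "Doc 2", "Doc 4"]
hits = count_relevant_retrieved(qrels, "Query 1", retrieved)
```

Here two of the three retrieved documents (Doc 1 and Doc 4) are relevant. Comparing a system's output against the ground truth in this way is the basis for measuring effectiveness.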