Evaluation of IR
Why do we need evaluation?
- To build IR systems that satisfy users' information needs
System Efficiency
• Speed
• Storage
• Memory
• Cost
System Effectiveness
• Quality of search results
• Does it find what I’m looking for?
• Does it return lots of junk?
To improve system effectiveness
IR system design:
– Which tokenizer? Which stemmer?
– Which scoring method?
– tf-idf or wf-idf?
– Length normalization or not?
– Remove stop words?
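To make the tf-idf versus wf-idf choice concrete, here is a minimal sketch. It assumes the common textbook definitions: idf = log10(N / df), and wf = 1 + log10(tf) for tf > 0 (sublinear tf scaling). The function names and the sample numbers are illustrative, not taken from these notes.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency, assuming the log10(N / df) form."""
    return math.log10(n_docs / df) if df > 0 else 0.0


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Weight grows linearly with the raw term frequency."""
    return tf * idf(df, n_docs)


def wf_idf(tf: int, df: int, n_docs: int) -> float:
    """Weight grows logarithmically with the term frequency."""
    if tf <= 0:
        return 0.0
    return (1 + math.log10(tf)) * idf(df, n_docs)


# Hypothetical term: appears in 10 of 1000 documents, so idf = 2.
for tf in (1, 10, 100):
    print(tf, tf_idf(tf, 10, 1000), wf_idf(tf, 10, 1000))
```

With these assumed definitions, a term occurring 100 times gets 100 times the weight of a single occurrence under tf-idf, but only 3 times the weight under wf-idf. Which behaviour gives better search results is exactly the kind of question evaluation is meant to answer.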
Example
A test collection is a set of relevance judgments on (query, document) pairs.
• Query 1
– Doc 1: relevant
– Doc 2: irrelevant
– Doc 3: irrelevant
– Doc 4: relevant
– Doc 5: irrelevant
• Query 2
– Doc 1: irrelevant
– Doc 2: irrelevant
– Doc 3: relevant
– Doc 4: irrelevant
– Doc 5: relevant
This relevancy information is known as the ground truth. It is typically constructed by trained human annotators.
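The example test collection above could be stored as a simple lookup table. The sketch below is one possible representation, not a standard format. The helper names and the sample system output are hypothetical, and treating unjudged documents as irrelevant is an assumption.

```python
# Ground truth from the example: query -> {document: is it relevant?}
test_collection = {
    "Query 1": {"Doc 1": True, "Doc 2": False, "Doc 3": False,
                "Doc 4": True, "Doc 5": False},
    "Query 2": {"Doc 1": False, "Doc 2": False, "Doc 3": True,
                "Doc 4": False, "Doc 5": True},
}


def relevant_docs(query: str) -> set[str]:
    """Return the documents judged relevant for a query."""
    return {doc for doc, rel in test_collection[query].items() if rel}


def judge_results(query: str, returned: list[str]) -> list[tuple[str, bool]]:
    """Label each returned document using the ground truth.

    Assumption: a document with no judgment counts as irrelevant.
    """
    judgments = test_collection[query]
    return [(doc, judgments.get(doc, False)) for doc in returned]


# Hypothetical output of some IR system for Query 1.
system_output = ["Doc 4", "Doc 2", "Doc 1"]
labelled = judge_results("Query 1", system_output)
hits = sum(rel for _, rel in labelled)
print(labelled)
print(f"{hits} relevant, {len(labelled) - hits} junk")
```

Comparing a system's output against the ground truth in this way gives direct answers to the two effectiveness questions: did it find what the user was looking for, and how much junk did it return?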