Evaluation of IR
Why do we need evaluation?
- To build IR systems that satisfy users’ information needs
System Efficiency 
• Speed 
• Storage 
• Memory 
• Cost 
System Effectiveness
• Quality of search results 
• Does it find what I’m looking for? 
• Does it return lots of junk?
To improve system effectiveness
IR system design: 
– Which tokenizer? Which stemmer? 
– Which scoring method? 
– tf-idf or wf-idf?
– Length normalization or not? 
– Remove stop words?
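One of the choices above, tf-idf versus wf-idf, can be made concrete with a small sketch. It assumes that wf means the usual sublinear scaling of term frequency, wf = 1 + log(tf) for tf > 0, and that logarithms are base 10. The function names `tf_idf` and `wf_idf` are illustrative only and do not come from any particular library.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df); 0.0 if the term occurs in no document."""
    if df == 0:
        return 0.0
    return math.log10(n_docs / df)


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Weight a term by its raw term frequency times idf."""
    return tf * idf(df, n_docs)


def wf_idf(tf: int, df: int, n_docs: int) -> float:
    """Weight a term by sublinearly scaled term frequency (assumed: 1 + log10(tf)) times idf."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * idf(df, n_docs)


# A term that occurs 10 times in a document and appears in 10 of 1000 documents:
# raw tf grows linearly with repetitions, while wf dampens them.
raw_weight = tf_idf(10, 10, 1000)
damped_weight = wf_idf(10, 10, 1000)
```

Which of the two weightings gives better search results cannot be decided from the formulas alone; that is the kind of question evaluation is meant to answer.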
Example
A test collection is a collection of relevance judgments on (query, document) pairs. 
• Query 1 
– Doc 1: relevant 
– Doc 2: irrelevant 
– Doc 3: irrelevant 
– Doc 4: relevant 
– Doc 5: irrelevant 
• Query 2 
– Doc 1: irrelevant 
– Doc 2: irrelevant 
– Doc 3: relevant 
– Doc 4: irrelevant 
– Doc 5: relevant 
This relevancy information is known as the ground truth. It is typically constructed by trained human annotators.
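The test collection from the example can be sketched as a nested dictionary, with a helper that checks a system's returned documents against the ground truth. This is only an illustration: the names `GROUND_TRUTH`, `relevant_docs` and `count_hits` are hypothetical, and the returned list in the usage line is made up for demonstration.

```python
# Ground truth from the example above: query -> {document -> is it relevant?}
GROUND_TRUTH = {
    "Query 1": {"Doc 1": True, "Doc 2": False, "Doc 3": False, "Doc 4": True, "Doc 5": False},
    "Query 2": {"Doc 1": False, "Doc 2": False, "Doc 3": True, "Doc 4": False, "Doc 5": True},
}


def relevant_docs(query: str) -> set[str]:
    """Return the set of documents judged relevant for the query."""
    return {doc for doc, is_relevant in GROUND_TRUTH[query].items() if is_relevant}


def count_hits(query: str, returned: list[str]) -> tuple[int, int, int]:
    """Return (relevant documents returned, junk returned, relevant documents missed)."""
    relevant = relevant_docs(query)
    returned_set = set(returned)
    hits = len(returned_set & relevant)    # did it find what I'm looking for?
    junk = len(returned_set - relevant)    # did it return lots of junk?
    missed = len(relevant - returned_set)  # relevant documents it failed to return
    return hits, junk, missed


# Hypothetical system output for Query 1: two relevant documents and one irrelevant one.
print(count_hits("Query 1", ["Doc 1", "Doc 2", "Doc 4"]))  # (2, 1, 0)
```

The three counts correspond to the two effectiveness questions asked earlier: whether the system finds what the user is looking for, and whether it returns junk.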
 
                         
                                

