Entropy

H(X) = -\sum_{x} p(x)\,\log_2 p(x)

Entropy is a measure of the unpredictability of information content: the expected surprise of an outcome drawn from the distribution.
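As a quick check of the definition, here is a minimal Python sketch (the `entropy` helper name and the coin distributions are purely illustrative) computing entropy in bits: a fair coin is maximally unpredictable, a biased coin less so.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits, less unpredictable
```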

To evaluate a language model, we should measure how much surprise it gives us on real sequences in that language. For each real symbol (word or letter) encountered, the language model gives a probability p, and we use -log(p) to quantify the surprise (base-2 logarithms throughout, so surprise is measured in bits). We then average the total surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's, the average surprise given by a model that puts probability 1/3 on one letter and 2/3 on the other (the 1/3-2/3 model) will be:
\frac{1}{1000}\Big(500\cdot\big(-\log_2\tfrac{1}{3}\big) + 500\cdot\big(-\log_2\tfrac{2}{3}\big)\Big) = \tfrac{1}{2}\log_2 3 + \tfrac{1}{2}\log_2\tfrac{3}{2} \approx 1.085\ \text{bits per letter}

While the correct 1/2-1/2 model will give:

\frac{1}{1000}\Big(500\cdot\big(-\log_2\tfrac{1}{2}\big) + 500\cdot\big(-\log_2\tfrac{1}{2}\big)\Big) = \log_2 2 = 1\ \text{bit per letter}

So we can see that the 1/3-2/3 model gives more surprise, which indicates it is worse than the correct model. Only when the sequence is long enough does the empirical average approach the expectation under the true 1/2-1/2 distribution; if the sequence is short, the comparison won't give a convincing result.
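The comparison can be reproduced with a minimal Python sketch; the `average_surprise` helper and the dictionary-based models are illustrative names, and base-2 logarithms are assumed so the result is in bits per letter.

```python
import math

# The example above: a 1000-letter sequence with 500 A's and 500 B's,
# scored by the wrong 1/3-2/3 model and the correct 1/2-1/2 model.
sequence = ['A'] * 500 + ['B'] * 500

def average_surprise(seq, model):
    """Average of -log2 p(symbol) over the sequence, in bits per symbol."""
    return sum(-math.log2(model[s]) for s in seq) / len(seq)

# Which letter gets 1/3 and which gets 2/3 doesn't matter here, since the counts are equal.
wrong_model = {'A': 1/3, 'B': 2/3}
correct_model = {'A': 1/2, 'B': 1/2}

print(average_surprise(sequence, wrong_model))    # ~1.085 bits per letter
print(average_surprise(sequence, correct_model))  # 1.0 bit per letter
```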

Cross Entropy

H(p, q) = -\sum_{x} p(x)\,\log_2 q(x)

The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set when the coding scheme is optimized for an "unnatural" probability distribution q rather than for the "true" distribution p.
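The average surprise computed in the previous section is exactly this quantity: the cross entropy between the true 1/2-1/2 distribution and the 1/3-2/3 model. A minimal Python sketch of the definition (the `cross_entropy` helper name is illustrative, base-2 logarithms assumed):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

p = {'A': 1/2, 'B': 1/2}  # "true" distribution
q = {'A': 1/3, 'B': 2/3}  # "unnatural" model distribution

print(cross_entropy(p, q))  # ~1.085 bits: the average surprise from the example above
print(cross_entropy(p, p))  # 1.0 bit: equals H(p) when the coding scheme matches the truth
```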

Joint Entropy

H(X, Y) = -\sum_{x,\,y} p(x, y)\,\log_2 p(x, y)
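A minimal Python sketch of the definition, using a toy joint distribution chosen purely for illustration:

```python
import math

def joint_entropy(pxy):
    """H(X, Y) = -sum_{x,y} p(x, y) * log2 p(x, y), in bits."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

# Toy joint distribution over pairs (x, y), an illustrative assumption.
pxy = {('a', 0): 1/2, ('a', 1): 1/4, ('b', 0): 1/8, ('b', 1): 1/8}
print(joint_entropy(pxy))  # 1.75 bits
```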

Conditional Entropy

H(Y \mid X) = -\sum_{x,\,y} p(x, y)\,\log_2 p(y \mid x) = H(X, Y) - H(X)
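In practice it is often easier to use the chain rule H(Y | X) = H(X, Y) - H(X) than to work with p(y | x) directly. A sketch on the same toy joint distribution as above (the distribution is an illustrative assumption, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same toy joint distribution as in the joint entropy example.
pxy = {('a', 0): 1/2, ('a', 1): 1/4, ('b', 0): 1/8, ('b', 1): 1/8}

# Marginal of X, then H(Y | X) = H(X, Y) - H(X) by the chain rule.
px = {}
for (x, _), p in pxy.items():
    px[x] = px.get(x, 0.0) + p

h_xy = entropy(pxy.values())  # 1.75 bits
h_x = entropy(px.values())    # p(a) = 3/4, p(b) = 1/4  ->  ~0.811 bits
print(h_xy - h_x)             # H(Y | X) ~0.939 bits
```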

Mutual Information

I(X; Y) = \sum_{x,\,y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)} = H(X) + H(Y) - H(X, Y)
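Mutual information can be computed from the three entropies via the identity above; a sketch on the same toy joint distribution (again an illustrative assumption):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same toy joint distribution as in the previous examples.
pxy = {('a', 0): 1/2, ('a', 1): 1/4, ('b', 0): 1/8, ('b', 1): 1/8}

# Marginals of X and Y.
px, py = {}, {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# I(X; Y) = H(X) + H(Y) - H(X, Y): close to zero here, so X and Y are nearly independent.
print(entropy(px.values()) + entropy(py.values()) - entropy(pxy.values()))  # ~0.016 bits
```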

Kullback-Leibler Divergence

D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)
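Back to the language model example: the KL divergence from the true 1/2-1/2 distribution to the 1/3-2/3 model is exactly the extra surprise the wrong model costs, H(p, q) - H(p). A minimal Python sketch (the `kl_divergence` helper name is illustrative, base-2 logarithms assumed):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {'A': 1/2, 'B': 1/2}  # true distribution
q = {'A': 1/3, 'B': 2/3}  # model distribution

print(kl_divergence(p, q))  # ~0.085 bits: 1.085 (cross entropy) - 1.0 (entropy)
print(kl_divergence(p, p))  # 0.0: the divergence vanishes when the distributions match
```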