What do we have so far?

  • A feature space with a similarity measure
  • This is a classic learning problem!
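
As a minimal sketch of such a feature space, assume each document is represented by raw term counts and compared with cosine similarity. The whitespace tokenizer, the function names `term_counts` and `cosine_similarity`, and the sample sentences are illustrative assumptions, not part of the original material.

```python
import math
from collections import Counter

def term_counts(text):
    """Map a document to a sparse bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
d1 = term_counts("stock markets fell sharply today")
d2 = term_counts("stock markets rose today")
d3 = term_counts("the team won the final match")

print(cosine_similarity(d1, d2))  # shares three terms with d1
print(cosine_similarity(d1, d3))  # shares no terms with d1
```

Raw counts are the simplest choice here; a weighting scheme such as TF-IDF would plug into the same similarity function.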

We can use standard classification or clustering methods to solve problems in:

  • Keyword-based association analysis
  • Automatic document classification
  • Similarity detection
  • Link analysis: find unusual correlations between entities
    • Cluster documents by a common author
    • Cluster documents containing information from a common source
  • Sequence analysis: predicting a recurring event
  • Anomaly detection: find information that violates usual patterns
  • Hypertext analysis
    • Patterns in anchors/links (for example, anchor text correlations with linked objects)
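
To illustrate how a standard method applies once documents live in a feature space, here is a hedged sketch of automatic document classification using a 1-nearest-neighbour rule over cosine similarity. The choice of 1-NN, the helper names (`vectorize`, `cosine`, `classify`), and the tiny labelled e-mail set are assumptions made for illustration only; any standard classifier could take its place.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(text, labeled_docs):
    """Return the label of the most similar labelled document (1-NN)."""
    query = vectorize(text)
    best_label, _ = max(
        ((label, cosine(query, vectorize(doc))) for doc, label in labeled_docs),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical training data for an e-mail filter.
training = [
    ("win a free prize now", "spam"),
    ("cheap loans click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]

print(classify("claim your free prize", training))
print(classify("agenda for the project meeting", training))
```

The same vectors and similarity measure could feed a clustering method instead, for example to group documents by a common author or source.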

Example applications: news article classification, automatic e-mail filtering, Web page classification, hate blog detection, etc.