🚀 Original article: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroups posts) on twenty different topics.

In this section we will see how to:

  • load the file contents and the categories
  • extract feature vectors suitable for machine learning
  • train a linear model to perform categorization
  • use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

1. Tutorial setup

To get started with this tutorial, you must first install scikit-learn and all of its required dependencies.

Please refer to the installation instructions page for more information and for system-specific instructions.

The source of this tutorial can be found within your scikit-learn folder:

  scikit-learn/doc/tutorial/text_analytics/

The source can also be found on GitHub.

The tutorial folder should contain the following sub-folders:

  • *.rst files - the source of the tutorial document, written with Sphinx
  • data - folder to put the datasets used during the tutorial
  • skeletons - sample incomplete scripts for the exercises
  • solutions - solutions of the exercises

You can already copy the skeletons into a new folder somewhere on your hard drive named sklearn_tut_workspace, where you will edit your own files for the exercises while keeping the original skeletons intact:

  $ cp -r skeletons work_directory/sklearn_tut_workspace

Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the fetch_data.py script from there (after reading it first).

For instance:

  $ cd $TUTORIAL_HOME/data/languages
  $ less fetch_data.py
  $ python fetch_data.py

2. Loading the 20 newsgroups dataset

The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.
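If you take the manual route, a minimal sketch of the load_files call might look as follows. The local path 20news-bydate-train assumes you uncompressed the archive into the current directory, and the latin-1 encoding is an assumption (the raw posts are not all valid UTF-8):

  # Hypothetical manual loading; assumes the archive was uncompressed
  # into ./20news-bydate-train (one sub-folder per newsgroup category).
  from sklearn.datasets import load_files

  twenty_train = load_files(
      "20news-bydate-train",
      encoding="latin-1",   # assumption: posts are not all valid UTF-8
      shuffle=True,
      random_state=42,
  )
  print(twenty_train.target_names)  # folder names become category labels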

In order to get faster execution times for this first example, we will work on a partial dataset with only 4 of the 20 available categories:
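A minimal sketch of that restricted load, using four category names that exist in the dataset (shuffle and random_state make the run reproducible):

  # Fetch only the training subset of four chosen newsgroups.
  from sklearn.datasets import fetch_20newsgroups

  categories = ["alt.atheism", "soc.religion.christian",
                "comp.graphics", "sci.med"]

  twenty_train = fetch_20newsgroups(
      subset="train",
      categories=categories,
      shuffle=True,
      random_state=42,
  )
  print(twenty_train.target_names)  # the four categories, sorted
  print(len(twenty_train.data))     # number of training documents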