Original article: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The goal of this guide is to explore some of the main scikit-learn
tools on a single practical task: analyzing a collection of text documents (newsgroups posts) on twenty different topics.
In this section we will see how to:
- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
1. Tutorial setup
To get started with this tutorial, you must first install scikit-learn and all of its required dependencies.
Please refer to the installation instructions page for more information and for system-specific instructions.
The source of this tutorial can be found within your scikit-learn folder:
scikit-learn/doc/tutorial/text_analytics/
The source can also be found on GitHub.
The tutorial folder should contain the following sub-folders:
- *.rst files - the source of the tutorial document written with sphinx
- data - folder to put the datasets used during the tutorial
- skeletons - sample incomplete scripts for the exercises
- solutions - solutions of the exercises
You can already copy the skeletons into a new folder somewhere on your hard-drive named sklearn_tut_workspace,
where you will edit your own files for the exercises while keeping the original skeletons intact:
$ cp -r skeletons work_directory/sklearn_tut_workspace
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data
sub-folder and run the fetch_data.py
script from there (after having read it first).
For instance:
$ cd $TUTORIAL_HOME/data/languages
$ less fetch_data.py
$ python fetch_data.py
2. Loading the 20 newsgroups dataset
The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.
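As a sketch of how load_files interprets a folder, the snippet below builds a tiny synthetic directory tree instead of the real 20news-bydate-train folder; the two category folders and file contents are made up purely for illustration:

```python
# Minimal sketch of sklearn.datasets.load_files on a synthetic tree.
# In the real tutorial you would point it at the 20news-bydate-train
# folder; here two invented sub-folders stand in for categories.
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

root = Path(tempfile.mkdtemp())
for category, text in [("comp.graphics", "OpenGL renders triangles."),
                       ("sci.med", "The trial measured blood pressure.")]:
    folder = root / category
    folder.mkdir()
    (folder / "0001.txt").write_text(text)

# Each sub-folder name becomes a class label; each file becomes a sample.
dataset = load_files(str(root), encoding="utf-8")
print(dataset.target_names)  # ['comp.graphics', 'sci.med']
print(len(dataset.data))     # 2
```

The same call works unchanged on the uncompressed 20 newsgroups archive, since its on-disk layout is exactly one sub-folder per newsgroup.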
In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:
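With the built-in loader, restricting the download to a subset is a matter of passing a categories list. The four category names below are one possible choice (any subset of the 20 newsgroup names works), and random_state=42 is an arbitrary seed for reproducible shuffling:

```python
# Load only a 4-category slice of the training split of 20 newsgroups.
from sklearn.datasets import fetch_20newsgroups

# Illustrative subset; any 4 of the 20 newsgroup names could be used.
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True, random_state=42)

print(twenty_train.target_names)  # the 4 requested category names
print(len(twenty_train.data))     # number of training documents loaded
```

Note that the first call downloads and caches the dataset, so it can take a while; subsequent calls read from the local cache.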