Topic Modeling - LDA - Algorithms

Theory: omitted here.
Uses: discovering the topics of documents, summarization.
Main sklearn parameters:
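1) n_components: the number of latent topics K, which has to be tuned. The right K depends on how finely you want to split topics. For a coarse split, such as telling animals, plants, and non-living things apart, K can be very small (a single digit). For a fine split, such as telling individual animals, individual plants, and individual non-living things apart, K has to be large, in the thousands or tens of thousands, and this in turn requires a very large number of training documents.
2) doc_topic_prior: the parameter α of the Dirichlet prior on the document-topic distribution θd. Without prior knowledge of the topic distribution, the default of 1/K is usually fine.
3) topic_word_prior: the parameter η of the Dirichlet prior on the topic-word distribution βk. Without prior knowledge of the topic-word distribution, the default of 1/K is usually fine.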
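4) learning_method: the inference algorithm for LDA, either 'batch' or 'online'. 'batch' is the variational-inference EM algorithm covered in the theory article. 'online' is the online variational-inference EM algorithm, which extends 'batch' with step-wise training: the training documents are split into mini-batches and the topic-word distribution is updated one batch at a time. The default was 'online' in older releases and has been 'batch' since scikit-learn 0.20. With 'online' you can also train incrementally by calling partial_fit. For small corpora used only for learning, 'batch' is the better choice because it leaves far fewer parameters to tune; for very large corpora, 'online' is the first choice.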
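5) learning_decay: meaningful only with 'online'. The value should lie in (0.5, 1.0] to guarantee asymptotic convergence of the online algorithm. It controls the learning rate of the online algorithm and defaults to 0.7. It rarely needs changing.
6) learning_offset: meaningful only with 'online'. The value should be greater than 1. It reduces the influence of the early mini-batches on the final model.
7) max_iter: the maximum number of EM iterations (passes over the training data) performed by fit.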
8) total_samples: meaningful only with 'online'. It is the total number of documents in the corpus, not the size of a single mini-batch, and it is used only by partial_fit.
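9) batch_size: meaningful only with 'online'. The number of documents used in each EM iteration.
10) mean_change_tol: the stopping tolerance for updating the variational parameters in the E-step. Once the change in the variational parameters falls below this tolerance, the E-step ends and the M-step begins. The default rarely needs changing.
11) max_doc_update_iter: the maximum number of iterations for updating the variational parameters in the E-step. Once this limit is reached, the algorithm moves on to the M-step.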
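
The online-specific parameters above are easiest to see in a small incremental-training run. The following is a minimal sketch, assuming a made-up six-document toy corpus and a chunk size of two (both are illustrative choices, not part of the official example). It feeds the data to partial_fit in chunks, sets total_samples to the full corpus size, and prints the priors that sklearn resolves when they are left at their defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, used only for illustration.
docs = [
    "cats and dogs are animals",
    "dogs chase cats",
    "trees and flowers are plants",
    "flowers grow near trees",
    "rocks and sand are minerals",
    "sand covers the rocks",
]
X = CountVectorizer().fit_transform(docs)  # raw term counts

lda = LatentDirichletAllocation(
    n_components=3,             # K
    learning_method="online",
    learning_decay=0.7,         # should lie in (0.5, 1.0]
    learning_offset=10.0,       # should be greater than 1
    total_samples=X.shape[0],   # total number of documents in the corpus
    random_state=0,
)

# Feed the corpus two documents at a time.
for start in range(0, X.shape[0], 2):
    lda.partial_fit(X[start:start + 2])

# With the priors left unset, both resolve to 1 / n_components.
print(lda.doc_topic_prior_, lda.topic_word_prior_)

# Each row is the topic distribution of one document.
doc_topics = lda.transform(X)
print(np.round(doc_topics, 2))
```

On a corpus this small the learned topics are not meaningful; the point is only how the parameters fit together.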

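As the list shows, with learning_method='batch' there are few parameters to worry about, while with 'online' you also need to watch learning_decay, learning_offset, total_samples, and batch_size. In both cases n_components (K), doc_topic_prior (α), and topic_word_prior (η) matter. Without prior knowledge, the main concern is the number of topics K, which makes K arguably the most important hyperparameter of the LDA topic model.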
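
Since K is the parameter that most needs tuning, one rough way to compare candidate values is to fit a model for each and look at its perplexity (lower is better). The sketch below assumes a made-up toy corpus and an arbitrary candidate set of K values. It scores perplexity on the training data only to keep the example short; in practice a held-out set gives a more trustworthy comparison, and perplexity does not always agree with how interpretable the topics look.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, used only for illustration.
docs = [
    "cats and dogs are animals",
    "dogs chase cats",
    "trees and flowers are plants",
    "flowers grow near trees",
    "rocks and sand are minerals",
    "sand covers the rocks",
]
X = CountVectorizer().fit_transform(docs)

perplexities = {}
for k in (2, 3, 4):  # candidate topic counts, chosen arbitrarily
    lda = LatentDirichletAllocation(
        n_components=k,
        learning_method="batch",
        max_iter=20,
        random_state=0,
    )
    lda.fit(X)
    perplexities[k] = lda.perplexity(X)

for k, p in perplexities.items():
    print(f"K={k}: perplexity={p:.1f}")
```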
Official example:
Note: LDA works on a bag-of-words model, so feed it raw term counts, not tf-idf.
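
As a side note before the official example code, here is a small sketch of what the two vectorizers produce. It is not part of the official example, and the two sentences are made up. CountVectorizer returns integer term counts, which is what the LDA generative model assumes, while TfidfVectorizer returns reweighted, normalized floats.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.toarray())           # integer counts; "the" occurs twice per document
print(tfidf.toarray().round(2))   # fractional weights, each row L2-normalized
```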

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
data_samples = data[:n_samples]
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)
print("\nTopics in LDA model:")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Output:

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
Topics in LDA model:
Topic #0: hiv health aids disease medical care study research said 1993 national april service children test information rules page new dr
Topic #1: drive car disk hard drives game power speed card just like good controller new year bios rom better team got
Topic #2: edu com mail windows file send graphics use version ftp pc thanks available program help files using software time know
Topic #3: vs gm thanks win interested copies john email text st mail copy hi new book division edu buying advance know
Topic #4: performance wanted robert speed couldn math ok change address include organization mr science major university internet edu computer driver kept
Topic #5: space scsi earth moon surface probe lunar orbit mission nasa launch science mars energy bit printer spacecraft probes sci solar
Topic #6: israel 000 section turkish military armenian greek killed state armenians people population attacks women israeli men weapon division dangerous jews
Topic #7: 10 55 11 15 18 12 20 00 13 93 16 19 period 14 17 23 25 22 24 21
Topic #8: key government people law public chip church encryption clipper used keys use god christian rights person security private enforcement fact
Topic #9: just don people like think know time say god good way make does did want right really going said things
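
Topic #7 above consists entirely of numeric tokens, and "000" shows up in Topic #6. If such tokens are not wanted, one possible fix is to restrict what CountVectorizer treats as a token. The sketch below assumes a custom token_pattern that keeps only alphabetic tokens of two or more letters; the pattern and the two sample strings are illustrative assumptions, not part of the official example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample strings mixing words and numbers.
docs = ["period 10 55 11 score 15", "nasa launch 1993 mission 000"]

# Keep only purely alphabetic tokens with at least two letters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")
vectorizer.fit(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['launch', 'mission', 'nasa', 'period', 'score']
```

The same token_pattern argument could be added to the tf_vectorizer in the example; whether dropping numbers helps depends on the corpus.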