Text is a form of unstructured data. According to Wikipedia, unstructured data is described as "information that either does not have a pre-defined data model or is not organised in a pre-defined manner."
Unfortunately, computers are not like humans; a machine cannot read raw text the way we can. When working with text data, we cannot go straight from raw text to a machine learning model. Instead, we have to follow a process: first clean the text, then encode it into a machine-readable format.
Let's go through some of the ways text can be cleaned.
Normalization
When we write, we capitalize different words in a sentence or paragraph for different reasons: for example, we capitalize the first word of a new sentence, or we capitalize the first letter of a noun to signal that we are talking about a place or a person, and so on.
As humans, we can read the text and intuitively tell that "The" at the start of a sentence and "the" later in the same sentence are the same word. A computer cannot: "The" and "the" are treated as two different words.
It is therefore important to normalize the case of the words so that every word is in the same case and the computer does not treat the same word as two different tokens.
# Example text
text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"
# Lowercase the text
text = text.lower()
print(text)
>>>> the uk lockdown restrictions will be dropped in the summer so we can go partying again!
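A small aside: str.lower() works well for English text, but Python's str.casefold() performs more aggressive case folding (for example, the German "ß" folds to "ss"), which is the safer choice when matching text across languages. A minimal sketch (the example sentence is invented for illustration):
# casefold() is a more aggressive version of lower(), designed for caseless matching
text = "Straße and STRASSE should match"
print(text.casefold())
>>>> strasse and strasse should match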
Removing Stop Words
In most natural language tasks, we want the machine learning model to identify the words in a document that actually add value to it. For example, in a sentiment analysis task, we want to find the word (or words) that point to the sentiment of the text.
In English (and I believe the same applies to most languages), some words are used far more often than others, yet they do not necessarily add more value, so it is safe to say that we can ignore them and remove them from the text; these are the stop words.
Note: removing stop words is not always the best idea!
# Import the libraries
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
print(stop_words)
>>>> {'over', 'is', 'than', 'can', 'these', "isn't", 'so', 'my', 'each', 'an', 'between', 'through', 'up', 'where', 'hadn', 'very', "you'll", 'while', "weren't", 'too', 'doesn', 'only', 'needn', 'has', 'just', 'd', 'some', 'into', 've', 'didn', 'further', 'why', 'mightn', 'and', 'haven', 'own', "mightn't", 'during', 'both', 'me', 'shan', "doesn't", 'theirs', 'herself', 'the', 'few', 'our', 'its', 'yourself', 'under', 'at', "you've", 're', 'themselves', 'y', 'ma', 'because', 'him', 'above', 'such', 'we', "wouldn't", 'of', 'from', 'hers', 'nor', "shouldn't", 'a', 'hasn', 'them', 'myself', 'this', 'being', 'your', 'those', 'i', 'if', 'couldn', 'not', 'will', 'it', 'm', 'to', 'isn', 'aren', 'when', 'o', 'about', 'their', 'more', 'been', "needn't", 'had', 'll', 'most', 'against', 'once', 'how', "didn't", "shan't", 'there', 'all', "should've", 'he', "don't", 'she', 'which', 'below', 'on', 'no', 'yourselves', "wasn't", 'shouldn', 'by', 'be', 'have', 'does', "aren't", 'itself', 'same', 'should', 'in', 'before', 'am', "won't", 'having', "you'd", 'mustn', 'for', "that'll", 'that', "couldn't", 'wasn', 'won', "hasn't", 'as', 'until', 'wouldn', "mustn't", 'his', 'ain', "you're", 'out', "she's", 'other', 'are', 't', 'you', 'off', 'yours', 'ourselves', 'himself', 'down', "haven't", 'ours', 'now', "hadn't", 'do', 's', 'her', 'with', "it's", 'then', 'weren', 'any', 'after', 'whom', 'what', 'who', 'but', 'again', 'here', 'did', 'doing', 'were', 'they', 'was', 'or', 'don'}
# Example text
text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"
# Remove the stop words (lowercasing first so capitalised stop words such as "The" are caught)
text = " ".join([word for word in text.lower().split() if word not in stop_words])
print(text)
>>>> uk lockdown restrictions dropped summer go partying again!
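To see why the note above matters: the NLTK stop word list includes negations such as "not" and "no", so blindly removing stop words can flip the meaning of a sentence, which is usually a problem for sentiment analysis. A quick sketch (the example sentence is made up for illustration):
# "not" is a stop word, so removing it reverses the sentiment of the sentence
text = "the movie was not good"
text = " ".join([word for word in text.split() if word not in stop_words])
print(text)
>>>> movie good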
Unicode
Emojis and other non-ASCII characters should be handled as Unicode.
Essentially, Unicode is a universal character-encoding standard in which every character and symbol in every language is assigned a code. Unicode is necessary because it allows data to be retrieved or combined across different languages.
Note: the sample code comes from a Python guide.
# Create a unicode string
text_unicode = "Python is easy \u200c to learn"
# Encode the text to ASCII format
text_encode = text_unicode.encode(encoding="ascii", errors="ignore")
# Decode the text
text_decode = text_encode.decode()
# Clean up the text to remove the extra whitespace
clean_text = " ".join([word for word in text_decode.split()])
print(clean_text)
>>>> Python is easy to learn
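One caveat: encoding to ASCII with errors="ignore" silently drops every non-ASCII character, including accented letters you may want to keep. A common workaround (sketched below, not part of the original example) is to normalize the string with unicodedata first, so the base letters survive and only the combining accent marks are lost:
import unicodedata
text = "café déjà vu"
# Without normalization, the accented letters disappear entirely
print(text.encode(encoding="ascii", errors="ignore").decode())
>>>> caf dj vu
# NFKD decomposition separates base letters from accents, so only the accents are dropped
print(unicodedata.normalize("NFKD", text).encode(encoding="ascii", errors="ignore").decode())
>>>> cafe deja vu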
Removing URLs, Hashtags, Punctuation, Mentions, etc.
Depending on the kind of data you are working with, you may face a variety of challenges. For example, if you are dealing with data from Twitter, you will come across all sorts of hashtags and mentions (in Twitter jargon, a mention is a tweet that contains another user's username).
If these features add no value to the problem you are trying to solve, it is best to remove them from the data. However, since in many cases we cannot rely on a fixed set of characters, we can lean on the power of a pattern-matching tool called regex (regular expressions) to help us.
import re
# Remove mentions
text = "You should get @BlockFiZac from @BlockFi to talk about bitcoin lending, stablecoins, institution adoption, and the future of crypto"
text = re.sub("@\S+", "", text)
print(text)
>>>> You should get from to talk about bitcoin lending, stablecoins, institution adoption, and the future of crypto
------------------------------------------------------------------
# Remove tickers ($ symbols)
text = """#BITCOIN LOVES MARCH 13th A year ago the price of Bitcoin collapsed to $3,800 one of the lowest levels in the last 4 years. Today, exactly one year later it reaches the new all-time high of $60,000 Thank you Bitcoin for always making my birthday exciting"""
text = re.sub("\$", "", text)
print(text)
>>>> #BITCOIN LOVES MARCH 13th A year ago the price of Bitcoin collapsed to 3,800 one of the lowest levels in the last 4 years. Today, exactly one year later it reaches the new all-time high of 60,000 Thank you Bitcoin for always making my birthday exciting
------------------------------------------------------------------
# Remove URLs
text = "Did someone just say “Feature Engineering”? https://buff.ly/3rRzL0s"
text = re.sub("https?:\/\/.*[\r\n]*", "", text)
print(text)
>>>> Did someone just say “Feature Engineering”?
------------------------------------------------------------------
# Remove hashtags
text = """.#FreedomofExpression which includes #FreedomToProtest should be the cornerstone of any democracy. I’m looking forward to speaking in the 2 day debate on the #PoliceCrackdownBill & explaining why I will be voting against it."""
text = re.sub("#", "", text)
print(text)
>>>> .FreedomofExpression which includes FreedomToProtest should be the cornerstone of any democracy. I’m looking forward to speaking in the 2 day debate on the PoliceCrackdownBill & explaining why I will be voting against it.
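Note that the pattern above only strips the "#" symbol and keeps the hashtag text. If the hashtag words themselves add no value to your problem, a pattern like r"#\S+" (analogous to the mentions example) removes the whole hashtag instead; the example sentence below is invented for illustration:
# Remove the entire hashtag, not just the "#" symbol
text = "Loving the weather today #sunny #spring"
text = re.sub(r"#\S+", "", text)
print(text)
>>>> Loving the weather today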
------------------------------------------------------------------
# Remove punctuation
import string
text = "Thank you! Not making sense. Just adding, lots of random punctuation."
punct = set(string.punctuation)
text = "".join([ch for ch in tweet if ch not in punct])
print(text)
>>>> Thank you Not making sense Just adding lots of random punctuation
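The same result can also be achieved with str.translate, which avoids the character-by-character Python loop and is the more idiomatic option for longer texts; a minimal sketch using the same example:
import string
# str.maketrans builds a table that deletes every punctuation character
text = "Thank you! Not making sense. Just adding, lots of random punctuation."
text = text.translate(str.maketrans("", "", string.punctuation))
print(text)
>>>> Thank you Not making sense Just adding lots of random punctuation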
词干提取&词形还原
When working on an NLP task, we may want the computer to understand that "walked", "walk", and "walking" are simply different tenses of the same word; otherwise they will be treated as different words.
Stemming and lemmatization are both techniques used in NLP to normalize text. To simplify the definition even further: both reduce a word down to its core root.
As defined on Wikipedia:
- Stemming: in linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base, or root form.
- Lemmatization: in linguistics, lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
Although the two definitions are very similar, the way each technique reduces a word is quite different, which means the two techniques do not always produce the same result.
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # the WordNet corpus is needed by WordNetLemmatizer
words = ["walk", "walking", "walked", "walks", "ran", "run", "running", "runs"]
-----------------------------------------------------------------
# Stemming
stemmer = PorterStemmer()
for word in words:
    print(word + " ---> " + stemmer.stem(word))
>>>> walk ---> walk
walking ---> walk
walked ---> walk
walks ---> walk
ran ---> ran
run ---> run
running ---> run
runs ---> run
------------------------------------------------------------------
# Lemmatization
lemmatizer = WordNetLemmatizer()
for word in words:
    print(word + " ---> " + lemmatizer.lemmatize(word))
>>>> walk ---> walk
walking ---> walking
walked ---> walked
walks ---> walk
ran ---> ran
run ---> run
running ---> running
runs ---> run
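The lemmatizer looks less effective here only because WordNetLemmatizer assumes every word is a noun by default. Passing the part of speech explicitly (pos="v" for verbs) reduces each form to its dictionary verb:
# Lemmatize the same words as verbs instead of the default (noun)
for word in words:
    print(word + " ---> " + lemmatizer.lemmatize(word, pos="v"))
>>>> walk ---> walk
walking ---> walk
walked ---> walk
walks ---> walk
ran ---> run
run ---> run
running ---> run
runs ---> run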