Text is a form of unstructured data. According to Wikipedia, unstructured data is described as "information that either does not have a pre-defined data model or is not organized in a pre-defined manner."
Unfortunately, computers are not like humans; a machine cannot read raw text the way we can. When working with text data, you cannot go straight from raw text to a machine learning model. Instead, you have to follow a process: first clean the text, then encode it into a machine-readable format.
Let's look at some of the ways text can be cleaned.

Normalization

When we write, we capitalize different words in a sentence or paragraph for different reasons. For example, we capitalize the first word of a new sentence, or we capitalize a noun to show that we are talking about a place, a person, and so on.
As humans, we can read the text and see intuitively that "The" marks the start of a sentence, while "the" later in the same sentence is the same word. A computer cannot: "The" and "the" are treated as two different words.
It is therefore important to normalize the case of the words so that every word is in the same case and the computer does not treat the same word as two different tokens.

  # Python example
  text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"
  # Lowercase the text
  text = text.lower()
  print(text)
  >>>> the uk lockdown restrictions will be dropped in the summer so we can go partying again!

Removing Stop Words

In most natural language tasks, we want the machine learning model to pick up on the words in a document that actually add value. In a sentiment analysis task, for example, we want to find the word (or words) that point to the sentiment of the text.
In English (and, I believe, in most other languages), some words are used far more frequently than others, but they do not necessarily add more value, so it is usually safe to say we can ignore them and remove them from the text.
Note: removing stop words is not always the best idea! For example, "not" is a stop word, and dropping it can flip the meaning of a sentence in a sentiment analysis task.

  # Import the libraries
  import nltk
  from nltk.corpus import stopwords
  nltk.download("stopwords")
  stop_words = set(stopwords.words("english"))
  print(stop_words)
  >>>> {'over', 'is', 'than', 'can', 'these', "isn't", 'so', 'my', 'each', 'an', 'between', 'through', 'up', 'where', 'hadn', 'very', "you'll", 'while', "weren't", 'too', 'doesn', 'only', 'needn', 'has', 'just', 'd', 'some', 'into', 've', 'didn', 'further', 'why', 'mightn', 'and', 'haven', 'own', "mightn't", 'during', 'both', 'me', 'shan', "doesn't", 'theirs', 'herself', 'the', 'few', 'our', 'its', 'yourself', 'under', 'at', "you've", 're', 'themselves', 'y', 'ma', 'because', 'him', 'above', 'such', 'we', "wouldn't", 'of', 'from', 'hers', 'nor', "shouldn't", 'a', 'hasn', 'them', 'myself', 'this', 'being', 'your', 'those', 'i', 'if', 'couldn', 'not', 'will', 'it', 'm', 'to', 'isn', 'aren', 'when', 'o', 'about', 'their', 'more', 'been', "needn't", 'had', 'll', 'most', 'against', 'once', 'how', "didn't", "shan't", 'there', 'all', "should've", 'he', "don't", 'she', 'which', 'below', 'on', 'no', 'yourselves', "wasn't", 'shouldn', 'by', 'be', 'have', 'does', "aren't", 'itself', 'same', 'should', 'in', 'before', 'am', "won't", 'having', "you'd", 'mustn', 'for', "that'll", 'that', "couldn't", 'wasn', 'won', "hasn't", 'as', 'until', 'wouldn', "mustn't", 'his', 'ain', "you're", 'out', "she's", 'other', 'are', 't', 'you', 'off', 'yours', 'ourselves', 'himself', 'down', "haven't", 'ours', 'now', "hadn't", 'do', 's', 'her', 'with', "it's", 'then', 'weren', 'any', 'after', 'whom', 'what', 'who', 'but', 'again', 'here', 'did', 'doing', 'were', 'they', 'was', 'or', 'don'}
  # Example text (already lowercased, as in the previous step)
  text = "the uk lockdown restrictions will be dropped in the summer so we can go partying again!"
  # Remove the stop words
  text = " ".join([word for word in text.split() if word not in stop_words])
  print(text)
  >>>> uk lockdown restrictions dropped summer go partying again!

Unicode

Emojis and other non-ASCII characters should be handled as Unicode.
Essentially, Unicode is a universal character encoding standard in which every character and symbol in every language is assigned a code. Unicode is necessary because it allows data to be retrieved or combined across different languages.
Note: the sample code below comes from a Python guide.

  # Create a Unicode string
  text_unicode = "Python is easy \u200c to learn"
  # Encode the text to ASCII, dropping any character that cannot be encoded
  text_encode = text_unicode.encode(encoding="ascii", errors="ignore")
  # Decode the text back to a string
  text_decode = text_encode.decode()
  # Clean up the extra whitespace left behind
  clean_text = " ".join([word for word in text_decode.split()])
  print(clean_text)
  >>>> Python is easy to learn
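
To illustrate the point about emojis (this is my own example, not from the original guide), the same encode/decode trick also drops emojis and any other non-ASCII symbols, since they cannot be represented in ASCII:

  # Hypothetical example: the emoji is a non-ASCII character, so encoding to
  # ASCII with errors="ignore" simply removes it
  text_emoji = "Python is easy to learn 🐍"
  clean_emoji = text_emoji.encode(encoding="ascii", errors="ignore").decode()
  print(clean_emoji.strip())
  >>>> Python is easy to learn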

Removing URLs, Hashtags, Punctuation, Mentions, etc.

Depending on the type of data you are working with, you may face different challenges. For example, if you are dealing with data from Twitter, you will come across all sorts of hashtags and mentions (in Twitter jargon, a mention is a tweet that contains another user's username).
If these features add no value to the problem you are trying to solve, it is best to remove them from the data. However, since in many cases we cannot rely on a fixed set of characters to look for, we can lean on the power of a pattern-matching tool called regex (regular expressions) to help us.

  import re
  # Remove mentions
  text = "You should get @BlockFiZac from @BlockFi to talk about bitcoin lending, stablecoins, institution adoption, and the future of crypto"
  text = re.sub(r"@\S+", "", text)
  print(text)
  >>>> You should get from to talk about bitcoin lending, stablecoins, institution adoption, and the future of crypto
  ------------------------------------------------------------------
  # Remove dollar signs
  text = """#BITCOIN LOVES MARCH 13th A year ago the price of Bitcoin collapsed to $3,800 one of the lowest levels in the last 4 years. Today, exactly one year later it reaches the new all-time high of $60,000 Thank you Bitcoin for always making my birthday exciting"""
  text = re.sub(r"\$", "", text)
  print(text)
  >>>> #BITCOIN LOVES MARCH 13th A year ago the price of Bitcoin collapsed to 3,800 one of the lowest levels in the last 4 years. Today, exactly one year later it reaches the new all-time high of 60,000 Thank you Bitcoin for always making my birthday exciting
  ------------------------------------------------------------------
  # Remove URLs
  text = "Did someone just say “Feature Engineering”? https://buff.ly/3rRzL0s"
  text = re.sub(r"https?:\/\/.*[\r\n]*", "", text)
  print(text)
  >>>> Did someone just say “Feature Engineering”?
  ------------------------------------------------------------------
  # Remove hashtags
  text = """.#FreedomofExpression which includes #FreedomToProtest should be the cornerstone of any democracy. I’m looking forward to speaking in the 2 day debate on the #PoliceCrackdownBill & explaining why I will be voting against it."""
  text = re.sub(r"#", "", text)
  print(text)
  >>>> .FreedomofExpression which includes FreedomToProtest should be the cornerstone of any democracy. I’m looking forward to speaking in the 2 day debate on the PoliceCrackdownBill & explaining why I will be voting against it.
  ------------------------------------------------------------------
  # Remove punctuation
  import string
  text = "Thank you! Not making sense. Just adding, lots of random punctuation."
  punct = set(string.punctuation)
  text = "".join([ch for ch in text if ch not in punct])
  print(text)
  >>>> Thank you Not making sense Just adding lots of random punctuation

Stemming & Lemmatization

When working on an NLP task, we probably want the computer to understand that "walked", "walk", and "walking" are just different tenses of the same word; otherwise, each of them will be treated as a different word.
Stemming and lemmatization are both techniques used in NLP to normalize text. To simplify the definition even further, both reduce a word down to its core root.
According to Wikipedia:

  • Stemming: in linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
  • Lemmatization: lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item.

Although the definitions of the two techniques sound very similar, the ways they reduce words are quite different, which means their outputs do not always agree.

  import nltk
  from nltk.stem.porter import PorterStemmer
  from nltk.stem import WordNetLemmatizer
  nltk.download("wordnet")  # required by the WordNet lemmatizer
  words = ["walk", "walking", "walked", "walks", "ran", "run", "running", "runs"]
  -----------------------------------------------------------------
  # Stemming
  stemmer = PorterStemmer()
  for word in words:
      print(word + " ---> " + stemmer.stem(word))
  >>>> walk ---> walk
  walking ---> walk
  walked ---> walk
  walks ---> walk
  ran ---> ran
  run ---> run
  running ---> run
  runs ---> run
  ------------------------------------------------------------------
  # Lemmatization
  lemmatizer = WordNetLemmatizer()
  for word in words:
      print(word + " ---> " + lemmatizer.lemmatize(word))
  >>>> walk ---> walk
  walking ---> walking
  walked ---> walked
  walks ---> walk
  ran ---> ran
  run ---> run
  running ---> running
  runs ---> run
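
To tie everything together, here is a minimal sketch of a cleaning pipeline that chains the techniques covered above: lowercasing, removing mentions, URLs, and hashtag symbols, stripping punctuation, dropping stop words, and lemmatizing whatever is left. The clean_text helper, the simplified URL pattern, and the exact ordering of the steps are my own choices rather than part of the original examples, and the sketch assumes the NLTK resources are downloaded as shown earlier.

  import re
  import string
  import nltk
  from nltk.corpus import stopwords
  from nltk.stem import WordNetLemmatizer
  nltk.download("stopwords")
  nltk.download("wordnet")
  stop_words = set(stopwords.words("english"))
  lemmatizer = WordNetLemmatizer()
  punct = set(string.punctuation)
  # Hypothetical helper that chains the cleaning steps from this article
  def clean_text(text):
      # Normalize the case
      text = text.lower()
      # Remove mentions, URLs, and hashtag symbols (simplified URL pattern)
      text = re.sub(r"@\S+", "", text)
      text = re.sub(r"https?:\/\/\S+", "", text)
      text = re.sub(r"#", "", text)
      # Remove punctuation
      text = "".join([ch for ch in text if ch not in punct])
      # Drop stop words and lemmatize what remains
      words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
      return " ".join(words)
  print(clean_text("The UK lockdown restrictions will be dropped in the summer so we can go partying again!"))
  >>>> uk lockdown restriction dropped summer go partying

One thing worth noting about the order of the steps: mentions and URLs are matched before punctuation is stripped, because removing "@", ":" and "/" first would leave nothing for those regex patterns to match.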