Basics - chardet, the character set detection module - Python Study Notes


Chardet: The Universal Character Encoding Detector

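Python version: requires Python 2.6, 2.7, or 3.3+ (for the chardet release these notes were written against; newer releases support Python 3 only)
Documentation: https://chardet.readthedocs.io/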

Encodings it can detect:

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Basic usage

chardet.detect(byte_str)

The byte_str argument must be a byte string (bytes or bytearray); otherwise the following error is raised:

TypeError: Expected object of type bytes or bytearray, got:

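Python has two kinds of strings: one stores text and the other stores bytes. Text is stored internally as Unicode, while a byte string holds a raw sequence of bytes (shown as ASCII characters where the bytes are printable).
In Python 3, the text string type (Unicode data) is named str, and the byte string type is named bytes.
A string literal normally produces a str object. To get bytes, either put the prefix b in front of the literal or call encode on the text. Accordingly, str objects have an encode method and bytes objects have a decode method.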
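
The str/bytes distinction described above can be sketched with built-ins only. The sample text and the choice of UTF-8 are arbitrary and just for illustration:

```python
# Text string (str) versus byte string (bytes) in Python 3.
text = '编码'                   # str: a sequence of Unicode characters
data = text.encode('utf-8')     # bytes: the UTF-8 encoded form of the text

print(type(text), len(text))    # length counts characters
print(type(data), len(data))    # length counts bytes

# decode() reverses encode() when the same codec is used.
assert data.decode('utf-8') == text

# The b prefix creates bytes directly, but only for ASCII literals.
ascii_bytes = b'abc'
assert ascii_bytes == 'abc'.encode('ascii')
```

Only the bytes form (data or ascii_bytes here) is a valid argument for chardet.detect.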

import chardet
s = '编码'.encode()
print(chardet.detect(s))

Example: detecting the encoding of a web page

import requests
import chardet
urls = ['https://www.jb51.net', 'https://www.baidu.com/']
for url in urls:
    r = requests.get(url)
    print(url, chardet.detect(r.content))
# output:
https://www.jb51.net {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
https://www.baidu.com/ {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Example: detecting the encoding of a text file

import chardet
with open('strcoding.py','rb') as f:
    print(chardet.detect(f.read()))
# output:
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
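
The result dict is usually just a step toward decoding the bytes. Below is a minimal sketch of that step. The helper name decode_bytes, the min_confidence threshold of 0.5, and the UTF-8 fallback are all assumptions made for illustration and are not part of chardet. The detect parameter exists only so that the function can be exercised without chardet installed.

```python
def decode_bytes(data, detect=None, min_confidence=0.5, fallback='utf-8'):
    """Decode bytes using a detected encoding (hypothetical helper).

    detect: a callable that returns a dict shaped like the result of
            chardet.detect; defaults to chardet.detect, imported lazily.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect
    result = detect(data)
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # The encoding can be None when the detector cannot decide,
    # so fall back to a default in that case or on low confidence.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors='replace' avoids an exception if the guess turns out wrong.
    return data.decode(encoding, errors='replace')
```

With chardet installed, the file from the example above could be read with open('strcoding.py', 'rb') and its contents passed to decode_bytes to get a str back.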

Incremental encoding detection

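For short web pages or text, the approach above works fine. But what if the text is measured in gigabytes? How can its character set be determined quickly? chardet offers an incremental detection mode for this. The comparison below does not use gigabyte-scale data; the 11 MB text of the novel 伏天氏 is enough to show the difference: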

# Original method
import chardet
import time
t0 = time.process_time()
with open("伏天氏.txt",'rb') as f:
    print(chardet.detect(f.read()))
t1 = time.process_time()
print(t1-t0)
# output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
105.3786755

# Incremental detection method:
import time
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
t0 = time.process_time()
with open("伏天氏.txt", 'rb') as f:  # close the file even on early exit
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)
t1 = time.process_time()
print(t1 - t0)
# output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
45.1466894

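As the results show, the original method reads the entire file and runs detection over all of it before returning a result. With UniversalDetector, the file is fed in line by line, and as soon as the detector is confident about the encoding it stops reading and returns the result. In this test that cut the measured time from about 105 seconds to about 45 seconds.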
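
The incremental approach can also be wrapped in a reusable function that feeds fixed-size chunks instead of lines, which avoids depending on where line breaks fall. This is a sketch under stated assumptions: the name detect_file_encoding, the 64 KB chunk size, and the optional max_bytes cap are hypothetical choices, not something chardet prescribes. Only UniversalDetector, feed, done, close, and result come from chardet itself.

```python
def detect_file_encoding(path, detector=None, chunk_size=64 * 1024, max_bytes=None):
    """Feed a file to a detector chunk by chunk (hypothetical helper).

    detector: an object with feed(), done, close() and result, like
              chardet's UniversalDetector (created lazily by default).
    max_bytes: optional cap on how much of the file is examined.
    """
    if detector is None:
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    bytes_read = 0
    with open(path, 'rb') as f:
        # Stop as soon as the detector reports it is confident.
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
            bytes_read += len(chunk)
            if max_bytes is not None and bytes_read >= max_bytes:
                break
    detector.close()  # finalizes detector.result
    return detector.result
```

Setting max_bytes trades accuracy for speed: a file whose first part is plain ASCII may be misjudged if the non-ASCII content only appears later, so the cap is best left off unless the files are known to be uniform.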