Working with text data in pandas - 《pandas中文社区公众号》

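Text data types
Differences between StringDtype and object dtype
String methods
- Note
Splitting and replacing strings
- split returns lists
- replace can use regular expressions:
  - Reversing every lowercase word
  - Using regex groups
Concatenating strings
Extracting substrings
- Extracting the first match in each subject (extract)
- The extract method accepts a regular expression with at least one capture group.
Extracting all matches in each subject (extractall)
- Unlike extract (which returns only the first match),
- the extractall method returns every match.
Testing for strings that match or contain a pattern
- Note
Creating indicator variables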

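Text data types

pandas can store text data in two ways:

object-dtype NumPy array
StringDtype extension type

We generally use StringDtype to store text data. In the pandas version used for this article, however, a list of strings is still inferred as object dtype unless a dtype is given:

import numpy as np
import pandas as pd

pd.Series(["a", "b", "c"])

0 a
1 b
2 c
dtype: object

To explicitly request the string dtype, specify the dtype argument:

pd.Series(["a", "b", "c"], dtype="string")

0 a
1 b
2 c
dtype: string

pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

0 a
1 b
2 c
dtype: string

Or use astype after the Series or DataFrame has been created: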

s = pd.Series(["a", "b", "c"])

print(s)

0 a
1 b
2 c
dtype: object

s.astype("string")

0 a
1 b
2 c
dtype: string

You can also pass StringDtype / "string" as the dtype for non-string data; the values are converted to the string dtype:

s = pd.Series(["a", 2, np.nan], dtype="string")

print(s)

0 a
1 2
2 <NA>
dtype: string

type(s[1])
str

You can also convert existing pandas data:

s1 = pd.Series([1, 2, np.nan], dtype="Int64")

print(s1)

0 1
1 2
2 <NA>
dtype: Int64

s2 = s1.astype("string")

print(s2)

0 1
1 2
2 <NA>
dtype: string

type(s2[0])
str
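
The conversions above can be collected into one short script. This is only a sketch: the variable names are chosen here for illustration, and it assumes a pandas version in which dtype="string" accepts non-string values, as the article describes.

```python
import numpy as np
import pandas as pd

# Three ways to end up with the nullable string dtype.
a = pd.Series(["a", "b", "c"], dtype="string")
b = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
c = pd.Series(["a", "b", "c"]).astype("string")

# Non-string values are converted to str; np.nan becomes a missing value.
mixed = pd.Series(["a", 2, np.nan], dtype="string")

# Nullable integers convert to strings as well, keeping the missing value.
from_ints = pd.Series([1, 2, np.nan], dtype="Int64").astype("string")

print(a.dtype, b.dtype, c.dtype)
print(type(mixed[1]), mixed.isna().tolist())
print(from_ints.isna().tolist())
```

Checking isna() instead of comparing against np.nan keeps the script independent of how the missing value is displayed.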

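Differences between StringDtype and object dtype

StringDtype objects behave differently from object-dtype objects in the following ways.

For StringDtype, string accessor methods that return numeric output always return a nullable integer dtype, rather than either an int or a float dtype depending on whether NA values are present.

Methods that return boolean output return a nullable boolean dtype.

s = pd.Series(["a", None, "b"], dtype="string")

print(s)

0 a
1 <NA>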
2 b
dtype: string

s.str.count("a")

0 1
1 <NA>
2 0
dtype: Int64

s.dropna().str.count("a")

0 1
2 0
dtype: Int64

Both outputs have Int64 dtype. Compare that with object dtype:

s2 = pd.Series(["a", None, "b"], dtype="object")

s2.str.count("a")

0 1.0
1 NaN
2 0.0
dtype: float64

s2.dropna().str.count("a")

0 1
2 0
dtype: int64

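When NA values are present, the output dtype is float64. The same applies to methods that return boolean values.

s.str.isdigit()

0 False
1 <NA>
2 False
dtype: boolean

s.str.match("a")

0 True
1 <NA>
2 False
dtype: boolean

Some string methods, such as Series.str.decode(), are not available on StringArray, because StringArray holds only strings, not bytes.

In comparison operations, arrays.StringArray and a Series backed by a StringArray return an object with BooleanDtype rather than a bool dtype object. Missing values in a StringArray propagate in comparison operations, rather than always comparing unequal as numpy.nan does.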
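
A small side-by-side sketch of these differences follows. The variable names are illustrative only; the dtypes in the comments are the ones the article itself reports.

```python
import pandas as pd

s_string = pd.Series(["a", None, "b"], dtype="string")
s_object = pd.Series(["a", None, "b"], dtype="object")

# Numeric results: nullable Int64 for StringDtype, float64 for object
# dtype once a missing value is present.
counts_string = s_string.str.count("a")
counts_object = s_object.str.count("a")

# Comparisons on StringDtype give a nullable boolean result in which
# the missing value propagates instead of comparing as unequal.
eq = s_string == "a"

print(counts_string.dtype, counts_object.dtype, eq.dtype)
print(eq.isna().tolist())
```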

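String methods

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

s.str.lower()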

0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8 cat
dtype: string

s.str.upper()

0 A
1 B
2 C
3 AABA
4 BACA
5 <NA>
6 CABA
7 DOG
8 CAT
dtype: string

s.str.len()

0 1
1 1
2 1
3 4
4 4
5 <NA>
6 4
7 3
8 3
dtype: Int64

idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

idx.str.strip()
Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

idx.str.lstrip()
Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

idx.str.rstrip()
Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For example, you may have columns with leading or trailing whitespace:

df = pd.DataFrame(
    np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
)

print(df)

Column A Column B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215

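Since df.columns is an Index object, we can use the .str accessor:

df.columns.str.strip()
Index(['Column A', 'Column B'], dtype='object')

df.columns.str.lower()
Index([' column a ', ' column b '], dtype='object')

These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(df)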

column_a column_b
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215

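Note

If a Series contains many repeated elements, it can be faster to convert the original Series to type category and then use .str.<method> or .dt.<property> on it. The difference in speed comes from the fact that, for a Series of type category, the string operations are done on the categories and not on each element of the Series.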
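
The note above can be tried out with a sketch like the following. The sample data is made up for illustration, and the snippet only checks that both routes give the same values; any actual speed-up depends on the data and the pandas version.

```python
import pandas as pd

# A Series with only three distinct values repeated many times.
s = pd.Series(["apple", "banana", "apple", "cherry"] * 1000)

# Converting to category stores each distinct string once.
cat = s.astype("category")

upper_plain = s.str.upper()   # works element by element
upper_cat = cat.str.upper()   # works on the three categories

print(len(cat.cat.categories))
print(upper_plain.tolist() == upper_cat.tolist())
```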

Splitting and replacing strings

split returns lists

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]:
0 [a, b, c]
1 [c, d, e]
2 <NA>
3 [f, g, h]
dtype: object

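Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]:
0 b
1 d
2 <NA>
3 g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]:
0 b
1 d
2 <NA>
3 g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [42]: s2.str.split("_", expand=True)
Out[42]:
0 1 2
0 a b c
1 c d e
2 <NA> <NA> <NA>
3 f g h

When the original Series has StringDtype, the output columns are all StringDtype as well.

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)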
Out[43]:
0 1
0 a b_c
1 c d_e
2 <NA> <NA>
3 f g_h

rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning of the string:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]:
0 1
0 a_b c
1 c_d e
2 <NA> <NA>
3 f_g h
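
To see split and rsplit next to each other, here is a minimal sketch. The sample strings and variable names are invented for illustration.

```python
import pandas as pd

paths = pd.Series(["usr_local_bin", "var_log", None], dtype="string")

# n=1 limits the work to a single split; split counts from the left,
# rsplit counts from the right.
first_rest = paths.str.split("_", n=1, expand=True)
rest_last = paths.str.rsplit("_", n=1, expand=True)

# Without expand, each element is a list that .str[...] can index.
second = paths.str.split("_").str[1]

print(first_rest)
print(rest_last)
print(second)
```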

replace can use regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....:

In [46]: s3
Out[46]:
0 A
1 B
2 C
3 Aaba
4 Baca
5
6 <NA>
7 CABA
8 dog
9 cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]:
0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6 <NA>
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: string

If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False rather than escaping each character. In this case both pat and repl must be strings:

In [48]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

These two lines are equivalent:

In [49]: dollars.str.replace(r"-\$", "-", regex=True)
Out[49]:
0 12
1 -10
2 $10,000
dtype: string

In [50]: dollars.str.replace("-$", "-", regex=False)
Out[50]:
0 12
1 -10
2 $10,000
dtype: string
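
The equivalence can be checked directly. In the sketch below the third variant, which builds the escaped pattern with re.escape, is an addition for illustration and is not part of the original example.

```python
import re

import pandas as pd

dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# As a regex, "$" must be escaped, otherwise it means "end of string".
as_regex = dollars.str.replace(r"-\$", "-", regex=True)

# With regex=False the pattern is taken literally; no escaping needed.
as_literal = dollars.str.replace("-$", "-", regex=False)

# re.escape produces the escaped pattern from the literal text.
escaped = dollars.str.replace(re.escape("-$"), "-", regex=True)

print(as_regex.tolist())
print(as_literal.tolist())
print(escaped.tolist())
```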

The replace method can also take a callable as the replacement. It is called on every match of pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.

Reversing every lowercase word

In [51]: pat = r"[a-z]+"

In [52]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....:

In [53]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:

Out[53]:
0 oof 123
1 rab zab
2 <NA>
dtype: string

Using regex groups

In [54]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [55]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....:

In [56]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:
Out[56]:
0 bAR
1 <NA>
dtype: string

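The replace method also accepts a compiled regular expression object from re.compile() as the pattern. All flags should be included in the compiled regular expression object.

In [57]: import re

In [58]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [59]: s3.str.replace(regex_pat, "XX-XX ", regex=True)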
Out[59]:
0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6 <NA>
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object raises a ValueError.

In [60]: s3.str.replace(regex_pat, "XX-XX ", flags=re.IGNORECASE, regex=True)

ValueError: case and flags cannot be set when pat is a compiled regex
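
The callable and compiled-pattern forms can be combined into one runnable sketch. The helper shout and the sample data are hypothetical names introduced here; only replace, re.compile and the ValueError come from the text above.

```python
import re

import pandas as pd

s = pd.Series(["Aaba", "dog", "CABA", None], dtype="string")

# Flags belong inside the compiled pattern.
pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
replaced = s.str.replace(pat, "XX-XX ", regex=True)


def shout(m):
    """Hypothetical callable: upper-case whatever the pattern matched."""
    return m.group(0).upper()


shouted = s.str.replace(r"[a-z]+", shout, regex=True)

# Passing flags again next to a compiled pattern is rejected.
error = None
try:
    s.str.replace(pat, "XX-XX ", flags=re.IGNORECASE, regex=True)
except ValueError as exc:
    error = str(exc)

print(replaced.tolist())
print(shouted.tolist())
print(error)
```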

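Concatenating strings

There are several ways to concatenate a Series or Index, either with itself or with others, all based on cat(), i.e. Series.str.cat and Index.str.cat.

Concatenating a single Series into a string

The content of a Series (or Index) can be concatenated:

In [61]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [62]: s.str.cat(sep=",")
Out[62]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [63]: s.str.cat()
Out[63]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [64]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [65]: t.str.cat(sep=",")
Out[65]: 'a,b,d'

In [66]: t.str.cat(sep=",", na_rep="-")
Out[66]: 'a,b,-,d'

Concatenating a Series and something list-like into a Series

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [67]: s.str.cat(["A", "B", "C", "D"])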
Out[67]:
0 aA
1 bB
2 cC
3 dD
dtype: string
Missing values on either side result in missing values in the result as well, unless na_rep is specified:

In [68]: s.str.cat(t)
Out[68]:
0 aa
1 bb
2 <NA>
3 dd
dtype: string

In [69]: s.str.cat(t, na_rep="-")
Out[69]:
0 aa
1 bb
2 c-
3 dd
dtype: string

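Concatenating a Series and something array-like into a Series

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).

In [70]: d = pd.concat([t, s], axis=1)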

In [71]: s
Out[71]:
0 a
1 b
2 c
3 d
dtype: string

In [72]: d
Out[72]:
0 1
0 a a
1 b b
2 <NA> c
3 d d

In [73]: s.str.cat(d, na_rep="-")
Out[73]:
0 aaa
1 bbb
2 c-c
3 ddd
dtype: string

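Concatenating a Series and an indexed object into a Series, with alignment

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.

In [74]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")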

In [75]: s
Out[75]:
0 a
1 b
2 c
3 d
dtype: string

In [76]: u
Out[76]:
1 b
3 d
0 a
2 c
dtype: string

In [77]: s.str.cat(u)
Out[77]:
0 aa
1 bb
2 cc
3 dd
dtype: string

In [78]: s.str.cat(u, join="left")
Out[78]:
0 aa
1 bb
2 cc
3 dd
dtype: string

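The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths no longer need to coincide.

In [79]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")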

In [80]: s
Out[80]:
0 a
1 b
2 c
3 d
dtype: string

In [81]: v
Out[81]:
-1 z
0 a
1 b
3 d
4 e
dtype: string

In [82]: s.str.cat(v, join="left", na_rep="-")
Out[82]:
0 aa
1 bb
2 c-
3 dd
dtype: string

In [83]: s.str.cat(v, join="outer", na_rep="-")
Out[83]:
-1 -z
0 aa
1 bb
2 c-
3 dd
4 -e
dtype: string

The same alignment can be used when others is a DataFrame:

In [84]: f = d.loc[[3, 2, 1, 0], :]

In [85]: s
Out[85]:
0 a
1 b
2 c
3 d
dtype: string

In [86]: f
Out[86]:
0 1
3 d d
2 <NA> c
1 b b
0 a a

In [87]: s.str.cat(f, join="left", na_rep="-")
Out[87]:
0 aaa
1 bbb
2 c-c
3 ddd
dtype: string

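Concatenating a Series and many objects into a Series

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict views, etc.).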

In [88]: s
Out[88]:
0 a
1 b
2 c
3 d
dtype: string

In [89]: u
Out[89]:
1 b
3 d
0 a
2 c
dtype: string

In [90]: s.str.cat([u, u.to_numpy()], join="left")
Out[90]:
0 aab
1 bbd
2 cca
3 ddc
dtype: string

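All elements without an index (e.g. np.ndarray) within the passed list-like must match the length of the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):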

In [91]: v
Out[91]:
-1 z
0 a
1 b
3 d
4 e
dtype: string

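In [92]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[92]:
-1 -z--
0 aaab
1 bbbd
2 c-ca
3 dddc
4 -e--
dtype: string

If join='right' is used on a list-like of others that contains different indexes, the union of these indexes is used as the basis for the final concatenation: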

In [93]: u.loc[[3]]
Out[93]:
3 d
dtype: string

In [94]: v.loc[[-1, 0]]
Out[94]:
-1 z
0 a
dtype: string

In [95]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[95]:
-1 --z
0 a-a
3 dd-
dtype: string
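
The difference between index alignment and positional concatenation is easy to miss, so here is a compact sketch that reuses the s and u values from the examples above. The result names are chosen for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

# A Series is aligned on its index labels before concatenation.
aligned = s.str.cat(u, join="left")

# A NumPy array has no index, so it is combined position by position.
positional = s.str.cat(u.to_numpy())

# Without others, the whole Series collapses into one string.
joined = s.str.cat(sep=",")

print(aligned.tolist())
print(positional.tolist())
print(joined)
```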

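Indexing with .str

You can use [] notation to index directly by position. If you index past the end of the string, the result is a missing value (NaN, displayed as <NA> for StringDtype).

In [96]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....: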

In [97]: s.str[0]
Out[97]:
0 A
1 B
2 C
3 A
4 B
5 <NA>
6 C
7 d
8 c
dtype: string

In [98]: s.str[1]
Out[98]:
0 <NA>
1 <NA>
2 <NA>
3 a
4 a
5 <NA>
6 A
7 o
8 a
dtype: string
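
Positional indexing also accepts slices. The following sketch uses made-up data and shows single positions next to a slice; note that a slice shorter than the string simply returns what is there, while a single position past the end gives a missing value.

```python
import pandas as pd

s = pd.Series(["Aaba", "B", None, "dog"], dtype="string")

first = s.str[0]    # first character of each string
second = s.str[1]   # missing where the string has only one character
head = s.str[:2]    # slices never run "past the end"

print(first.tolist())
print(second.isna().tolist())
print(head.tolist())
```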

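Extracting substrings

Extracting the first match in each subject (extract)

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.

In [99]: pd.Series(
   ....:     ["a1", "b2", "c3"],
   ....:     dtype="string",
   ....: ).str.extract(r"([ab])(\d)", expand=False)
   ....:
Out[99]:
0 1
0 a 1
1 b 2
2 <NA> <NA>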

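Elements that do not match return a row filled with missing values. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without needing get() to access tuples or re.match objects. For object-dtype input, the dtype of the result is always object, even if no match is found and the result only contains NaN; with StringDtype input, as in the examples here, the result keeps the string dtype.

Named groups like

In [100]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....:
Out[100]:
letter digit
0 a 1
1 b 2
2 <NA> <NA>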

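and optional groups like

In [101]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....:
Out[101]:
0 1
0 a 1
1 b 2
2 <NA> 3

can also be used. Note that any capture group names in the regular expression are used for the column names; otherwise the capture group numbers are used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.

In [102]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)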
Out[102]:
0
0 1
1 2
2 <NA>

It returns a Series if expand=False.

In [103]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[103]:
0 1
1 2
2 <NA>
dtype: string

Calling on an Index with a regular expression that has exactly one capture group returns a DataFrame with one column if expand=True.

In [104]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [105]: s
Out[105]:
A11 a1
B22 b2
C33 c3
dtype: string

In [106]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[106]:
letter
0 A
1 B
2 C

It returns an Index if expand=False.

In [107]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[107]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling on an Index with a regular expression that has more than one capture group returns a DataFrame if expand=True.

In [108]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[108]:
letter 1
0 A 11
1 B 22
2 C 33

It raises ValueError if expand=False.
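
As a recap of the extract variants on a Series, the sketch below runs a two-group named pattern and a one-group pattern on the same data. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# Two named groups: a DataFrame whose columns take the group names.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# One group with expand=False: a Series instead of a DataFrame.
single = s.str.extract(r"[ab](\d)", expand=False)

print(named.columns.tolist())
print(named["letter"].isna().tolist())
print(single.dropna().tolist())
```

Rows that do not match ("c3" here) stay in the result as missing values, so the output keeps the index of the input.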

Extracting all matches in each subject (extractall)

Unlike extract (which returns only the first match),

In [109]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [110]: s
Out[110]:
A a1a2
B b1
C c1
dtype: string

In [111]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [112]: s.str.extract(two_groups, expand=True)
Out[112]:
letter digit
A a 1
B b 1
C c 1

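the extractall method returns every match.

The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order of the match within the subject.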

In [113]: s.str.extractall(two_groups)
Out[113]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1

When each subject string in the Series has exactly one match,

In [114]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [115]: s
Out[115]:
0 a3
1 b3
2 c2
dtype: string

then extractall(pat).xs(0, level="match") gives the same result as extract(pat).

In [116]: extract_result = s.str.extract(two_groups, expand=True)

In [117]: extract_result
Out[117]:
letter digit
0 a 3
1 b 3
2 c 2

In [118]: extractall_result = s.str.extractall(two_groups)

In [119]: extractall_result
Out[119]:
letter digit
match
0 0 a 3
1 0 b 3
2 0 c 2

In [120]: extractall_result.xs(0, level="match")
Out[120]:
letter digit
0 a 3
1 b 3
2 c 2

Index also supports .str.extractall. It returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).

In [121]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[121]:
letter digit
match
0 0 a 1
1 a 2
1 0 b 1
2 0 c 1

In [122]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[122]:
letter digit
match
0 0 a 1
1 a 2
1 0 b 1
2 0 c 1
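
Because extractall returns one row per match, the MultiIndex is what links rows back to their subject. This sketch counts matches per subject and selects the first match of each; the groupby step is an illustration added here, not part of the original text.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

first_only = s.str.extract(two_groups)   # one row per subject
every = s.str.extractall(two_groups)     # one row per match

# Level 0 is the original index, the last level is named "match".
per_subject = every.groupby(level=0).size()
first_matches = every.xs(0, level="match")

print(list(every.index.names))
print(per_subject.to_dict())
print(first_matches["digit"].tolist())
```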

Testing for strings that match or contain a pattern

You can check whether elements contain a pattern:

In [123]: pattern = r"[0-9][a-z]"

In [124]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
…..:
Out[124]:
0 False
1 False
2 True
3 True
4 True
5 True
dtype: boolean

Or whether elements match a pattern:

In [125]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
…..:
Out[125]:
0 False
1 False
2 True
3 True
4 False
5 True
dtype: boolean

In [126]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
…..:
Out[126]:
0 False
1 False
2 True
3 True
4 False
5 False
dtype: boolean

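Note

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.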
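
The correspondence with the re module can be verified with a short sketch. The table layout and variable names are chosen here for illustration.

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "3a", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# The three pandas methods, from least to most strict.
via_pandas = pd.DataFrame(
    {
        "contains": s.str.contains(pattern),
        "match": s.str.match(pattern),
        "fullmatch": s.str.fullmatch(pattern),
    }
)

# The same checks written with the re functions named in the note.
via_re = pd.DataFrame(
    {
        "contains": [bool(re.search(pattern, v)) for v in values],
        "match": [bool(re.match(pattern, v)) for v in values],
        "fullmatch": [bool(re.fullmatch(pattern, v)) for v in values],
    }
)

print(via_pandas)
print(via_pandas.astype(bool).to_numpy().tolist() == via_re.to_numpy().tolist())
```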

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so that missing values can be considered True or False:

In [127]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....:

In [128]: s4.str.contains("A", na=False)
Out[128]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: boolean

Creating indicator variables

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [129]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [130]: s.str.get_dummies(sep="|")
Out[130]:
a b c
0 1 0 0
1 1 1 0
2 0 0 0
3 1 0 1

A string Index also supports get_dummies, which returns a MultiIndex.

In [131]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [132]: idx.str.get_dummies(sep="|")
Out[132]:
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])
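
A common follow-up is to put the indicator columns next to the original strings. The sketch below does this with pd.concat; the column name raw is a hypothetical choice made here.

```python
import pandas as pd

s = pd.Series(["a", "a|b", None, "a|c"], dtype="string")

# One 0/1 column per distinct token; the missing row is all zeros.
dummies = s.str.get_dummies(sep="|")

# Keep the raw strings alongside their indicator columns.
with_flags = pd.concat([s.rename("raw"), dummies], axis=1)

print(dummies)
print(with_flags.columns.tolist())
```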