3. Text-processing commands and the big three (awk, sed, grep)

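1. The cut command

Purpose: cut removes sections from each line of files. In practice it prints only the selected parts of each line, which lets you process data column by column (field by field).

cut uses \t (tab) as its default delimiter.
-d: set the field delimiter (default: a tab).
-f: select the fields to print.
-f 1,3 prints the first and third fields; -f 1-3 prints the first through third fields.

cut is mainly used to split the data within a single line. It is most useful when analyzing tabular or text data: a particular character often serves as the separator, and cutting each line on that character extracts just the piece you need.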

head /etc/passwd
## root:x:0:0:root:/root:/bin/bash
## daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
## bin:x:2:2:bin:/bin:/usr/sbin/nologin
## sys:x:3:3:sys:/dev:/usr/sbin/nologin
## sync:x:4:65534:sync:/bin:/bin/sync
## games:x:5:60:games:/usr/games:/usr/sbin/nologin
## man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
## lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
## mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
## news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
head /etc/passwd |cut -d ':' -f 3
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
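
The -f 1,3 and -f 1-3 forms described above are not shown in the example, so here is a minimal sketch of both. The two sample lines are hypothetical data in /etc/passwd format, typed inline so the result does not depend on the machine's real file.

```shell
## Hypothetical sample data in /etc/passwd format (not read from the real file).
sample='root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync'

## Fields 1 and 3 (user name and UID); cut keeps the input delimiter on output.
names_uids=$(printf '%s\n' "$sample" | cut -d ':' -f 1,3)
printf '%s\n' "$names_uids"

## Fields 1 through 3, written as a range.
first_three=$(printf '%s\n' "$sample" | cut -d ':' -f 1-3)
printf '%s\n' "$first_three"

## An open-ended range: field 6 through the last field.
tail_fields=$(printf '%s\n' "$sample" | cut -d ':' -f 6-)
printf '%s\n' "$tail_fields"
```

Note that cut cannot reorder fields: -f 3,1 prints the same thing as -f 1,3.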

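2. The tr command

Purpose: translate or delete characters. tr reads standard input only, so file contents have to be piped or redirected into it.

Examples:

head /etc/passwd |tr ':' '\t'            ## replace each colon with a tab (\t)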
## root    x    0    0    root    /root    /bin/bash
## daemon    x    1    1    daemon    /usr/sbin    /usr/sbin/nologin
## bin    x    2    2    bin    /bin    /usr/sbin/nologin
## sys    x    3    3    sys    /dev    /usr/sbin/nologin
## sync    x    4    65534    sync    /bin    /bin/sync
## games    x    5    60    games    /usr/games    /usr/sbin/nologin
## man    x    6    12    man    /var/cache/man    /usr/sbin/nologin
## lp    x    7    7    lp    /var/spool/lpd    /usr/sbin/nologin
## mail    x    8    8    mail    /var/mail    /usr/sbin/nologin
## news    x    9    9    news    /var/spool/news    /usr/sbin/nologin
cat readme.txt | tr 'a-z' 'A-Z'          ## convert lowercase to uppercase
## WELCOME TO BIOTRAINEE() !
## THIS IS YOUR PERSONAL ACCOUNT IN OUR CLOUD.
## HAVE A FUN WITH IT.
## PLEASE FEEL FREE TO CONTACT WITH ME( EMAIL TO JMZENG1314@163.COM )
## (HTTP://WWW.BIOTRAINEE.COM/THREAD-1376-1-1.HTML)
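
The examples above only translate characters. As a small sketch of the "delete" half of tr, plus two related forms, the following uses made-up input strings; -d, -s and the [:lower:]/[:upper:] classes are standard tr features.

```shell
## -d deletes every character in the set: here, all digits.
no_digits=$(printf 'chr1:26-39\n' | tr -d '0-9')
printf '%s\n' "$no_digits"

## -s squeezes each run of a repeated character into a single one.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')
printf '%s\n' "$squeezed"

## Character classes are a locale-aware alternative to a-z / A-Z.
upper=$(printf 'bed\n' | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"
```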

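3. The sort command

Purpose: sort lines of text files. sort orders its input line by line, ascending by default, using the collation order of the current locale (plain byte/ASCII order when LC_ALL=C).

Options:

-k  sort by the given column (field).
-n  compare by numeric value.
-r  reverse the order.
-t <separator>  set the column (field) separator used for sorting.

Example: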

head /etc/passwd |tr ':' '\t'|sort -n -k 4
## root    x    0    0    root    /root    /bin/bash
## daemon    x    1    1    daemon    /usr/sbin    /usr/sbin/nologin
## bin    x    2    2    bin    /bin    /usr/sbin/nologin
## sys    x    3    3    sys    /dev    /usr/sbin/nologin
## lp    x    7    7    lp    /var/spool/lpd    /usr/sbin/nologin
## mail    x    8    8    mail    /var/mail    /usr/sbin/nologin
## news    x    9    9    news    /var/spool/news    /usr/sbin/nologin
## man    x    6    12    man    /var/cache/man    /usr/sbin/nologin
## games    x    5    60    games    /usr/games    /usr/sbin/nologin
## sync    x    4    65534    sync    /bin    /bin/sync
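
The example above relies on tr to turn colons into tabs first. As a sketch of the -t and -r options from the list, the following sorts colon-separated data directly. The four sample lines are hypothetical, and -k 4,4 (rather than -k 4) is used so the key stops at the end of field 4.

```shell
## Hypothetical colon-separated records; field 4 is a numeric group ID.
sample='sync:x:4:65534
root:x:0:0
games:x:5:60
lp:x:7:7'

## -t ':' sets the separator, -k 4,4 restricts the key to field 4,
## -n compares numerically and -r reverses the order (largest first).
by_gid_desc=$(printf '%s\n' "$sample" | sort -t ':' -k 4,4 -n -r)
printf '%s\n' "$by_gid_desc"

## Without -n the same key is compared as text, so 7 sorts after 65534.
by_gid_text=$(printf '%s\n' "$sample" | LC_ALL=C sort -t ':' -k 4,4)
printf '%s\n' "$by_gid_text"
```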

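4. The uniq command

Purpose: report or omit repeated lines. uniq detects and removes repeated lines in a text file, but only when they are adjacent, which is why it is usually combined with sort. The most common option is -c, which prefixes each line with the number of times it occurred.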

head /etc/passwd | tr ':' '\t' | cut -f 7 | sort |uniq -c
##       1 /bin/bash
##       1 /bin/sync
##       8 /usr/sbin/nologin
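
To see why uniq is normally paired with sort, here is a small sketch on hypothetical input in which the duplicates are not adjacent. The final awk step only normalizes the leading padding that uniq -c prints, so the result is easier to compare.

```shell
## Hypothetical list of login shells with non-adjacent duplicates.
shells='/bin/bash
/usr/sbin/nologin
/bin/bash
/usr/sbin/nologin
/usr/sbin/nologin'

## uniq alone only collapses the last two (adjacent) lines: 4 lines remain.
unsorted_groups=$(printf '%s\n' "$shells" | uniq | wc -l | tr -d ' ')
## After sort, identical lines are adjacent: 2 lines remain.
sorted_groups=$(printf '%s\n' "$shells" | sort | uniq | wc -l | tr -d ' ')
printf '%s %s\n' "$unsorted_groups" "$sorted_groups"

## A common ranking pattern: count, then sort by the count, largest first.
ranking=$(printf '%s\n' "$shells" | sort | uniq -c | sort -k 1,1nr | awk '{print $1, $2}')
printf '%s\n' "$ranking"
```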

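5. The paste command

Purpose: merge the lines of files side by side, e.g. paste file1 file2

Options:

-d <delimiters>  use the given delimiter character(s) instead of the default tab (\t).
-s               serial mode: join all lines of one file into a single line, instead of merging files in parallel.

Common usage:

head /etc/passwd |cut -d ':' -f 3 | paste - -    ## reshape one column into two columns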
## 0    1
## 2    3
## 4    5
## 6    7
## 8    9
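
The -d and -s options listed above can be sketched as follows, using short made-up number sequences as input. Each "-" argument makes paste read one line from standard input in turn.

```shell
## -s joins all input lines into one line; -d ',' uses a comma instead of a tab.
joined=$(printf '0\n1\n2\n3\n' | paste -s -d ',' -)
printf '%s\n' "$joined"

## Three "-" arguments consume three lines per output row: three columns.
three_cols=$(printf '1\n2\n3\n4\n5\n6\n' | paste -d ' ' - - -)
printf '%s\n' "$three_cols"
```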

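6. The grep command

grep is a powerful text-search tool: it searches text for a pattern and prints the lines that match. It is also very fast, so prefer it for filtering text whenever it fits. The pattern can be a literal string or a regular expression. Usage:

grep [options] "pattern" [file]

Options:

-v     ## invert the match: print the lines that do not match
-w     ## match whole words only
-c     ## print the number of matching lines
-o     ## print only the matched part of each line
-B     ## print the given number of lines before each match
-A     ## print the given number of lines after each match
-E     ## interpret the pattern as an extended regular expression

Examples: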

cat readme.txt 
## Welcome to Biotrainee() !
## This is your personal account in our Cloud.
## Have a fun with it.
## Please feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt |grep 'me'
## Welcome to Biotrainee() !
## Please feel free to contact with me( email to jmzeng1314@163.com )
cat readme.txt |grep -v 'me'
## This is your personal account in our Cloud.
## Have a fun with it.
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt |grep -w 'me'
## Please feel free to contact with me( email to jmzeng1314@163.com )
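
The remaining options from the list (-c, -o, -E, -A) can be sketched the same way. The text below is an inline copy of readme.txt so the snippet runs on its own, and the e-mail pattern is a deliberately simple illustration rather than a complete address matcher. -o and -A are not part of POSIX, but both GNU and BSD grep provide them.

```shell
## Inline copy of the readme.txt contents shown above.
text='Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)'

## -c counts the matching lines instead of printing them.
count_me=$(printf '%s\n' "$text" | grep -c 'me')
printf '%s\n' "$count_me"

## -o prints only the matched part; -E enables extended regular expressions.
email=$(printf '%s\n' "$text" | grep -o -E '[[:alnum:]]+@[[:alnum:].]+')
printf '%s\n' "$email"

## -A 1 also prints one line after each match (-B 1 would print one before).
context=$(printf '%s\n' "$text" | grep -A 1 'fun')
printf '%s\n' "$context"
```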

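7. The sed command

The sed editor works as follows:

(1) Read one line of data from the input.

(2) Match that data against the supplied editor commands.

(3) Modify the data in the stream as the commands specify.

(4) Write the new data to STDOUT.

The sed command has the following form.

sed  options  script  file

options (used less often; no need to memorize them all):

-n: quiet (silent) mode. By default sed prints every input line to the screen; with -n, only the lines that a command such as p explicitly prints are shown.
-e: give the sed script directly on the command line; it can be repeated to apply several scripts in order.
-f: read the sed script from a file; -f filename runs the sed commands stored in filename.
-r: use extended regular expressions (the default is basic regular expressions). This is the GNU sed spelling; -E is accepted by both GNU and BSD sed.
-i: edit the input file in place instead of printing the result to the screen.

script:

a: append. The text after a is added as a new line below the addressed line.
i: insert. The text after i is added as a new line above the addressed line.
c: change. The text after c replaces the addressed lines, for example the lines from n1 to n2.
d: delete. Removes a single line, a range of lines, or every line that matches a pattern.
p: print. Used together with -n to output only the matching lines.
s: substitute. The form s/aaa/bbb/ replaces aaa with bbb.

Common usage: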

cat readme.txt 
## Welcome to Biotrainee() !
## This is your personal account in our Cloud.
## Have a fun with it.
## Please feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt | sed 's/e/E/'     ## run sed's s (substitute) command: replace the first e on each line with E
## WElcome to Biotrainee() !
## This is your pErsonal account in our Cloud.
## HavE a fun with it.
## PlEase feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainEe.com/thread-1376-1-1.html)
cat readme.txt | sed 's/e/E/2'    ## replace the 2nd e on each line with E; the trailing 2 selects the 2nd match
## WelcomE to Biotrainee() !
## This is your personal account in our Cloud.
## Have a fun with it.
## PleasE feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotraineE.com/thread-1376-1-1.html)
cat readme.txt | sed 's/e/E/g'         ## replace every e on each line with E; the trailing g stands for global
## WElcomE to BiotrainEE() !
## This is your pErsonal account in our Cloud.
## HavE a fun with it.
## PlEasE fEEl frEE to contact with mE( Email to jmzEng1314@163.com )
## (http://www.biotrainEE.com/thrEad-1376-1-1.html)
cat readme.txt | sed '1s/e/E/g'        ## modify only the first line
## WElcomE to BiotrainEE() !
## This is your personal account in our Cloud.
## Have a fun with it.
## Please feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt | sed '1,3d'            ## delete lines 1 through 3
## Please feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt | sed '/me/d'           ## delete the lines that match me
## This is your personal account in our Cloud.
## Have a fun with it.
## (http://www.biotrainee.com/thread-1376-1-1.html)
cat readme.txt | sed '1a Hello world'  ## append a new line, 'Hello world', after the first line (one-line form accepted by GNU sed)
## Welcome to Biotrainee() !
## Hello world
## This is your personal account in our Cloud.
## Have a fun with it.
## Please feel free to contact with me( email to jmzeng1314@163.com )
## (http://www.biotrainee.com/thread-1376-1-1.html)
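
The -n option with the p command, and repeated -e scripts, are described above but not demonstrated. The following is a minimal sketch on an inline copy of the first three lines of readme.txt.

```shell
## Inline copy of the first three lines of readme.txt.
text='Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.'

## -n suppresses automatic printing; p prints only the lines matching /fun/.
only_fun=$(printf '%s\n' "$text" | sed -n '/fun/p')
printf '%s\n' "$only_fun"

## The same idea with a line-number range: print lines 2 through 3.
lines_2_3=$(printf '%s\n' "$text" | sed -n '2,3p')
printf '%s\n' "$lines_2_3"

## Several -e scripts are applied in order: drop line 1, then substitute.
two_edits=$(printf '%s\n' "$text" | sed -e '1d' -e 's/fun/FUN/')
printf '%s\n' "$two_edits"
```

Without -n, the p command would print every matching line twice: once automatically and once because of p.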

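8. The awk command

awk excels at text processing and comes with a complete set of commands and a full programming model. For a manual, see http://man.linuxde.net/awk. The first thing to understand is that awk processes data line by line. In shell terms, if you think of a document as a table, each line is a record and each column is a field. At the core of awk, $0 stands for the whole line (all columns), and $1, $2, and so on stand for the corresponding columns. By default, columns are separated by runs of whitespace (spaces and tabs), not by tabs alone.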

cat example.bed
## chr1 26 39
## chr1 32 47
## chr3 11 28
awk '{print $0}' example.bed             ## print all columns
## chr1 26 39
## chr1 32 47
## chr3 11 28
awk '{print $1}' example.bed             ## print the first column
## chr1
## chr1
## chr3

awk '{ print "test\t" $2 "\t" $3}' example.bed     ## print the literal text test, then columns 2 and 3, joined by tabs
## test 26 39
## test 32 47
## test 11 28

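As a programming language, awk supports the usual constructs (variables, arithmetic, logic, conditionals, functions, and so on). An awk invocation has the following basic form:

awk options program file

options (-F is the most commonly used):

-F    fs            set the field separator used to split each line into fields
-f    file          read the program from the given file
-v    var=value     define a variable for the awk program and set its initial value
-mf    N            set the maximum number of fields (a historical option; gawk imposes no such limit)
-mr    N            set the maximum record size (a historical option; gawk imposes no such limit)
-W    keyword       set awk's compatibility mode or warning level

program:

The program that awk runs, made up of variables, arithmetic, logic, conditionals, functions, and so on.

Variables (FS and OFS control how awk splits the input into fields and joins fields on output):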

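## Separator variables
FIELDWIDTHS            a space-separated list of numbers giving the exact width of each data field (gawk extension)
FS                     input field separator
RS                     input record separator
OFS                    output field separator
ORS                    output record separator
## Data variables
NF                     number of fields in the current record, i.e. the column count of the current line
NR                     number of input records processed so far, i.e. the current line number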

head -5 /etc/passwd
## root:x:0:0:root:/root:/bin/bash
## daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
## bin:x:2:2:bin:/bin:/usr/sbin/nologin
## sys:x:3:3:sys:/dev:/usr/sbin/nologin
## sync:x:4:65534:sync:/bin:/bin/sync
head -5 /etc/passwd|awk -F ':' '{print $1}'     ## use : as the column separator and print the first column
## root
## daemon
## bin
## sys
## sync
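head -5 /etc/passwd|awk -F ':' '{$1="root";print $0}'   ## set the first column to root on every line, then print all columns; assigning to a field rebuilds $0 with the output separator OFS, which defaults to a space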
## root x 0 0 root /root /bin/bash
## root x 1 1 daemon /usr/sbin /usr/sbin/nologin
## root x 2 2 bin /bin /usr/sbin/nologin
## root x 3 3 sys /dev /usr/sbin/nologin
## root x 4 65534 sync /bin /bin/sync
head -5 /etc/passwd|awk -F ':' '{print NR "\t" NF }'    ## print each line's number (NR) and its number of columns (NF)
## 1    7
## 2    7
## 3    7
## 4    7
## 5    7
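
Since the default OFS is a space, colon-separated input comes out space-separated as soon as a field is modified. The sketch below shows one way to control the output separator with -v OFS=...; the two sample lines are hypothetical /etc/passwd-style data.

```shell
## Hypothetical sample data in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync'

## The commas in print insert OFS between the values; $NF is the last field.
tabbed=$(printf '%s\n' "$sample" | awk -F ':' -v OFS='\t' '{print $1, $3, $NF}')
printf '%s\n' "$tabbed"

## Assigning a field to itself rebuilds $0, joining every field with OFS.
rebuilt=$(printf '%s\n' "$sample" | awk -F ':' -v OFS='-' '{$1 = $1; print}')
printf '%s\n' "$rebuilt"
```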

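Arithmetic: addition +, subtraction -, multiplication *, division /, remainder %

Comparisons:

x == y: x is equal to y.
x <= y: x is less than or equal to y.
x <  y: x is less than y.
x >= y: x is greater than or equal to y.
x >  y: x is greater than y.
x != y: x is not equal to y.

head -5 /etc/passwd|awk  '(NR%2==1){print NR,$0 }'  ## print the odd-numbered lines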
## 1 root:x:0:0:root:/root:/bin/bash
## 3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
## 5 sync:x:4:65534:sync:/bin:/bin/sync
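
Arithmetic and comparisons are often combined in a single condition. As a sketch, the following filters the example.bed intervals by length. The data is repeated inline so the snippet is self-contained, and the threshold of 15 is an arbitrary value chosen for illustration.

```shell
## Inline copy of example.bed: chromosome, start, end.
bed='chr1 26 39
chr1 32 47
chr3 11 28'

## Keep chr1 intervals whose length ($3 - $2) is at least 15, and print the length.
long_chr1=$(printf '%s\n' "$bed" | awk '($1 == "chr1") && ($3 - $2 >= 15) {print $0, $3 - $2}')
printf '%s\n' "$long_chr1"

## The remainder operator selects the even-numbered lines.
even_lines=$(printf '%s\n' "$bed" | awk 'NR % 2 == 0')
printf '%s\n' "$even_lines"
```

A condition with no action block, as in the second command, prints the matching line unchanged.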

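Built-in functions:

Function	Description
asort(s [,d])	Sorts array s by element value. The indices are replaced with consecutive numbers that reflect the new sort order. If d is given, the sorted array is stored in array d (gawk extension)
asorti(s [,d])	Sorts array s by index. The resulting array holds the original indices as its element values, with consecutive numeric indices reflecting the sort order. If d is given, the sorted array is stored in array d (gawk extension)
gensub(r, s, h [, t])	Searches $0, or the target string t if given, for matches of the regular expression r. If h is a string beginning with g or G, every match is replaced with s; if h is a number, only the h-th match of r is replaced. The result is returned and the original string is left unchanged (gawk extension)
gsub(r, s [,t])	Searches $0, or the target string t if given, for matches of the regular expression r and replaces all of them with the string s
index(s, t)	Returns the position of string t within string s, or 0 if it is not found
length([s])	Returns the length of string s, or the length of $0 if s is omitted
match(s, r [,a])	Returns the position in string s where the regular expression r matches. If array a is given, the matched part of s is stored in it (the array argument is a gawk extension)
split(s, a [,r])	Splits s into array a using FS, or the regular expression r if given. Returns the number of pieces
sprintf(format,variables)	Returns a string formatted like printf output, built from the given format and variables
sub(r, s [,t])	Searches $0, or the target string t if given, for the regular expression r and replaces the first match with the string s
substr(s, i [,n])	Returns the substring of s that starts at position i and is n characters long. If n is omitted, returns the rest of s
tolower(s)	Converts all characters in s to lowercase
toupper(s)	Converts all characters in s to uppercase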

head -5 /etc/passwd|awk  '(length($0)<40){print NR,$0 }'      ## print the lines shorter than 40 characters
## 1 root:x:0:0:root:/root:/bin/bash
## 3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
## 4 sys:x:3:3:sys:/dev:/usr/sbin/nologin
## 5 sync:x:4:65534:sync:/bin:/bin/sync
head -5 /etc/passwd|awk  '{print NR,length($0) }'             ## print the length of each line
## 1 31
## 2 47
## 3 36
## 4 36
## 5 34
head -5 /etc/passwd|awk '{print NR,substr($0,5,10) }'         ## print 10 characters of $0, starting at the 5th character
## 1 :x:0:0:roo
## 2 on:x:1:1:d
## 3 x:2:2:bin:
## 4 x:3:3:sys:
## 5 :x:4:65534
head -5 /etc/passwd|awk '{sub("root","00",$0);print NR,$0 }'  ## replace the first root on each line with 00
## 1 00:x:0:0:root:/root:/bin/bash
## 2 daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
## 3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
## 4 sys:x:3:3:sys:/dev:/usr/sbin/nologin
## 5 sync:x:4:65534:sync:/bin:/bin/sync
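head -5 /etc/passwd|awk  '{sub("^....","0000",$0);print NR,$0 }'    ## replace the first 4 characters of each line with 0000; this introduces regular-expression metacharacters: ^ anchors the start of the line and a dot . matches any single character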
## 1 0000:x:0:0:root:/root:/bin/bash
## 2 0000on:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
## 3 0000x:2:2:bin:/bin:/usr/sbin/nologin
## 4 0000x:3:3:sys:/dev:/usr/sbin/nologin
## 5 0000:x:4:65534:sync:/bin:/bin/sync
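
A few more functions from the table, sketched on a single hypothetical /etc/passwd-style line. Only functions available in POSIX awk are used here (split, gsub, index, substr, toupper), so the snippet should not depend on gawk.

```shell
## One hypothetical record in /etc/passwd format.
line='root:x:0:0:root:/root:/bin/bash'

## split() fills an array and returns the number of pieces.
n_parts=$(printf '%s\n' "$line" | awk '{n = split($0, parts, ":"); print n, parts[7]}')
printf '%s\n' "$n_parts"

## gsub() replaces every match in $0 and returns the number of replacements.
replaced=$(printf '%s\n' "$line" | awk '{n = gsub(/root/, "ROOT"); print n, $0}')
printf '%s\n' "$replaced"

## index() returns a 1-based position; toupper() converts to uppercase.
pos_upper=$(printf '%s\n' "$line" | awk '{print index($0, "/bin"), toupper(substr($0, 1, 4))}')
printf '%s\n' "$pos_upper"
```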

One common awk idiom is also worth learning:

head -5 /etc/passwd|awk 'BEGIN{cha=0} {cha=cha+length($0);print cha} END{print cha}'
## 31
## 78
## 114
## 150
## 184
## 184

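Here, before the main program runs, BEGIN{cha=0} defines a variable cha with an initial value of 0. The main block then adds the length of each line's $0 to cha and prints the running total after every line. Finally, END{print cha} prints the final value of cha once more.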
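
The same BEGIN / main block / END pattern is a natural fit for summaries. As a sketch, the following totals and averages the interval lengths in the example.bed data (repeated inline) and counts lines per chromosome. The names total and count are illustrative, and the associative array in the second command goes slightly beyond what is covered above.

```shell
## Inline copy of example.bed: chromosome, start, end.
bed='chr1 26 39
chr1 32 47
chr3 11 28'

## BEGIN runs once before any input, the middle block once per line,
## and END once after the last line, when NR holds the total line count.
summary=$(printf '%s\n' "$bed" | awk '
    BEGIN { total = 0 }
    { total += $3 - $2 }
    END { print NR, total, total / NR }')
printf '%s\n' "$summary"

## Count lines per chromosome; for-in order is unspecified, so sort the result.
per_chrom=$(printf '%s\n' "$bed" | awk '{count[$1]++} END {for (c in count) print c, count[c]}' | sort)
printf '%s\n' "$per_chrom"
```

Strictly speaking, BEGIN { total = 0 } is optional, because an unset awk variable already behaves as 0 in arithmetic; writing it out just makes the starting value explicit.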