An Introduction to I/O Models

Basic classification of I/O models

(1) Blocking I/O (synchronous blocking I/O, BIO): the most common and most traditional I/O model. Code statements execute in order; if one statement has to wait, everything after it is blocked. A typical sequence: read a file, wait for the kernel to return the data, receive the data, process and output it.
[IO multiplexing principles - Figure 1]
The user thread issues an I/O read through the recvfrom system call, switching from user mode to kernel mode (application switches to kernel). The code after the call is blocked and the user thread waits. Once the kernel has loaded the data from disk into kernel space and copied it into user space (kernel switches back to application), the user thread resumes and processes the data.
Drawbacks:
In a high-concurrency multi-threaded scenario (say 100,000 concurrent connections) with one server-side connection per client, the server consumes large amounts of memory and CPU (user-to-kernel context switches), which limits its concurrency.
(2) Synchronous non-blocking I/O (Non-blocking I/O, NIO): a socket is created in blocking mode by default; setting it to NONBLOCK turns the flow into synchronous non-blocking I/O.
Synchronous non-blocking I/O builds on the blocking model by marking the socket NONBLOCK, so the user thread can return immediately after issuing an I/O request. The flow is shown below:
[IO multiplexing principles - Figure 2]
As the figure shows, the user thread calls recvfrom three times while the kernel's data is not yet ready; because the socket is non-blocking, each call returns an error immediately with errno set to EWOULDBLOCK. By the fourth call the kernel has copied the data into user space, so the call returns OK, the data can be read, and processing proceeds.
Drawbacks:
1. Although each individual request is non-blocking, the thread must poll repeatedly until the data arrives. With many connections this consumes a great deal of CPU and network bandwidth.
2. Even with an interval between polls there is latency: the data may well become ready in the middle of an interval, which lowers overall throughput.
In short, the user thread loops on the system call, attempting the read until it finally succeeds, and only then proceeds with the rest of the work.
(3) I/O multiplexing (IO Multiplexing): the classic Reactor design pattern, sometimes called asynchronous blocking I/O. Java's Selector and Linux's epoll both follow this model.
With I/O multiplexing, as soon as the kernel finds that one or more of the I/O events a process registered are ready for reading, it notifies the process. Taking select as the example:
[IO multiplexing principles - Figure 3]
In the first two models the user thread calls recvfrom directly and waits for the kernel to return data. With multiplexing the thread instead calls the select system call (or poll, or epoll) and blocks at that statement until the kernel reports which socket file descriptors are readable; the thread then reads those sockets to copy the data into user space.
A single user thread can thereby issue and handle I/O requests on many sockets at once: it registers multiple sockets (to the kernel, a set of file descriptors), then repeatedly calls select to read whichever sockets become active. (Here select looks like a broker sitting between user space and the kernel.)
(4) Asynchronous I/O (Asynchronous IO): the classic Proactor design pattern, also called asynchronous non-blocking I/O.

[IO multiplexing principles - Figure 4]
The user process issues a read and immediately returns to do other work; the kernel likewise returns at once upon receiving the asynchronous read.
Once the data is ready, the kernel copies it into user memory and then either sends the user a signal announcing that the operation has completed or invokes a callback the user registered.
Synchronous vs. asynchronous:
Blocking I/O, non-blocking I/O, and I/O multiplexing are all, strictly speaking, synchronous I/O, because in all three the actual read/write is performed by the application after the I/O event occurs. Asynchronous I/O is different: whether or not the I/O would block, the read/write call always returns immediately, because the real work has been handed off to the kernel. In other words, the synchronous models require the user program to perform the I/O itself (copy data from the kernel buffer into the user buffer, or from the user buffer into the kernel buffer), whereas in the asynchronous model the kernel performs it (the copies between user and kernel buffers happen "in the background"). Put simply: synchronous I/O notifies the application of I/O readiness events; asynchronous I/O notifies it of I/O completion events.

A Deeper Look at the Multiplexing Models: select, poll, epoll

select

```cpp
#include <sys/select.h>
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
```

Parameter explanations:
nfds:
A non-negative integer, one greater than the highest-numbered file descriptor the thread has open in any of the sets: nfds = maxfd + 1.
readfds:
A pointer to an fd_set, the set of file descriptors the thread watches for readability (data arriving is called a "read event", and the kernel returns the readable descriptors in this set). If at least one file in the set is readable, the kernel makes select return a value greater than 0; if none is, select consults the timeout parameter: if the kernel has blocked the thread longer than timeout, select returns 0, and on error it returns a negative value. Pass NULL to ignore read events entirely.
writefds:
The set of file descriptors to watch for writability (i.e. whether the output buffer has room). Pass NULL to ignore write events.
exceptfds:
The set of file descriptors to watch for exceptional conditions.
readfds, writefds, and exceptfds carry the file descriptors of interest for read, write, and exception events respectively. The program passes in only the descriptors it cares about, and the kernel modifies the sets in place to tell the program which descriptors are ready. The fd_set structure is just an array of integers in which every bit of every element stands for one file descriptor; the following macros manipulate the individual bits of an fd_set:

```cpp
FD_ZERO(fd_set *fdset);              // clear every bit in fdset
FD_SET(int fd, fd_set *fdset);       // set the bit for fd in fdset
FD_CLR(int fd, fd_set *fdset);       // clear the bit for fd in fdset
int FD_ISSET(int fd, fd_set *fdset); // test whether the bit for fd is set
```

timeout:
The deadline for select(), which puts select in one of three states:
(1) Pass NULL and select blocks indefinitely: the thread waits until some descriptor in the watched sets changes state.
(2) Set the value to 0 seconds 0 microseconds and select is non-blocking: it returns immediately whether or not any descriptor has changed, returning 0 if nothing changed and a positive value otherwise.
(3) Set timeout greater than 0 and select blocks for at most that long: it returns 0 if the deadline expires with no descriptor ready, and otherwise returns as soon as one becomes ready.
(4) If a signal is delivered while select is waiting, select returns -1 immediately with errno set to EINTR.

How select works:

select() is built around the fd_set data structure, in practice an array of longs, each element of which can be associated with an open file handle (a socket, a regular file, a named pipe, a device handle, and so on). The programmer establishes those associations; when select() is called, the kernel modifies the contents of the fd_set according to I/O state, thereby telling the process that called select() which socket or file is readable.
Flow-wise, issuing an I/O request through select is not much different from the blocking model; it even adds extra work (registering the sockets to watch, calling select itself), so a single request is, if anything, less efficient. The decisive advantage is that one thread can service I/O requests on many sockets at once: the user registers several sockets, then loops on select to read whichever ones become active, handling many I/O requests in the same thread. Under the blocking model, that goal can only be reached with one thread per socket.

How a select call proceeds:

  1. Copy the fd_set from user space into kernel space with copy_from_user.
  2. Register the callback __pollwait.
  3. Iterate over all the fds and call each one's poll method (for a socket, this poll method is sock_poll, which dispatches to tcp_poll, udp_poll, or datagram_poll as appropriate).
  4. Taking tcp_poll as the example, its core is __pollwait, i.e. the callback registered above.
  5. __pollwait's main job is to hang current (the calling process) on the device's wait queue. Different devices have different wait queues; for tcp_poll it is sk->sk_sleep. (Being on the wait queue does not mean the process is asleep yet.) When the device receives a message (network device) or finishes filling in file data (disk device), it wakes the processes sleeping on its wait queue, at which point current is woken.
  6. The poll method returns a mask describing whether reads and writes are ready, and the fd_set is filled in according to that mask.
  7. If the loop over all fds finishes without producing any readable/writable mask, schedule_timeout is called to put the select-ing process (current) to sleep. When a device driver finds its own resources readable or writable, it wakes the processes sleeping on its wait queue. If nothing wakes it within the timeout (as given to schedule_timeout), the process is woken anyway, regains the CPU, and loops over the fds again to re-check readiness.
  8. Copy the fd_set from kernel space back to user space.

select's advantages:

select is supported on virtually every platform; its cross-platform support is excellent.

select's drawbacks:

  1. Every call to select copies the fd_set from user space into the kernel, which is expensive when the set is large.
  2. Every call also makes the kernel walk the entire fd_set that was passed in, again expensive for a large set.
  3. The number of descriptors a single process can watch has a hard cap, normally 1024. The limit can be raised, but because select scans descriptors by polling, performance degrades as the count grows. (The Linux kernel headers contain the definition: #define __FD_SETSIZE 1024)
  4. The kernel/user-space copy problem: select has to copy a large number of handle structures back and forth, a substantial overhead.
  5. select returns the array covering every handle, and the application must walk the whole array to discover which handles had events.
  6. select is level-triggered: if the application does not complete the I/O on a descriptor that was reported ready, every subsequent select call will report that descriptor to the process again.

poll

poll's mechanism is similar to select's; there is no essential difference. It too manages multiple descriptors by polling and acting on each descriptor's state. Unlike select, poll takes the descriptors in a caller-supplied pollfd array of arbitrary length rather than a fixed-size fd_set, so it has no cap on the number of file descriptors.
poll's interface:

```cpp
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

typedef struct pollfd {
    int fd;        // file descriptor to monitor
    short events;  // events of interest on fd
    short revents; // events that actually occurred on fd
} pollfd_t;
```

  1. poll changes the data structure describing the descriptor set: it uses pollfd structures instead of select's fd_set, so the number of descriptors poll supports is far beyond select's 1024 limit. fds is an array of struct pollfd holding the socket descriptors whose state should be checked, and the array is not cleared by the poll call. Each pollfd describes one monitored file descriptor, and passing the array directs poll() to watch many descriptors at once. The events field of the structure is the event mask of interest on that descriptor, set by the caller; the revents field is the result mask of events that actually occurred, set by the kernel when the call returns. The kernel only ever rewrites the revents member and leaves events untouched, so the event-set parameter does not need to be re-initialized before the next poll call.
  2. nfds is the total number of descriptors in the fds array.
epoll

epoll was formally introduced in Linux kernel 2.6. It is an event-driven I/O mechanism: unlike select it imposes no limit on the number of descriptors, it uses a single file descriptor to manage many others, and it stores the events the user cares about in an event table inside the kernel, so the copy between user space and kernel space is needed only once.
epoll provides three functions:

```cpp
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```

How to use epoll:

  1. epoll_create creates an epoll handle; the size parameter hints to the kernel how many descriptors will be monitored (since Linux 2.6.8 it is ignored but must be greater than 0). On success it returns an epoll handle descriptor; on failure it returns -1.
  2. epoll_ctl registers the event types to monitor. Its four parameters are:
    • epfd: the epoll handle
    • op: the operation on fd, one of three:
      • EPOLL_CTL_ADD: register a new fd with epfd
      • EPOLL_CTL_MOD: change the monitored events of an already-registered fd
      • EPOLL_CTL_DEL: remove an fd from epfd
    • fd: the descriptor to monitor
    • event: the events to monitor; the epoll_event structure is defined as:

```cpp
struct epoll_event {
    __uint32_t   events; /* Epoll events */
    epoll_data_t data;   /* User data variable */
};

typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;
```

  3. epoll_wait waits for events to become ready. It returns the number of ready events on success, -1 on failure, and 0 on timeout. Its parameters are:
    • epfd: the epoll handle
    • events: the set of ready events returned from the kernel
    • maxevents: tells the kernel how large events is
    • timeout: the wait timeout
  4. epoll is the poll that the Linux kernel reworked to handle very large numbers of file descriptors, an enhanced version of the Linux multiplexing interfaces select/poll. It markedly improves system CPU utilization when a program has a huge number of concurrent connections of which only a few are active at any moment, because fetching events does not require scanning the whole watched descriptor set: only the descriptors that kernel I/O events have asynchronously woken and moved onto the ready queue need to be traversed.
Besides the level-triggered delivery that select/poll offer, epoll also provides edge-triggered delivery, which makes it possible for a user-space program to cache I/O state and cut down on epoll_wait/epoll_pwait calls, improving application efficiency.
  • Level-triggered (LT): the default working mode. When epoll_wait detects that a descriptor's event is ready and notifies the application, the application may postpone handling it; the next epoll_wait call will report the event again.
  • Edge-triggered (ET): when epoll_wait detects a ready event and notifies the application, the application must handle it immediately. If it does not, the next epoll_wait call will not report it again. (Not until some operation makes the descriptor non-ready again; edge triggering notifies exactly once, on the transition from not-ready to ready.)
How epoll works:

When a process calls epoll_create, the Linux kernel creates an eventpoll structure, shown below:
```cpp
// For every epollfd created, the kernel allocates one eventpoll to go
// with it; think of it as the kernel-side representation of the epollfd.
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    /* Held when adding, modifying or removing a monitored fd, and when
     * epoll_wait returns and copies data to user space. Because of it,
     * epoll operations issued from user space are thread-safe: the
     * kernel already serializes them. */
    struct mutex mtx;
    /* Wait queue used by sys_epoll_wait() */
    /* Head of the wait queue; processes on it that call epoll_wait()
     * go to sleep here. */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    /* Used when the epollfd itself is polled via f_op->poll(). */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    /* Every epitem that is already ready sits on this list. */
    struct list_head rdllist;
    /* RB tree root used to store monitored fd structs */
    /* Every epitem being monitored sits in this red-black tree. */
    struct rb_root rbr;
    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    /* While events are being transferred to user space, this singly
     * linked list holds the epitems that arrived in the meantime. */
    struct epitem *ovflist;
    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;
    /* The user that created the eventpoll descriptor */
    /* Per-user state, e.g. the maximum number of fds this user may watch. */
    struct user_struct *user;
    struct file *file;
    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;
};
```

Two members of this structure are central to how epoll is used: one is the root of a red-black tree, the other the head of a doubly linked list. Each epoll object has its own eventpoll structure, which stores the events added to the object through epoll_ctl. All of those events hang off the red-black tree, so duplicate additions can be detected efficiently. Every event added to epoll also establishes a callback relationship with the device (NIC) driver: when the corresponding event fires, the callback runs. That callback is named ep_poll_callback in the kernel, and it appends the ready event to the rdllist doubly linked list.

```cpp
// An epitem represents one monitored fd.
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        /* Red-black tree node. When epoll_ctl() adds a batch of fds to
         * an epollfd, the kernel allocates one epitem per fd and
         * organizes them as a red-black tree whose root is stored in
         * struct eventpoll. */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };
    /* List header used to link this structure to the eventpoll ready list */
    /* List node: every ready epitem is placed on eventpoll's rdllist. */
    struct list_head rdllink;
    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next; // used by eventpoll's ovflist
    /* The file descriptor information this item refers to */
    /* The fd and struct file this epitem corresponds to. */
    struct epoll_filefd ffd;
    /* Number of active wait queue attached to poll operations */
    int nwait; // how many wait queues this epitem is currently on
    /* List containing poll wait queues */
    struct list_head pwqlist;
    /* The "container" of this item */
    /* The eventpoll this epitem belongs to. */
    struct eventpoll *ep;
    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;
    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;
    /* The structure that describe the interested events and the source fd */
    /* Which events this epitem cares about; the data is copied from
     * user space when epoll_ctl runs. */
    struct epoll_event event;
};
```

In epoll, an epitem structure is built for every event; the relation between the epitem members, the red-black tree, and the doubly linked list is shown in the figure below.

[IO multiplexing principles - Figure 5]

At bottom, epoll proceeds in roughly these steps:

  1. epoll_create() builds an epoll object (setting up the red-black tree and doubly-linked-list data structures).
  2. epoll_ctl inserts the events to monitor into the red-black tree and registers the corresponding callback.
  3. When an event becomes ready, the callback runs and moves the ready event onto the ready list (no extra copy is involved).
  4. epoll_wait checks whether the ready list has entries and returns the data on it.

epoll's characteristics:

1. Registrations accumulate: there is no need to pass in the full set of monitored fds on every call.
2. Each fd hangs the process on its own wait queue only once, staying there until the fd is removed from epoll; no repeated enqueueing.
3. The fd event callback is ep_poll_callback; it puts the fd on which the event fired onto epoll's dedicated ready queue rdllist and wakes the process.
4. The process never has to walk every fd to detect events; it only needs to check whether epoll's ready queue rdllist is non-empty.
5. When epoll returns, it returns only the entries on the rdllist ready queue, skipping all unrelated work, so the application layer need not re-scan either.
6. Internally epoll keeps the monitored fds in a red-black tree, supporting fast lookup, modification, and removal even with very large fd counts.

Comparison of select, poll, and epoll

| | select | poll | epoll |
| --- | --- | --- | --- |
| Calling convention | The user passes the readable, writable, and exception event sets in as three separate parameters, and the kernel modifies them in place to report which events are ready, so the caller must re-initialize all three parameters before every call. | All event types are handled uniformly through one event-set parameter: the user sets pollfd.events to the events of interest, and the kernel reports the full (ready and not-ready) result set by modifying pollfd.revents. | The kernel manages all the events the user cares about directly in an event table, so epoll_wait never needs the interest set re-passed; its events parameter is used only to carry the ready events back. |
| Underlying structure | array | linked list | red-black tree + doubly linked list |
| I/O efficiency | linear scan on every call, O(n) | linear scan on every call, O(n) | event notification: whenever an fd becomes ready, the registered callback runs and moves it onto the ready list, O(1) |
| Max connections | 1024 (x86) or 2048 (x64) | unlimited | unlimited |
| fd copying | the fd set is copied from user space to kernel space on every select call | the fd set is copied from user space to kernel space on every poll call | fds are copied into the kernel once at epoll_ctl and kept there; subsequent epoll_wait calls do not copy them again |

Appendix

How epoll is implemented:

Source: https://www.cnblogs.com/leohotfn/p/8516370.html
I. epoll_create

  1. Call ep_alloc() to create a struct eventpoll object. ep_alloc() does the following:
    1a. Fetch some information about the current user.
    1b. Allocate a struct eventpoll object.
    1c. Initialize its members: the wait queues, the ready list, the red-black tree.
  2. Create an anonymous fd and a struct file object to go with it.
  3. Tie the eventpoll and the file together: the eventpoll object is stored in the file object's private_data pointer.

  II. epoll_ctl

  1. Copy the event into kernel space.
  2. Check that the fd being added supports the poll operation.
  3. Based on the user's op argument, and on whether the fd's node is found in eventpoll's red-black tree, perform the matching operation (insert, delete, modify). Taking insertion as the example, ep_insert() runs:
    3a. Allocate an epitem object from the slab cache and initialize its members, e.g. storing the monitored fd and its file structure.
    3b. Name the callback to run at poll_wait() time (again: not epoll_wait), used to wake the process when data is ready. (In essence this initializes the file's wait queue and adds the process to that queue.)
    3c. At this point the epitem is tied to the monitored fd.
    3d. Insert the epitem into eventpoll's red-black tree.

  III. epoll_wait

  1. Call ep_poll():
    1a. Compute the sleep time, if any.
    1b. Check whether eventpoll's ready list is empty; if it is not, process it right away rather than sleeping.
    1c. Add the current process to eventpoll's wait queue.
    1d. Enter the loop.
    1e. Set the current process to the TASK_INTERRUPTIBLE state, then check whether a signal has arrived; if not, go to sleep.
    1f. On timeout or signal wake-up, leave the loop.
    1g. Remove the current process from the wait queue and set its state back to TASK_RUNNING.
    1h. Copy the data out to user space. The copy proceeds by first moving the ready list onto an intermediate list, then walking the intermediate list copying to user space, and for each node checking whether it is level-triggered; if so, the node is reinserted into the ready list.

Appendix
epoll source code (annotated)

  1. // @ fs/eventpoll.c
  2. /*
  3. * This structure is stored inside the "private_data" member of the file
  4. * structure and represents the main data structure for the eventpoll
  5. * interface.
  6. */
  7. // 每创建一个epollfd,内核就会分配一个eventpoll与之对应,可以理解成内核态的epollfd.
  8. struct eventpoll {
  9. /* Protect the access to this structure */
  10. spinlock_t lock;
  11. /*
  12. * This mutex is used to ensure that files are not removed
  13. * while epoll is using them. This is held during the event
  14. * collection loop, the file cleanup path, the epoll file exit
  15. * code and the ctl operations.
  16. */
  17. /**
  18. 添加,修改或删除监听fd的时候,以及epoll_wait返回,向用户空间传递数据时,都会持有这个互斥锁.
  19. 因此,在用户空间中执行epoll相关操作是线程安全的,内核已经做了保护.
  20. **/
  21. struct mutex mtx;
  22. /* Wait queue used by sys_epoll_wait() */
  23. /**
  24. 等待队列头部.
  25. 当在该等待队列中的进程调用epoll_wait()时,会进入睡眠.
  26. **/
  27. wait_queue_head_t wq;
  28. /* Wait queue used by file->poll() */
  29. /**
  30. 用于epollfd被f_op->poll()的时候
  31. **/
  32. wait_queue_head_t poll_wait;
  33. /* List of ready file descriptors */
  34. /**
  35. 所有已经ready的epitem被存放在这个链表里
  36. **/
  37. struct list_head rdllist;
  38. /* RB tree root used to store monitored fd structs */
  39. /**
  40. 所有待监听的epitem被存放在这个红黑树里
  41. **/
  42. struct rb_root rbr;
  43. /*
  44. * This is a single linked list that chains all the "struct epitem" that
  45. * happened while transferring ready events to userspace w/out
  46. * holding ->lock.
  47. */
  48. /**
  49. 当event转移到用户空间时,这个单链表存放着所有struct epitem
  50. **/
  51. struct epitem *ovflist;
  52. /* wakeup_source used when ep_scan_ready_list is running */
  53. struct wakeup_source *ws; // TODO
  54. /* The user that created the eventpoll descriptor */
  55. /**
  56. 这里存放了一些用户变量,比如fd监听数量的最大值等
  57. **/
  58. struct user_struct *user;
  59. struct file *file;
  60. /* used to optimize loop detection check */ // TODO
  61. int visited;
  62. struct list_head visited_list_link;
  63. };
  64. /*
  65. * Each file descriptor added to the eventpoll interface will
  66. * have an entry of this type linked to the "rbr" RB tree.
  67. * Avoid increasing the size of this struct, there can be many thousands
  68. * of these on a server and we do not want this to take another cache line.
  69. */
  70. // epitem表示一个被监听的fd
  71. struct epitem {
  72. union {
  73. /* RB tree node links this structure to the eventpoll RB tree */
  74. /**
  75. 红黑树结点,当使用epoll_ctl()将一批fd加入到某个epollfd时,内核会分配一批epitem与fd一一对应,
  76. 并且以红黑树的形式来组织它们,tree的root被存放在struct eventpoll中.
  77. **/
  78. struct rb_node rbn;
  79. /* Used to free the struct epitem */
  80. struct rcu_head rcu; // TODO
  81. };
  82. /* List header used to link this structure to the eventpoll ready list */
  83. /**
  84. 链表结点,所有已经ready的epitem都会被存放在eventpoll的rdllist链表中.
  85. **/
  86. struct list_head rdllink;
  87. /*
  88. * Works together "struct eventpoll"->ovflist in keeping the
  89. * single linked chain of items.
  90. */
  91. struct epitem *next; // 用于eventpoll的ovflist
  92. /* The file descriptor information this item refers to */
  93. /**
  94. epitem对应的fd和struct file
  95. **/
  96. struct epoll_filefd ffd;
  97. /* Number of active wait queue attached to poll operations */
  98. int nwait; // 当前epitem被加入到多少个等待队列中
  99. /* List containing poll wait queues */
  100. struct list_head pwqlist;
  101. /* The "container" of this item */
  102. /**
  103. 当前epitem属于那个eventpoll
  104. **/
  105. struct eventpoll *ep;
  106. /* List header used to link this item to the "struct file" items list */
  107. struct list_head fllink;
  108. /* wakeup_source used when EPOLLWAKEUP is set */
  109. struct wakeup_source __rcu *ws;
  110. /* The structure that describe the interested events and the source fd */
  111. /**
  112. 当前epitem关心哪些event,这个数据是由执行epoll_ctl时从用户空间传递过来的
  113. **/
  114. struct epoll_event event;
  115. };
  116. struct epoll_filefd {
  117. struct file *file;
  118. int fd;
  119. } __packed;
  120. /* Wait structure used by the poll hooks */
  121. struct eppoll_entry {
  122. /* List header used to link this structure to the "struct epitem" */
  123. struct list_head llink;
  124. /* The "base" pointer is set to the container "struct epitem" */
  125. struct epitem *base;
  126. /*
  127. * Wait queue item that will be linked to the target file wait
  128. * queue head.
  129. */
  130. wait_queue_t wait;
  131. /* The wait queue head that linked the "wait" wait queue item */
  132. wait_queue_head_t *whead;
  133. };
  134. /* Used by the ep_send_events() function as callback private data */
  135. struct ep_send_events_data {
  136. int maxevents;
  137. struct epoll_event __user *events;
  138. };
  139. /**
  140. 调用epoll_create()的实质,就是调用epoll_create1().
  141. **/
  142. SYSCALL_DEFINE1(epoll_create, int, size)
  143. {
  144. if (size <= 0)
  145. return -EINVAL;
  146. return sys_epoll_create1(0);
  147. }
  148. /*
  149. * Open an eventpoll file descriptor.
  150. */
  151. SYSCALL_DEFINE1(epoll_create1, int, flags)
  152. {
  153. int error, fd;
  154. struct eventpoll *ep = NULL;
  155. struct file *file;
  156. /* Check the EPOLL_* constant for consistency. */
  157. BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
  158. /**
  159. 对于epoll来说,目前唯一有效的FLAG是CLOSEXEC
  160. **/
  161. if (flags & ~EPOLL_CLOEXEC)
  162. return -EINVAL;
  163. /*
  164. * Create the internal data structure ("struct eventpoll").
  165. */
  166. /**
  167. 分配一个struct eventpoll,ep_alloc()的具体分析在下面
  168. **/
  169. error = ep_alloc(&ep);
  170. if (error < 0)
  171. return error;
  172. /*
  173. * Creates all the items needed to setup an eventpoll file. That is,
  174. * a file structure and a free file descriptor.
  175. */
  176. fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC)); // TODO
  177. if (fd < 0) {
  178. error = fd;
  179. goto out_free_ep;
  180. }
  181. /**
  182. 创建一个匿名fd.
  183. epollfd本身并不存在一个真正的文件与之对应,所以内核需要创建一个"虚拟"的文件,并为之分配
  184. 真正的struct file结构,并且具有真正的fd.
  185. **/
  186. file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
  187. O_RDWR | (flags & O_CLOEXEC));
  188. if (IS_ERR(file)) {
  189. error = PTR_ERR(file);
  190. goto out_free_fd;
  191. }
  192. ep->file = file;
  193. fd_install(fd, file);
  194. return fd;
  195. out_free_fd:
  196. put_unused_fd(fd);
  197. out_free_ep:
  198. ep_free(ep);
  199. return error;
  200. }
  201. /**
  202. 分配一个eventpoll结构
  203. **/
  204. static int ep_alloc(struct eventpoll **pep)
  205. {
  206. int error;
  207. struct user_struct *user;
  208. struct eventpoll *ep;
  209. /**
  210. 获取当前用户的一些信息,比如最大监听fd数目
  211. **/
  212. user = get_current_user();
  213. error = -ENOMEM;
  214. ep = kzalloc(sizeof(*ep), GFP_KERNEL); // 话说分配eventpoll对象是使用slab还是用buddy呢?TODO
  215. if (unlikely(!ep))
  216. goto free_uid;
  217. /**
  218. 初始化
  219. **/
  220. spin_lock_init(&ep->lock);
  221. mutex_init(&ep->mtx);
  222. init_waitqueue_head(&ep->wq);
  223. init_waitqueue_head(&ep->poll_wait);
  224. INIT_LIST_HEAD(&ep->rdllist);
  225. ep->rbr = RB_ROOT;
  226. ep->ovflist = EP_UNACTIVE_PTR;
  227. ep->user = user;
  228. *pep = ep;
  229. return 0;
  230. free_uid:
  231. free_uid(user);
  232. return error;
  233. }
  234. /*
  235. * The following function implements the controller interface for
  236. * the eventpoll file that enables the insertion/removal/change of
  237. * file descriptors inside the interest set.
  238. */
  239. /**
  240. 调用epool_ctl来添加要监听的fd.
  241. 参数说明:
  242. epfd,即epollfd
  243. op,操作,ADD,MOD,DEL
  244. fd,需要监听的文件描述符
  245. event,关心的events
  246. **/
  247. SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
  248. struct epoll_event __user *, event)
  249. {
  250. int error;
  251. int full_check = 0;
  252. struct fd f, tf;
  253. struct eventpoll *ep;
  254. struct epitem *epi;
  255. struct epoll_event epds;
  256. struct eventpoll *tep = NULL;
  257. error = -EFAULT;
  258. /**
  259. 错误处理以及
  260. 将event从用户空间拷贝到内核空间.
  261. **/
  262. if (ep_op_has_event(op) &&
  263. copy_from_user(&epds, event, sizeof(struct epoll_event)))
  264. goto error_return;
  265. error = -EBADF;
  266. /**
  267. 获取epollfd的file结构,该结构在epoll_create1()中,由函数anon_inode_getfile()分配
  268. **/
  269. f = fdget(epfd);
  270. if (!f.file)
  271. goto error_return;
  272. /* Get the "struct file *" for the target file */
  273. /**
  274. 获取待监听的fd的file结构
  275. **/
  276. tf = fdget(fd);
  277. if (!tf.file)
  278. goto error_fput;
  279. /* The target file descriptor must support poll */
  280. error = -EPERM;
  281. /**
  282. 待监听的文件一定要支持poll.
  283. 话说什么情况下文件不支持poll呢?TODO
  284. **/
  285. if (!tf.file->f_op->poll)
  286. goto error_tgt_fput;
  287. /* Check if EPOLLWAKEUP is allowed */
  288. if (ep_op_has_event(op))
  289. ep_take_care_of_epollwakeup(&epds);
  290. /*
  291. * We have to check that the file structure underneath the file descriptor
  292. * the user passed to us _is_ an eventpoll file. And also we do not permit
  293. * adding an epoll file descriptor inside itself.
  294. */
  295. error = -EINVAL;
  296. /**
  297. epollfd不能监听自己
  298. **/
  299. if (f.file == tf.file || !is_file_epoll(f.file))
  300. goto error_tgt_fput;
  301. /*
  302. * At this point it is safe to assume that the "private_data" contains
  303. * our own data structure.
  304. */
  305. /**
  306. 获取eventpoll结构,来自于epoll_create1()的分配
  307. **/
  308. ep = f.file->private_data;
  309. /*
  310. * When we insert an epoll file descriptor, inside another epoll file
  311. * descriptor, there is the change of creating closed loops, which are
  312. * better be handled here, than in more critical paths. While we are
  313. * checking for loops we also determine the list of files reachable
  314. * and hang them on the tfile_check_list, so we can check that we
  315. * haven't created too many possible wakeup paths.
  316. *
  317. * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
  318. * the epoll file descriptor is attaching directly to a wakeup source,
  319. * unless the epoll file descriptor is nested. The purpose of taking the
  320. * 'epmutex' on add is to prevent complex toplogies such as loops and
  321. * deep wakeup paths from forming in parallel through multiple
  322. * EPOLL_CTL_ADD operations.
  323. */
  324. /**
  325. 以下操作可能会修改数据结构内容,锁
  326. **/
  327. // TODO
  328. mutex_lock_nested(&ep->mtx, 0);
  329. if (op == EPOLL_CTL_ADD) {
  330. if (!list_empty(&f.file->f_ep_links) ||
  331. is_file_epoll(tf.file)) {
  332. full_check = 1;
  333. mutex_unlock(&ep->mtx);
  334. mutex_lock(&epmutex);
  335. if (is_file_epoll(tf.file)) {
  336. error = -ELOOP;
  337. if (ep_loop_check(ep, tf.file) != 0) {
  338. clear_tfile_check_list();
  339. goto error_tgt_fput;
  340. }
  341. } else
  342. list_add(&tf.file->f_tfile_llink,
  343. &tfile_check_list);
  344. mutex_lock_nested(&ep->mtx, 0);
  345. if (is_file_epoll(tf.file)) {
  346. tep = tf.file->private_data;
  347. mutex_lock_nested(&tep->mtx, 1);
  348. }
  349. }
  350. }
  351. /*
  352. * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
  353. * above, we can be sure to be able to use the item looked up by
  354. * ep_find() till we release the mutex.
  355. */
  356. /**
  357. 对于每一个监听的fd,内核都有分配一个epitem结构,并且不允许重复分配,所以要查找该fd是否
  358. 已经存在.
  359. ep_find()即在红黑树中查找,时间复杂度为O(lgN).
  360. **/
  361. epi = ep_find(ep, tf.file, fd);
  362. error = -EINVAL;
  363. switch (op) {
  364. /**
  365. 首先关心添加
  366. **/
  367. case EPOLL_CTL_ADD:
  368. if (!epi) {
  369. /**
  370. 如果ep_find()没有找到相关的epitem,证明是第一次插入.
  371. 在此可以看到,内核总会关心POLLERR和POLLHUP.
  372. **/
  373. epds.events |= POLLERR | POLLHUP;
  374. /**
  375. 红黑树插入,ep_insert()的具体分析在下面
  376. **/
  377. error = ep_insert(ep, &epds, tf.file, fd, full_check);
  378. } else
  379. /**
  380. 如果找到了,则是重复添加
  381. **/
  382. error = -EEXIST;
  383. if (full_check) // TODO
  384. clear_tfile_check_list();
  385. break;
  386. case EPOLL_CTL_DEL:
  387. /**
  388. 删除
  389. **/
  390. if (epi)
  391. error = ep_remove(ep, epi);
  392. else
  393. error = -ENOENT;
  394. break;
  395. case EPOLL_CTL_MOD:
  396. /**
  397. 修改
  398. **/
  399. if (epi) {
  400. epds.events |= POLLERR | POLLHUP;
  401. error = ep_modify(ep, epi, &epds);
  402. } else
  403. error = -ENOENT;
  404. break;
  405. }
  406. if (tep != NULL)
  407. mutex_unlock(&tep->mtx);
  408. mutex_unlock(&ep->mtx); // 解锁
  409. error_tgt_fput:
  410. if (full_check)
  411. mutex_unlock(&epmutex);
  412. fdput(tf);
  413. error_fput:
  414. fdput(f);
  415. error_return:
  416. return error;
  417. }
  418. /*
  419. * Must be called with "mtx" held.
  420. */
  421. /**
  422. ep_insert()在epoll_ctl()中被调用,其工作是往epollfd的红黑树中添加一个待监听fd.
  423. **/
  424. static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
  425. struct file *tfile, int fd, int full_check)
  426. {
  427. int error, revents, pwake = 0;
  428. unsigned long flags;
  429. long user_watches;
  430. struct epitem *epi;
  431. struct ep_pqueue epq;
  432. /**
  433. struct ep_pqueue的定义如下:
  434. @ fs/eventpoll.c
  435. // Wrapper struct used by poll queueing
  436. struct ep_pqueue {
  437. poll_table pt;
  438. struct epitem *epi;
  439. };
  440. **/
  441. /**
  442. 查看是否达到当前用户的最大监听数
  443. **/
  444. user_watches = atomic_long_read(&ep->user->epoll_watches);
  445. if (unlikely(user_watches >= max_user_watches))
  446. return -ENOSPC;
  447. /**
  448. 从slab中分配一个epitem
  449. **/
  450. if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
  451. return -ENOMEM;
  452. /* Item initialization follow here ... */
  453. /**
  454. 相关数据成员的初始化
  455. **/
  456. INIT_LIST_HEAD(&epi->rdllink);
  457. INIT_LIST_HEAD(&epi->fllink);
  458. INIT_LIST_HEAD(&epi->pwqlist);
  459. epi->ep = ep;
  460. /**
  461. 在该epitem中保存待监听的fd和它的file结构.
  462. **/
  463. ep_set_ffd(&epi->ffd, tfile, fd);
  464. epi->event = *event;
  465. epi->nwait = 0;
  466. epi->next = EP_UNACTIVE_PTR;
  467. if (epi->event.events & EPOLLWAKEUP) {
  468. error = ep_create_wakeup_source(epi);
  469. if (error)
  470. goto error_create_wakeup_source;
  471. } else {
  472. RCU_INIT_POINTER(epi->ws, NULL);
  473. }
  474. /* Initialize the poll table using the queue callback */
  475. epq.epi = epi;
  476. /**
  477. 初始化一个poll_table,
  478. 其实质是指定调用poll_wait()时(不是epoll_wait)的回调函数,以及我们关心哪些event.
  479. ep_ptable_queue_proc()就是我们的回调函数,初值是所有event都关心.
  480. ep_ptable_queue_proc()的具体分析在下面.
  481. **/
  482. init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
  483. /*
  484. * Attach the item to the poll hooks and get current event bits.
  485. * We can safely use the file* here because its usage count has
  486. * been increased by the caller of this function. Note that after
  487. * this operation completes, the poll callback can start hitting
  488. * the new item.
  489. */
  490. revents = ep_item_poll(epi, &epq.pt);
  491. /**
  492. ep_item_poll()的定义如下:
  493. @ fs/eventpoll.c
  494. static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
  495. {
  496. pt->_key = epi->event.events;
  497. return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
  498. }
  499. **/
  500. /**
  501. f_op->poll()一般来说只是个wrapper,它会调用真正的poll实现.
  502. 拿UDP的socket来举例,调用流程如下:
  503. f_op->poll(),sock_poll(),udp_poll(),datagram_poll(),sock_poll_wait(),
  504. 最后调用到上面指定的ep_ptable_queue_proc().
  505. 完成这一步,该epitem就跟这个socket关联起来了,当后者有状态变化时,会通过ep_poll_callback()
  506. 来通知.
  507. 所以,f_op->poll()做了两件事情:
  508. 1.将该epitem和这个待监听的fd关联起来;
  509. 2.查询这个待监听的fd是否已经有event已经ready了,有的话就将event返回.
  510. **/
  511. /*
  512. * We have to check if something went wrong during the poll wait queue
  513. * install process. Namely an allocation for a wait queue failed due
  514. * high memory pressure.
  515. */
  516. error = -ENOMEM;
  517. if (epi->nwait < 0)
  518. goto error_unregister;
  519. /* Add the current item to the list of active epoll hook for this file */
  520. /**
  521. 把每个文件和对应的epitem关联起来
  522. **/
  523. spin_lock(&tfile->f_lock);
  524. list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
  525. spin_unlock(&tfile->f_lock);
  526. /*
  527. * Add the current item to the RB tree. All RB tree operations are
  528. * protected by "mtx", and ep_insert() is called with "mtx" held.
  529. */
  530. /**
  531. 将epitem插入到eventpoll的红黑树中
  532. **/
  533. ep_rbtree_insert(ep, epi);
  534. /* now check if we've created too many backpaths */
  535. error = -EINVAL;
  536. if (full_check && reverse_path_check())
  537. goto error_remove_epi;
  538. /* We have to drop the new item inside our item list to keep track of it */
  539. spin_lock_irqsave(&ep->lock, flags); // TODO
  540. /* If the file is already "ready" we drop it inside the ready list */
  541. /**
  542. 在这里,如果待监听的fd已经有事件发生,就去处理一下
  543. **/
  544. if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
  545. /**
  546. 将当前的epitem加入到ready list中去
  547. **/
  548. list_add_tail(&epi->rdllink, &ep->rdllist);
  549. ep_pm_stay_awake(epi);
  550. /* Notify waiting tasks that events are available */
  551. /**
  552. 哪个进程在调用epoll_wait(),就唤醒它
  553. **/
  554. if (waitqueue_active(&ep->wq))
  555. wake_up_locked(&ep->wq);
  556. /**
  557. 先不通知对eventpoll进行poll的进程
  558. **/
  559. if (waitqueue_active(&ep->poll_wait))
  560. pwake++;
  561. }
  562. spin_unlock_irqrestore(&ep->lock, flags);
  563. atomic_long_inc(&ep->user->epoll_watches);
  564. /* We have to call this outside the lock */
  565. if (pwake)
  566. /**
  567. 安全地通知对eventpoll进行poll的进程
  568. **/
  569. ep_poll_safewake(&ep->poll_wait);
  570. return 0;
  571. error_remove_epi:
  572. spin_lock(&tfile->f_lock);
  573. list_del_rcu(&epi->fllink);
  574. spin_unlock(&tfile->f_lock);
  575. rb_erase(&epi->rbn, &ep->rbr);
  576. error_unregister:
  577. ep_unregister_pollwait(ep, epi);
  578. /*
  579. * We need to do this because an event could have been arrived on some
  580. * allocated wait queue. Note that we don't care about the ep->ovflist
  581. * list, since that is used/cleaned only inside a section bound by "mtx".
  582. * And ep_insert() is called with "mtx" held.
  583. */
  584. spin_lock_irqsave(&ep->lock, flags);
  585. if (ep_is_linked(&epi->rdllink))
  586. list_del_init(&epi->rdllink);
  587. spin_unlock_irqrestore(&ep->lock, flags);
  588. wakeup_source_unregister(ep_wakeup_source(epi));
  589. error_create_wakeup_source:
  590. kmem_cache_free(epi_cache, epi);
  591. return error;
  592. }
  593. /*
  594. * This is the callback that is used to add our wait queue to the
  595. * target file wakeup lists.
  596. */
597. /**
598. This function is invoked during the call to f_op->poll().
599. When epoll actively polls a monitored fd, it links the epitem to that fd.
600. The link is established through a wait queue.
601. **/
  602. static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
  603. poll_table *pt)
  604. {
  605. struct epitem *epi = ep_item_from_epqueue(pt);
  606. struct eppoll_entry *pwq;
  607. /**
  608. @ fs/eventpoll.c
  609. // Wait structure used by the poll hooks
  610. struct eppoll_entry {
  611. // List header used to link this structure to the "struct epitem"
  612. struct list_head llink;
  613. // The "base" pointer is set to the container "struct epitem"
  614. struct epitem *base;
  615. // Wait queue item that will be linked to the target file wait
  616. // queue head.
  617. wait_queue_t wait;
  618. // The wait queue head that linked the "wait" wait queue item
  619. wait_queue_head_t *whead;
  620. };
  621. **/
  622. if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
623. /**
624. Initialize the wait queue entry, registering ep_poll_callback() as the wakeup callback.
625. When the monitored fd changes state, i.e. the queue head is woken up, that callback is invoked.
626. **/
627. init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); // ep_poll_callback() is analyzed below
  628. pwq->whead = whead;
  629. pwq->base = epi;
  630. add_wait_queue(whead, &pwq->wait);
  631. list_add_tail(&pwq->llink, &epi->pwqlist);
  632. epi->nwait++;
  633. } else {
  634. /* We have to signal that an error occurred */
  635. epi->nwait = -1;
  636. }
  637. }
  638. /*
  639. * This is the callback that is passed to the wait queue wakeup
  640. * mechanism. It is called by the stored file descriptors when they
  641. * have events to report.
  642. */
643. /**
644. This is a key callback.
645. It is invoked whenever a monitored fd changes state.
646. The "key" parameter points to the events.
647. **/
  648. static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
  649. {
  650. int pwake = 0;
  651. unsigned long flags;
652. struct epitem *epi = ep_item_from_wait(wait); // recover the epitem from its wait queue entry
  653. struct eventpoll *ep = epi->ep;
  654. spin_lock_irqsave(&ep->lock, flags);
  655. /*
  656. * If the event mask does not contain any poll(2) event, we consider the
  657. * descriptor to be disabled. This condition is likely the effect of the
  658. * EPOLLONESHOT bit that disables the descriptor when an event is received,
  659. * until the next EPOLL_CTL_MOD will be issued.
  660. */
  661. if (!(epi->event.events & ~EP_PRIVATE_BITS))
  662. goto out_unlock;
  663. /*
  664. * Check the events coming with the callback. At this stage, not
  665. * every device reports the events in the "key" parameter of the
  666. * callback. We need to be able to handle both cases here, hence the
  667. * test for "key" != NULL before the event match test.
  668. */
669. /**
670. None of the events we care about occurred
671. **/
  672. if (key && !((unsigned long) key & epi->event.events))
  673. goto out_unlock;
  674. /*
  675. * If we are transferring events to userspace, we can hold no locks
  676. * (because we're accessing user memory, and because of linux f_op->poll()
  677. * semantics). All the events that happen during that period of time are
  678. * chained in ep->ovflist and requeued later on.
  679. */
680. /**
681. If epoll_wait() has already returned by the time this callback runs,
682. i.e. the application is already looping over the fetched events,
683. the kernel saves the epitems whose state changes right now on a separate list
684. and returns them to the user on the next epoll_wait(). That separate list is ovflist.
685. */
  686. if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
  687. if (epi->next == EP_UNACTIVE_PTR) {
  688. epi->next = ep->ovflist;
  689. ep->ovflist = epi;
  690. if (epi->ws) {
  691. /*
  692. * Activate ep->ws since epi->ws may get
  693. * deactivated at any time.
  694. */
  695. __pm_stay_awake(ep->ws);
  696. }
  697. }
  698. goto out_unlock;
  699. }
  700. /* If this file is already in the ready list we exit soon */
701. /**
702. Add the current epitem to the ready list
703. **/
  704. if (!ep_is_linked(&epi->rdllink)) {
  705. list_add_tail(&epi->rdllink, &ep->rdllist);
  706. ep_pm_stay_awake_rcu(epi);
  707. }
  708. /*
  709. * Wake up ( if active ) both the eventpoll wait list and the ->poll()
  710. * wait list.
  711. */
712. /**
713. Wake up the process blocked in epoll_wait()
714. **/
715. if (waitqueue_active(&ep->wq))
716. wake_up_locked(&ep->wq);
717. /**
718. Do not notify processes polling the eventpoll file yet
719. **/
  720. if (waitqueue_active(&ep->poll_wait))
  721. pwake++;
  722. out_unlock:
  723. spin_unlock_irqrestore(&ep->lock, flags);
  724. /* We have to call this outside the lock */
  725. if (pwake)
726. /**
727. Safely notify processes polling the eventpoll file
728. **/
  729. ep_poll_safewake(&ep->poll_wait);
  730. if ((unsigned long)key & POLLFREE) {
  731. /*
  732. * If we race with ep_remove_wait_queue() it can miss
  733. * ->whead = NULL and do another remove_wait_queue() after
  734. * us, so we can't use __remove_wait_queue().
  735. */
  736. list_del_init(&wait->task_list);
  737. /*
  738. * ->whead != NULL protects us from the race with ep_free()
  739. * or ep_remove(), ep_remove_wait_queue() takes whead->lock
  740. * held by the caller. Once we nullify it, nothing protects
  741. * ep/epi or even wait.
  742. */
  743. smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
  744. }
  745. return 1;
  746. }
  747. /*
  748. * Implement the event wait interface for the eventpoll file. It is the kernel
  749. * part of the user space epoll_wait(2).
  750. */
  751. SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
  752. int, maxevents, int, timeout)
  753. {
  754. int error;
  755. struct fd f;
  756. struct eventpoll *ep;
  757. /* The maximum number of event must be greater than zero */
  758. if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
  759. return -EINVAL;
  760. /* Verify that the area passed by the user is writeable */
  761. /**
  762. 内核要验证这一段用户空间的内存是不是有效的,可写的.
  763. **/
  764. if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
  765. return -EFAULT;
  766. /* Get the "struct file *" for the eventpoll file */
  767. /**
  768. 获取epollfd的file结构
  769. **/
  770. f = fdget(epfd);
  771. if (!f.file)
  772. return -EBADF;
  773. /*
  774. * We have to check that the file structure underneath the fd
  775. * the user passed to us _is_ an eventpoll file.
  776. */
  777. error = -EINVAL;
  778. /**
  779. 检查它是不是一个真正的epollfd
  780. **/
  781. if (!is_file_epoll(f.file))
  782. goto error_fput;
  783. /*
  784. * At this point it is safe to assume that the "private_data" contains
  785. * our own data structure.
  786. */
  787. /**
  788. 获取eventpoll结构
  789. **/
  790. ep = f.file->private_data;
  791. /* Time to fish for events ... */
792. /**
793. Sleep, waiting for events to arrive.
794. ep_poll() is analyzed in detail below.
795. **/
  796. error = ep_poll(ep, events, maxevents, timeout);
  797. error_fput:
  798. fdput(f);
  799. return error;
  800. }
  801. /**
  802. * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  803. * event buffer.
  804. *
  805. * @ep: Pointer to the eventpoll context.
  806. * @events: Pointer to the userspace buffer where the ready events should be
  807. * stored.
  808. * @maxevents: Size (in terms of number of events) of the caller event buffer.
  809. * @timeout: Maximum timeout for the ready events fetch operation, in
  810. * milliseconds. If the @timeout is zero, the function will not block,
  811. * while if the @timeout is less than zero, the function will block
  812. * until at least one event has been retrieved (or an error
  813. * occurred).
  814. *
  815. * Returns: Returns the number of ready events which have been fetched, or an
  816. * error code, in case of error.
  817. */
818. /**
819. The process that called epoll_wait() goes to sleep inside this function.
820. **/
  821. static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
  822. int maxevents, long timeout)
  823. {
  824. int res = 0, eavail, timed_out = 0;
  825. unsigned long flags;
  826. long slack = 0;
  827. wait_queue_t wait;
  828. ktime_t expires, *to = NULL;
  829. if (timeout > 0) {
830. /**
831. Compute the sleep duration
832. **/
  833. struct timespec end_time = ep_set_mstimeout(timeout);
  834. slack = select_estimate_accuracy(&end_time);
  835. to = &expires;
  836. *to = timespec_to_ktime(end_time);
  837. } else if (timeout == 0) {
838. /**
839. Already timed out: check the ready list directly
840. **/
  841. /*
  842. * Avoid the unnecessary trip to the wait queue loop, if the
  843. * caller specified a non blocking operation.
  844. */
  845. timed_out = 1;
  846. spin_lock_irqsave(&ep->lock, flags);
  847. goto check_events;
  848. }
  849. fetch_events:
  850. spin_lock_irqsave(&ep->lock, flags);
851. /**
852. No events are available, i.e. both the ready list and ovflist are empty
853. **/
  854. if (!ep_events_available(ep)) {
  855. /*
  856. * We don't have any available event to return to the caller.
  857. * We need to sleep here, and we will be wake up by
  858. * ep_poll_callback() when events will become available.
  859. */
860. /**
861. Initialize a wait queue entry; current is the current process.
862. Then add that entry to ep's wait queue, i.e. the current process puts itself on the wait queue.
863. **/
  864. init_waitqueue_entry(&wait, current);
  865. __add_wait_queue_exclusive(&ep->wq, &wait);
  866. for (;;) {
  867. /*
  868. * We don't want to sleep if the ep_poll_callback() sends us
  869. * a wakeup in between. That's why we set the task state
  870. * to TASK_INTERRUPTIBLE before doing the checks.
  871. */
872. /**
873. Mark the current process so that, while sleeping, it can be woken by a signal.
874. This only sets the state; the process is not asleep yet.
875. **/
876. set_current_state(TASK_INTERRUPTIBLE);
877. /**
878. If the ready list already has members, or we have timed out, do not go to sleep
879. **/
880. if (ep_events_available(ep) || timed_out)
881. break;
882. /**
883. If a signal is pending, do not go to sleep
884. **/
  885. if (signal_pending(current)) {
  886. res = -EINTR;
  887. break;
  888. }
  889. spin_unlock_irqrestore(&ep->lock, flags);
890. /**
891. Suspend the current process until it is woken up or times out
892. **/
  893. if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
  894. timed_out = 1;
  895. spin_lock_irqsave(&ep->lock, flags);
  896. }
897. __remove_wait_queue(&ep->wq, &wait); // remove the current process from this epollfd's wait queue
898. __set_current_state(TASK_RUNNING); // mark the current process runnable again
  899. }
  900. check_events:
  901. /* Is it worth to try to dig for events ? */
  902. eavail = ep_events_available(ep);
  903. spin_unlock_irqrestore(&ep->lock, flags);
  904. /*
  905. * Try to transfer events to user space. In case we get 0 events and
  906. * there's still timeout left over, we go trying again in search of
  907. * more luck.
  908. */
909. /**
910. If everything is fine and events occurred, copy the data to user space
911. **/
912. // ep_send_events() is analyzed below
  913. if (!res && eavail &&
  914. !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
  915. goto fetch_events;
  916. return res;
  917. }
  918. static int ep_send_events(struct eventpoll *ep,
  919. struct epoll_event __user *events, int maxevents)
  920. {
  921. struct ep_send_events_data esed;
  922. /**
  923. @ fs/eventpoll.c
  924. // Used by the ep_send_events() function as callback private data
  925. struct ep_send_events_data {
  926. int maxevents;
  927. struct epoll_event __user *events;
  928. };
  929. **/
  930. esed.maxevents = maxevents;
  931. esed.events = events;
932. // ep_scan_ready_list() is analyzed below
  933. return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
  934. }
  935. /**
  936. * ep_scan_ready_list - Scans the ready list in a way that makes possible for
  937. * the scan code, to call f_op->poll(). Also allows for
  938. * O(NumReady) performance.
  939. *
  940. * @ep: Pointer to the epoll private data structure.
  941. * @sproc: Pointer to the scan callback.
  942. * @priv: Private opaque data passed to the @sproc callback.
  943. * @depth: The current depth of recursive f_op->poll calls.
  944. * @ep_locked: caller already holds ep->mtx
  945. *
  946. * Returns: The same integer error code returned by the @sproc callback.
  947. */
  948. static int ep_scan_ready_list(struct eventpoll *ep,
  949. int (*sproc)(struct eventpoll *,
  950. struct list_head *, void *),
  951. void *priv, int depth, bool ep_locked)
  952. {
  953. int error, pwake = 0;
  954. unsigned long flags;
  955. struct epitem *epi, *nepi;
  956. LIST_HEAD(txlist);
  957. /*
  958. * We need to lock this because we could be hit by
  959. * eventpoll_release_file() and epoll_ctl().
  960. */
  961. if (!ep_locked)
  962. mutex_lock_nested(&ep->mtx, depth);
  963. /*
  964. * Steal the ready list, and re-init the original one to the
  965. * empty list. Also, set ep->ovflist to NULL so that events
  966. * happening while looping w/out locks, are not lost. We cannot
  967. * have the poll callback to queue directly on ep->rdllist,
  968. * because we want the "sproc" callback to be able to do it
  969. * in a lockless way.
  970. */
  971. spin_lock_irqsave(&ep->lock, flags);
972. /**
973. Move the epitems on the ready list (i.e. those whose monitored events changed state) onto txlist,
974. and empty the ready list.
975. **/
  976. list_splice_init(&ep->rdllist, &txlist);
977. /**
978. Change the value of ovflist.
979. As ep_poll_callback() above shows, when ovflist != EP_UNACTIVE_PTR, an epitem whose wait queue
980. entry is activated goes onto ep->ovflist instead of ep->rdllist.
981. So this prevents epitems with newly arrived state changes from being added to the ready list here.
982. **/
  983. ep->ovflist = NULL;
  984. spin_unlock_irqrestore(&ep->lock, flags);
  985. /*
  986. * Now call the callback function.
  987. */
988. /**
989. Invoke the scan function on txlist.
990. That scan function is ep_send_events_proc; it is analyzed below.
991. **/
  992. error = (*sproc)(ep, &txlist, priv);
  993. spin_lock_irqsave(&ep->lock, flags);
  994. /*
  995. * During the time we spent inside the "sproc" callback, some
  996. * other events might have been queued by the poll callback.
  997. * We re-insert them inside the main ready-list here.
  998. */
999. /**
1000. While sproc() was running, new events may have occurred (and been added to ovflist);
1001. walk those epitems and insert them into the ready list.
1002. **/
  1003. for (nepi = ep->ovflist; (epi = nepi) != NULL;
  1004. nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
  1005. /**
  1006. @ fs/eventpoll.c
  1007. #define EP_UNACTIVE_PTR ((void *) -1L)
  1008. **/
  1009. /*
  1010. * We need to check if the item is already in the list.
  1011. * During the "sproc" callback execution time, items are
  1012. * queued into ->ovflist but the "txlist" might already
  1013. * contain them, and the list_splice() below takes care of them.
  1014. */
1015. /**
1016. epitem not on the ready list? Insert it!
1017. **/
  1018. if (!ep_is_linked(&epi->rdllink)) {
  1019. list_add_tail(&epi->rdllink, &ep->rdllist);
  1020. ep_pm_stay_awake(epi);
  1021. }
  1022. }
  1023. /*
  1024. * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
  1025. * releasing the lock, events will be queued in the normal way inside
  1026. * ep->rdllist.
  1027. */
1028. /**
1029. Restore ovflist to its normal state
1030. **/
  1031. ep->ovflist = EP_UNACTIVE_PTR;
  1032. /*
  1033. * Quickly re-inject items left on "txlist".
  1034. */
1035. /**
1036. Re-insert the epitems that were not fully processed last round into the ready list
1037. **/
  1038. list_splice(&txlist, &ep->rdllist);
  1039. __pm_relax(ep->ws);
1040. /**
1041. If the ready list is not empty, wake up
1042. **/
  1043. if (!list_empty(&ep->rdllist)) {
  1044. /*
  1045. * Wake up (if active) both the eventpoll wait list and
  1046. * the ->poll() wait list (delayed after we release the lock).
  1047. */
  1048. if (waitqueue_active(&ep->wq))
  1049. wake_up_locked(&ep->wq);
  1050. if (waitqueue_active(&ep->poll_wait))
  1051. pwake++;
  1052. }
  1053. spin_unlock_irqrestore(&ep->lock, flags);
  1054. if (!ep_locked)
  1055. mutex_unlock(&ep->mtx);
  1056. /* We have to call this outside the lock */
  1057. if (pwake)
  1058. ep_poll_safewake(&ep->poll_wait);
  1059. return error;
  1060. }
1061. /**
1062. This function is invoked as a callback from ep_scan_ready_list().
1063. head is a list head linking the epitems that are already ready.
1064. This list is not the eventpoll's ready list, but the txlist from the function above.
1065. **/
  1066. static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
  1067. void *priv)
  1068. {
  1069. struct ep_send_events_data *esed = priv;
  1070. int eventcnt;
  1071. unsigned int revents;
  1072. struct epitem *epi;
  1073. struct epoll_event __user *uevent;
  1074. struct wakeup_source *ws;
  1075. poll_table pt;
  1076. init_poll_funcptr(&pt, NULL);
  1077. /*
  1078. * We can loop without lock because we are passed a task private list.
  1079. * Items cannot vanish during the loop because ep_scan_ready_list() is
  1080. * holding "mtx" during this call.
  1081. */
1082. /**
1083. Walk the entire list
1084. **/
  1085. for (eventcnt = 0, uevent = esed->events;
  1086. !list_empty(head) && eventcnt < esed->maxevents;) {
1087. /**
1088. Take the first node
1089. **/
  1090. epi = list_first_entry(head, struct epitem, rdllink);
  1091. /*
  1092. * Activate ep->ws before deactivating epi->ws to prevent
  1093. * triggering auto-suspend here (in case we reactive epi->ws
  1094. * below).
  1095. *
  1096. * This could be rearranged to delay the deactivation of epi->ws
  1097. * instead, but then epi->ws would temporarily be out of sync
  1098. * with ep_is_linked().
  1099. */
1100. // activate ep->ws before deactivating epi->ws (see the comment above)
  1101. ws = ep_wakeup_source(epi);
  1102. if (ws) {
  1103. if (ws->active)
  1104. __pm_stay_awake(ep->ws);
  1105. __pm_relax(ws);
  1106. }
1107. /**
1108. Remove the node from the ready list
1109. **/
1110. list_del_init(&epi->rdllink);
1111. /**
1112. Get the mask of ready events
1113. **/
1114. revents = ep_item_poll(epi, &pt);
1115. /**
1116. ep_item_poll() is analyzed above, in ep_insert().
1117. **/
  1118. /*
  1119. * If the event mask intersect the caller-requested one,
  1120. * deliver the event to userspace. Again, ep_scan_ready_list()
  1121. * is holding "mtx", so no operations coming from userspace
  1122. * can change the item.
  1123. */
  1124. if (revents) {
1125. /**
1126. Copy the ready events and the user-supplied data to user space
1127. **/
  1128. if (__put_user(revents, &uevent->events) ||
  1129. __put_user(epi->event.data, &uevent->data)) {
  1130. list_add(&epi->rdllink, head);
  1131. ep_pm_stay_awake(epi);
  1132. return eventcnt ? eventcnt : -EFAULT;
  1133. }
  1134. eventcnt++;
  1135. uevent++;
  1136. if (epi->event.events & EPOLLONESHOT)
  1137. epi->event.events &= EP_PRIVATE_BITS;
  1138. else if (!(epi->event.events & EPOLLET)) {
1139. /**
1140. The difference between edge triggering (ET) and level triggering (LT):
1141. ET: a ready epitem is not re-added to the ready list unless the fd changes state again and ep_poll_callback() is invoked.
1142. LT: the epitem is re-added to the ready list whether or not valid events or data remain, so the next epoll_wait()
1143. returns immediately and notifies user space. If the monitored fd really has no events or data, epoll_wait() returns 0 for it.
1144. **/
  1145. /*
  1146. * If this file has been added with Level
  1147. * Trigger mode, we need to insert back inside
  1148. * the ready list, so that the next call to
  1149. * epoll_wait() will check again the events
  1150. * availability. At this point, no one can insert
  1151. * into ep->rdllist besides us. The epoll_ctl()
  1152. * callers are locked out by
  1153. * ep_scan_ready_list() holding "mtx" and the
  1154. * poll callback will queue them in ep->ovflist.
  1155. */
  1156. list_add_tail(&epi->rdllink, &ep->rdllist);
  1157. ep_pm_stay_awake(epi);
  1158. }
  1159. }
  1160. }
  1161. return eventcnt;
  1162. }
1163. /**
1164. This function is called when the epollfd is closed; its job is to release resources.
1165. **/
  1166. static void ep_free(struct eventpoll *ep)
  1167. {
  1168. struct rb_node *rbp;
  1169. struct epitem *epi;
  1170. /* We need to release all tasks waiting for these file */
  1171. if (waitqueue_active(&ep->poll_wait))
  1172. ep_poll_safewake(&ep->poll_wait);
  1173. /*
  1174. * We need to lock this because we could be hit by
  1175. * eventpoll_release_file() while we're freeing the "struct eventpoll".
  1176. * We do not need to hold "ep->mtx" here because the epoll file
  1177. * is on the way to be removed and no one has references to it
  1178. * anymore. The only hit might come from eventpoll_release_file() but
  1179. * holding "epmutex" is sufficient here.
  1180. */
  1181. mutex_lock(&epmutex);
  1182. /*
  1183. * Walks through the whole tree by unregistering poll callbacks.
  1184. */
  1185. for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
  1186. epi = rb_entry(rbp, struct epitem, rbn);
  1187. ep_unregister_pollwait(ep, epi);
  1188. cond_resched();
  1189. }
  1190. /*
  1191. * Walks through the whole tree by freeing each "struct epitem". At this
  1192. * point we are sure no poll callbacks will be lingering around, and also by
  1193. * holding "epmutex" we can be sure that no file cleanup code will hit
  1194. * us during this operation. So we can avoid the lock on "ep->lock".
  1195. * We do not need to lock ep->mtx, either, we only do it to prevent
  1196. * a lockdep warning.
  1197. */
  1198. mutex_lock(&ep->mtx);
1199. /**
1200. The monitored fds that were added via epoll_ctl() are removed from the tree here.
1201. **/
  1202. while ((rbp = rb_first(&ep->rbr)) != NULL) {
  1203. epi = rb_entry(rbp, struct epitem, rbn);
  1204. ep_remove(ep, epi);
  1205. cond_resched();
  1206. }
  1207. mutex_unlock(&ep->mtx);
  1208. mutex_unlock(&epmutex);
  1209. mutex_destroy(&ep->mtx);
  1210. free_uid(ep->user);
  1211. wakeup_source_unregister(ep->ws);
  1212. kfree(ep);
  1213. }