bufio - 《Golang 源码阅读笔记》

bufio.go
scan.go

看库名就知道，是带缓存区的 io

bufio.go

实现了两个结构体，分别是带 buffer 的 Reader 和带 buffer 的 Writer 源码如下

// src/bufio/bufio.go ---- line 30
// Reader implements buffering for an io.Reader object.
type Reader struct {
    buf          []byte
    rd           io.Reader // reader provided by the client
    r, w         int       // buf read and write positions
    err          error
    lastByte     int // last byte read for UnreadByte; -1 means invalid
    lastRuneSize int // size of last rune read for UnreadRune; -1 means invalid
}
// src/bufio/bufio.go ---- line 536
// Writer implements buffering for an io.Writer object.
// If an error occurs writing to a Writer, no more data will be
// accepted and all subsequent writes, and Flush, will return the error.
// After all data has been written, the client should call the
// Flush method to guarantee all data has been forwarded to
// the underlying io.Writer.
type Writer struct {
    err error
    buf []byte
    n   int
    wr  io.Writer
}

比较有趣的是 Reader 的 ReadSlice 方法，可惜只接受单 byte 作为参数，我曾经实现过一个接受 []byte 的相同功能的函数，早知道这里有，就直接从这里改进了。要注意的是这个函数返回的字节切片指向的底层数组是和 Reader 中 buffer 指向的底层数组相同的，意味着会被覆盖，所以一般使用 ReadBytes 来代替。

// src/bufio/bufio.go ---- line 320
// ReadSlice reads until the first occurrence of delim in the input,
// returning a slice pointing at the bytes in the buffer.
// The bytes stop being valid at the next read.
// If ReadSlice encounters an error before finding a delimiter,
// it returns all the data in the buffer and the error itself (often io.EOF).
// ReadSlice fails with error ErrBufferFull if the buffer fills without a delim.
// Because the data returned from ReadSlice will be overwritten
// by the next I/O operation, most clients should use
// ReadBytes or ReadString instead.
// ReadSlice returns err != nil if and only if line does not end in delim.
func (b *Reader) ReadSlice(delim byte) (line []byte, err error) {
    s := 0 // search start index
    for {
        // Search buffer.
        if i := bytes.IndexByte(b.buf[b.r+s:b.w], delim); i >= 0 {
            i += s
            line = b.buf[b.r : b.r+i+1]
            b.r += i + 1
            break
        }
        // Pending error?
        if b.err != nil {
            line = b.buf[b.r:b.w]
            b.r = b.w
            err = b.readErr()
            break
        }
        // Buffer full?
        if b.Buffered() >= len(b.buf) {
            b.r = b.w
            line = b.buf
            err = ErrBufferFull
            break
        }
        s = b.w - b.r // do not rescan area we scanned before
        b.fill() // buffer is not full
    }
    // Handle last byte, if any.
    if i := len(line) - 1; i >= 0 {
        b.lastByte = int(line[i])
        b.lastRuneSize = -1
    }
    return
}

大体都很简单，没有很难理解的东西。
还发现一些小瑕疵，结构体 Reader 的 Buffered 方法返回 b.w - b.r 这个值，而 Size 返回 len(b.buf) 值，分别是缓存区中缓存内容的长度和缓存区长度。但是在其他方法中，涉及到 b.w - b.r 和 len(b.buf) 时，有的是直接调用 Buffered 和 Size 方法，有的又显式写出来，规范不统一，看起来很奇怪。比如上述代码块的第 32 行和 Peek 方法中的几行。

还有一点需要注意，在往 Writer 写入数据时，如果写入内容不超过 buffer 长度的部分是不会自动 Flush 的。所以说每次写都会有 len(data) % len(w.buf) 的数据保留在 buffer 中没有 Flush 需要调用者显式 Flush

scan.go

看了好一会都没看懂这个 Scanner 是干嘛的，然后自己写了个测试代码，一下子就搞懂了，果然实践出真知。下面给出 Scanner 结构体和 Scan 方法。

// src/bufio/scan.go ---- line 14
// Scanner provides a convenient interface for reading data such as
// a file of newline-delimited lines of text. Successive calls to
// the Scan method will step through the 'tokens' of a file, skipping
// the bytes between the tokens. The specification of a token is
// defined by a split function of type SplitFunc; the default split
// function breaks the input into lines with line termination stripped. Split
// functions are defined in this package for scanning a file into
// lines, bytes, UTF-8-encoded runes, and space-delimited words. The
// client may instead provide a custom split function.
//
// Scanning stops unrecoverably at EOF, the first I/O error, or a token too
// large to fit in the buffer. When a scan stops, the reader may have
// advanced arbitrarily far past the last token. Programs that need more
// control over error handling or large tokens, or must run sequential scans
// on a reader, should use bufio.Reader instead.
//
type Scanner struct {
    r            io.Reader // The reader provided by the client.
    split        SplitFunc // The function to split the tokens.
    maxTokenSize int       // Maximum size of a token; modified by tests.
    token        []byte    // Last token returned by split.
    buf          []byte    // Buffer used as argument to split.
    start        int       // First non-processed byte in buf.
    end          int       // End of data in buf.
    err          error     // Sticky error.
    empties      int       // Count of successive empty tokens.
    scanCalled   bool      // Scan has been called; buffer is in use.
    done         bool      // Scan has finished.
}
// src/bufio/scan.go ---- line 125
// Scan advances the Scanner to the next token, which will then be
// available through the Bytes or Text method. It returns false when the
// scan stops, either by reaching the end of the input or an error.
// After Scan returns false, the Err method will return any error that
// occurred during scanning, except that if it was io.EOF, Err
// will return nil.
// Scan panics if the split function returns too many empty
// tokens without advancing the input. This is a common error mode for
// scanners.
func (s *Scanner) Scan() bool {
    if s.done {
        return false
    }
    s.scanCalled = true
    // Loop until we have a token.
    for {
        // See if we can get a token with what we already have.
        // If we've run out of data but have an error, give the split function
        // a chance to recover any remaining, possibly empty token.
        if s.end > s.start || s.err != nil {
            advance, token, err := s.split(s.buf[s.start:s.end], s.err != nil)
            if err != nil {
                if err == ErrFinalToken {
                    s.token = token
                    s.done = true
                    return true
                }
                s.setErr(err)
                return false
            }
            if !s.advance(advance) {
                return false
            }
            s.token = token
            if token != nil {
                if s.err == nil || advance > 0 {
                    s.empties = 0
                } else {
                    // Returning tokens not advancing input at EOF.
                    s.empties++
                    if s.empties > maxConsecutiveEmptyReads {
                        panic("bufio.Scan: too many empty tokens without progressing")
                    }
                }
                return true
            }
        }
        // We cannot generate a token with what we are holding.
        // If we've already hit EOF or an I/O error, we are done.
        if s.err != nil {
            // Shut it down.
            s.start = 0
            s.end = 0
            return false
        }
        // Must read more data.
        // First, shift data to beginning of buffer if there's lots of empty space
        // or space is needed.
        if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) {
            copy(s.buf, s.buf[s.start:s.end])
            s.end -= s.start
            s.start = 0
        }
        // Is the buffer full? If so, resize.
        if s.end == len(s.buf) {
            // Guarantee no overflow in the multiplication below.
            const maxInt = int(^uint(0) >> 1)
            if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
                s.setErr(ErrTooLong)
                return false
            }
            newSize := len(s.buf) * 2
            if newSize == 0 {
                newSize = startBufSize
            }
            if newSize > s.maxTokenSize {
                newSize = s.maxTokenSize
            }
            newBuf := make([]byte, newSize)
            copy(newBuf, s.buf[s.start:s.end])
            s.buf = newBuf
            s.end -= s.start
            s.start = 0
        }
        // Finally we can read some input. Make sure we don't get stuck with
        // a misbehaving Reader. Officially we don't need to do this, but let's
        // be extra careful: Scanner is for safe, simple jobs.
        for loop := 0; ; {
            n, err := s.r.Read(s.buf[s.end:len(s.buf)])
            s.end += n
            if err != nil {
                s.setErr(err)
                break
            }
            if n > 0 {
                s.empties = 0
                break
            }
            loop++
            if loop > maxConsecutiveEmptyReads {
                s.setErr(io.ErrNoProgress)
                break
            }
        }
    }
}

Scanner 支持自定义 split 方法。并且提供了内置的几个 split 方法分别是 ScanBytes, ScanRunes, ScanLines, ScanWords
需要注意的是调用过 Scan 方法之后就不能再调用 Buffer 方法来改变缓存区了，否则会 panic