Apr 18, 2013

golang - Profiling Go Programs 記錄(Part 1)

利用了 profiling 工具解決了一些 Go-HttpBench 的 bug 並做了些效能優化

以 ./gb -c 100 -n 200000 -k http://localhost/10k.dat 來比較版本 0.1.00.1.6 的結果(based on Go1.1beta2)

        Mem(RES)   Requests/s   Alloc(Rate)   GC-Overhead   Pause(Avg)   Pause(Max)
0.1.0       354m     12119.99     397.00M/s        25.40%      56.72ms     153.25ms
0.1.6        23m     16976.76      67.48M/s         2.50%       5.12ms      10.07ms
delta(%)    ↓93%         ↑40%          ↓83%        ↓22.9%         ↓91%         ↓93%

Response time   0.1.0   0.1.6
 50%                4       4
 66%                6       5
 75%                7       6
 80%                8       7
 90%               10       9
 95%               12      11
 98%               57      13
 99%               70      15
 100%             179     133

優化過程主要做了下列四項改變,以下將包含總結分五個章節來發佈

  1. 重複使用 Buffer
  2. 程式化停止 Timer
  3. 正常結束 Goroutine
  4. 快取計算結果
  5. 結論

1. 重複使用 Buffer

先在程式內加入 CPU Profiling 程式,分析有沒有明顯的效能瓶頸 (用的指令請參考附件 profilingtool.go)

func main() {
    StartCPUProfile()
    // ignored
    StopCPUProfile()
    PrintGCSummary()
}
$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat

This is GoHttpBench, Version 0.1.0, https://github.com/parkghost/gohttpbench
Author: Brandon Chen, Email: parkghost@gmail.com
Licensed under the Apache License, Version 2.0

Benchmarking localhost (be patient)
Completed 20000 requests
Completed 40000 requests
Completed 60000 requests
Completed 80000 requests
Completed 100000 requests
Completed 120000 requests
Completed 140000 requests
Completed 160000 requests
Completed 180000 requests
Completed 200000 requests
Finished 200000 requests

Server Software:        nginx/1.2.1
Server Hostname:        localhost
Server Port:            80

Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   19.00 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    10526.49 [#/sec] (mean)
Time per request:       9.500 [ms] (mean)
Time per request:       0.095 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     105264.94 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   12.79   5   164

Percentage of the requests served within a certain time (ms)
 50%     5
 66%     7
 75%     8
 80%     9
 90%     12
 95%     16
 98%     60
 99%     77
 100%    164 (longest request)

NumGC:77 Pause:41.17ms Pause(Avg):61.92ms Overhead:24.21% Alloc:48.57M Sys:614.81M Alloc(Rate):346.57M/s Histogram:108.41ms 124.72ms 124.72ms 

用 top 搜集 gb 相關系統資源使用狀況

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27047 brandon   20   0  597m  10m 2840 S   4.9  0.1   0:00.05 gb
27047 brandon   20   0 1054m 326m 2848 S 123.9  2.1   0:01.30 gb
27047 brandon   20   0 1343m 349m 2852 S 114.8  2.2   0:02.46 gb
27047 brandon   20   0 1343m 349m 2852 S 100.0  2.2   0:03.47 gb
27047 brandon   20   0 1343m 350m 2852 S 125.1  2.2   0:04.74 gb
27047 brandon   20   0 1343m 350m 2852 S 134.4  2.2   0:06.10 gb
27047 brandon   20   0 1343m 350m 2852 S 132.8  2.2   0:07.44 gb
27047 brandon   20   0 1343m 350m 2852 S 124.7  2.2   0:08.70 gb
27047 brandon   20   0 1343m 351m 2852 S  95.9  2.2   0:09.67 gb
27047 brandon   20   0 1487m 351m 2852 S  99.0  2.2   0:10.67 gb
27047 brandon   20   0 1487m 351m 2852 S 109.6  2.2   0:11.78 gb
27047 brandon   20   0 1487m 351m 2852 S  85.2  2.2   0:12.64 gb
27047 brandon   20   0 1488m 353m 2852 S 104.5  2.2   0:13.70 gb
27047 brandon   20   0 1488m 353m 2852 S 128.1  2.2   0:15.00 gb
27047 brandon   20   0 1488m 353m 2852 S 129.6  2.2   0:16.31 gb
27047 brandon   20   0 1488m 353m 2852 S 129.3  2.2   0:17.62 gb
27047 brandon   20   0 1488m 354m 2852 S 130.6  2.2   0:18.94 gb
27047 brandon   20   0 1488m 354m 2852 S 130.7  2.2   0:20.26 gb
27047 brandon   20   0 1634m 356m 2852 S 127.5  2.3   0:21.55 gb
27047 brandon   20   0 1634m 356m 2852 S 131.5  2.3   0:22.88 gb

用內建的 profiling 工具來找程式熱點

$ go tool pprof ./gb cpu-27047.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 1569 samples
     183  11.7%  11.7%      356  22.7% scanblock
      74   4.7%  16.4%       74   4.7% runtime.memmove
      60   3.8%  20.2%       60   3.8% runtime.casp
      59   3.8%  24.0%       59   3.8% runtime.xchg
      51   3.3%  27.2%       72   4.6% sweepspan
      50   3.2%  30.4%       50   3.2% runtime.memclr
      45   2.9%  33.3%       56   3.6% markonly
      39   2.5%  35.8%       66   4.2% flushptrbuf
      34   2.2%  37.9%       55   3.5% runtime.markallocated
      32   2.0%  40.0%       32   2.0% net/textproto.canonicalMIMEHeaderKey

scanblock method 佔用了 11% 的 cpu 資源,來查看看

(pprof) web scanblock
Total: 1569 samples
After focusing on 'scanblock': 356 samples of 1569 (22.7%)
Dropping nodes with <= 1 samples; edges with <= 0 abs(samples)
Loading web page file:////tmp/pprof27099.1.svg
Created new window in existing browser session.
(pprof)       

SVG File

按 scanblock 上方粗線追朔,看起來是 makeslice 建立許多 slice 使的 gc 很忙, 延粗線再往上找到 main package 的 main.(*HttpWorker).send 可能是問題來源

(pprof) list main.*.send
Total: 1569 samples
ROUTINE ====================== main.(*HttpWorker).send in /home/brandon/workspaces/go/src/bench/gb/http.go
     7    798 Total samples (flat / cumulative)
     .      .   49:         }
     .      .   50:     }
     .      .   51: 
     .      .   52: }
     .      .   53: 
---
     .      .   54: func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
     .      .   55: 
     .      4   56:     record := &Record{}
     .      .   57:     sw := &StopWatch{}
     .      .   58:     sw.Start()
     .      .   59: 
     .      .   60:     var contentSize int
     .      .   61: 
     .      .   62:     defer func() {
     .      .   63:         if r := recover(); r != nil {
     .      .   64:             record.Error = &ExceptionError{errors.New(fmt.Sprint(r))}
     .      .   65:         } else {
     .      .   66:             record.contentSize = contentSize
     .      .   67:             record.responseTime = sw.Elapsed
     .      .   68:         }
     .      .   69: 
     .      .   70:         if record.Error != nil {
     .      .   71:             TraceException(record.Error)
     .      .   72:         }
     .      .   73: 
     .      .   74:         executionResult <- record
     1      1   75:     }()
     .      .   76: 
     1    188   77:     resp, err := h.client.Do(request)
     .      .   78:     if err != nil {
     .     21   79:         record.Error = &ConnectError{err}
     .      .   80:         return
     .      .   81:     } else {
     .      3   82:         defer resp.Body.Close()
     .      .   83: 
     .      .   84:         if resp.StatusCode < 200 || resp.StatusCode > 300 {
     .      .   85:             record.Error = &ResponseError{err}
     .      .   86:             return
     .      .   87:         }
     .      .   88: 
     1    575   89:         body, err := ioutil.ReadAll(resp.Body)
     .      .   90: 
     .      .   91:         if err != nil {
     .      .   92:             record.Error = &ReceiveError{err}
     .      .   93:             return
     .      .   94:         }
     .      .   95:         contentSize = len(body)
     .      .   96: 
     4      4   97:         if contentSize != h.config.contentSize {
     .      .   98:             record.Error = &LengthError{invalidContnetSize}
     .      .   99:             return
     .      .  100:         }
     .      .  101: 
     .      .  102:     }
     .      .  103: 
     .      .  104:     sw.Stop()
     .      .  105: 
     .      2  106: }
---
     .      .  107: 
     .      .  108: func detectHost(config *Config) error {
     .      .  109:     var err error
     .      .  110:     defer func() {
     .      .  111:         if r := recover(); r != nil {
(pprof) 

ioutil.ReadAll(resp.Body) 內部佔用了最長的時間,可是沒看到有 make([]) 之類的語句

(pprof) list ioutil.ReadAll
(略)

(pprof) list ioutil.readAll
Total: 1569 samples
ROUTINE ====================== io/ioutil.readAll in /usr/lib/go/src/pkg/io/ioutil/ioutil.go
     4    574 Total samples (flat / cumulative)
     .      .   12:     "sort"
     .      .   13: )
     .      .   14: 
     .      .   15: // readAll reads from r until an error or EOF and returns the data it read
     .      .   16: // from the internal buffer allocated with a specified capacity.
---
     1      1   17: func readAll(r io.Reader, capacity int64) (b []byte, err error) {
     1     41   18:     buf := bytes.NewBuffer(make([]byte, 0, capacity))
     .      .   19:     // If the buffer overflows, we will get bytes.ErrTooLarge.
     .      .   20:     // Return that as an error. Any other panic remains.
     .      .   21:     defer func() {
     .      .   22:         e := recover()
     .      .   23:         if e == nil {
     .      .   24:             return
     .      .   25:         }
     .      .   26:         if panicErr, ok := e.(error); ok && panicErr == bytes.ErrTooLarge {
     .      .   27:             err = panicErr
     .      .   28:         } else {
     .      .   29:             panic(e)
     .      .   30:         }
     1      2   31:     }()
     1    527   32:     _, err = buf.ReadFrom(r)
     .      1   33:     return buf.Bytes(), err
     .      2   34: }
---
     .      .   35: 
     .      .   36: // ReadAll reads from r until an error or EOF and returns the data it read.
     .      .   37: // A successful call returns err == nil, not err == EOF. Because ReadAll is
     .      .   38: // defined to read from src until EOF, it does not treat an EOF from Read
     .      .   39: // as an error to be reported.
(pprof) list bytes.*ReadFrom
Total: 1569 samples
ROUTINE ====================== bytes.(*Buffer).ReadFrom in /usr/lib/go/src/pkg/bytes/buffer.go
     5    526 Total samples (flat / cumulative)
     .      .  145: 
     .      .  146: // ReadFrom reads data from r until EOF and appends it to the buffer, growing
     .      .  147: // the buffer as needed. The return value n is the number of bytes read. Any
     .      .  148: // error except io.EOF encountered during the read is also returned. If the
     .      .  149: // buffer becomes too large, ReadFrom will panic with ErrTooLarge.
---
     .      .  150: func (b *Buffer) ReadFrom(r io.Reader) (n int64, err error) {
     .      .  151:     b.lastRead = opInvalid
     .      .  152:     // If buffer is empty, reset to recover space.
     .      .  153:     if b.off >= len(b.buf) {
     1      2  154:         b.Truncate(0)
     .      .  155:     }
     .      .  156:     for {
     .      .  157:         if free := cap(b.buf) - len(b.buf); free < MinRead {
     .      .  158:             // not enough space at end
     .      .  159:             newBuf := b.buf
     .      .  160:             if b.off+free < MinRead {
     .      .  161:                 // not enough space using beginning of buffer;
     .      .  162:                 // double buffer capacity
     2    379  163:                 newBuf = makeSlice(2*cap(b.buf) + MinRead)
     .      .  164:             }
     .     13  165:             copy(newBuf, b.buf[b.off:])
     .      .  166:             b.buf = newBuf[:len(b.buf)-b.off]
     .      .  167:             b.off = 0
     .      .  168:         }
     1    131  169:         m, e := r.Read(b.buf[len(b.buf):cap(b.buf)])
     .      .  170:         b.buf = b.buf[0 : len(b.buf)+m]
     .      .  171:         n += int64(m)
     1      1  172:         if e == io.EOF {
     .      .  173:             break
     .      .  174:         }
     .      .  175:         if e != nil {
     .      .  176:             return n, e
     .      .  177:         }
     .      .  178:     }
     .      .  179:     return n, nil // err is EOF, so return nil explicitly
---
     .      .  180: }
     .      .  181: 
     .      .  182: // makeSlice allocates a slice of size n. If the allocation fails, it panics
     .      .  183: // with ErrTooLarge.
     .      .  184: func makeSlice(n int) []byte {

ioutil.readAll 的 L:18 可以分析問題是 main.(*HttpWorker).send 在每回呼叫 ioutil.ReadAll 都會動態產生一塊 Buffer 若在讀取資料的時候遇到 buffer size 不足,就再 makeSlice 產生更大的 buffer(bytes.ReadFrom:163)

為了避免一直創建 buffer 做了以下修改

type HttpWorker struct {
        // ignored
    readBuf   *bytes.Buffer
}

func NewHttpWorker(...ignored...) *HttpWorker {
    buf := bytes.NewBuffer(make([]byte, 0, config.contentSize+bytes.MinRead))
    return &HttpWorker{...ignored..., buf}
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    defer h.readBuf.Reset()

    var err error
    contentSize, err = h.readBuf.ReadFrom(resp.Body)

    // ignored
}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
(略)
Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   11.70 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    17094.59 [#/sec] (mean)
Time per request:       5.850 [ms] (mean)
Time per request:       0.058 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     170945.87 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   5.16    4   118

Percentage of the requests served within a certain time (ms)
 50%     4
 66%     5
 75%     6
 80%     7
 90%     9
 95%     10
 98%     12
 99%     14
 100%    118 (longest request)

NumGC:12 Pause:58.63ms Pause(Avg):40.91ms Overhead:3.96% Alloc:84.97M Sys:611.67M Alloc(Rate):77.23M/s Histogram:99.82ms 99.82ms 99.82ms
$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27989 brandon   20   0  753m 176m 2844 S  57.9  1.1   0:00.59 gb
27989 brandon   20   0 1012m 234m 2852 S 116.8  1.5   0:01.78 gb
27989 brandon   20   0 1058m 280m 2852 S 107.6  1.8   0:02.87 gb
27989 brandon   20   0 1105m 327m 2852 S 107.7  2.1   0:03.96 gb
27989 brandon   20   0 1194m 346m 2852 S 116.9  2.2   0:05.14 gb
27989 brandon   20   0 1195m 347m 2852 S 107.0  2.2   0:06.22 gb
27989 brandon   20   0 1195m 347m 2852 S 103.9  2.2   0:07.27 gb
27989 brandon   20   0 1275m 348m 2852 S 116.6  2.2   0:08.45 gb
27989 brandon   20   0 1277m 350m 2852 S 111.7  2.2   0:09.58 gb
27989 brandon   20   0 1277m 350m 2852 S 110.8  2.2   0:10.70 gb
27989 brandon   20   0 1349m 351m 2852 S 113.6  2.2   0:11.85 gb
27989 brandon   20   0 1359m 353m 2852 S 115.9  2.2   0:13.02 gb
$ go tool pprof ./gb cpu-27989.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 967 samples
      61   6.3%   6.3%       61   6.3% runtime.xchg
      44   4.6%  10.9%       44   4.6% runtime.memmove
      34   3.5%  14.4%       34   3.5% net/textproto.canonicalMIMEHeaderKey
      29   3.0%  17.4%       31   3.2% runtime.settype_flush
      27   2.8%  20.2%      227  23.5% runtime.mallocgc
      26   2.7%  22.9%       26   2.7% runtime.casp
      25   2.6%  25.4%       87   9.0% runtime.MCache_Alloc
      25   2.6%  28.0%       46   4.8% runtime.markallocated
      23   2.4%  30.4%       23   2.4% bytes.IndexByte
      23   2.4%  32.8%       92   9.5% selectgo

跟改變前相比 Alloc(Rate):346.57M/s 降到 Alloc(Rate):77.23M/s CPU 利用率也略有下降,相關 gc 停頓數值下降,throughput 也有較大的提升, 原先 top10 裡佔據 11.7% scanblock 也消失了

UPDATE

更省記憶體的改良辦法

const MaxBufferSize    = 8192

type HTTPWorker struct {
    //ignored
    discard   io.ReaderFrom
}

func NewHttpWorker(...ignored...) *HttpWorker {
    var buf []byte
    contentSize := context.GetInt(FieldContentSize)
    if contentSize < MaxBufferSize {
        buf = make([]byte, contentSize)
    } else {
        buf = make([]byte, MaxBufferSize)
    }

    return &HTTPWorker{
        //ignored
        &Discard{buf},
    }
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    contentSize, err = h.discard.ReadFrom(resp.Body)

    // ignored
}

type Discard struct {
    blackHole []byte
}

func (d *Discard) ReadFrom(r io.Reader) (n int64, err error) {
    readSize := 0
    for {
        readSize, err = r.Read(d.blackHole)
        n += int64(readSize)
        if err != nil {
            if err == io.EOF {
                return n, nil
            }
            return
        }
    }
}

附件 profiling 工具

No comments: