Apr 18, 2013

golang - Profiling Go Programs 記錄(Part 2)

2. 程式化停止 Timer

$ go tool pprof ./gb cpu-27989.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 967 samples
      61   6.3%   6.3%       61   6.3% runtime.xchg
      44   4.6%  10.9%       44   4.6% runtime.memmove
      34   3.5%  14.4%       34   3.5% net/textproto.canonicalMIMEHeaderKey
      29   3.0%  17.4%       31   3.2% runtime.settype_flush
      27   2.8%  20.2%      227  23.5% runtime.mallocgc
      26   2.7%  22.9%       26   2.7% runtime.casp
      25   2.6%  25.4%       87   9.0% runtime.MCache_Alloc
      25   2.6%  28.0%       46   4.8% runtime.markallocated
      23   2.4%  30.4%       23   2.4% bytes.IndexByte
      23   2.4%  32.8%       92   9.5% selectgo
$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27989 brandon   20   0  753m 176m 2844 S  57.9  1.1   0:00.59 gb
27989 brandon   20   0 1012m 234m 2852 S 116.8  1.5   0:01.78 gb
27989 brandon   20   0 1058m 280m 2852 S 107.6  1.8   0:02.87 gb
27989 brandon   20   0 1105m 327m 2852 S 107.7  2.1   0:03.96 gb
27989 brandon   20   0 1194m 346m 2852 S 116.9  2.2   0:05.14 gb
27989 brandon   20   0 1195m 347m 2852 S 107.0  2.2   0:06.22 gb
27989 brandon   20   0 1195m 347m 2852 S 103.9  2.2   0:07.27 gb
27989 brandon   20   0 1275m 348m 2852 S 116.6  2.2   0:08.45 gb
27989 brandon   20   0 1277m 350m 2852 S 111.7  2.2   0:09.58 gb
27989 brandon   20   0 1277m 350m 2852 S 110.8  2.2   0:10.70 gb
27989 brandon   20   0 1349m 351m 2852 S 113.6  2.2   0:11.85 gb
27989 brandon   20   0 1359m 353m 2852 S 115.9  2.2   0:13.02 gb

這時的 cpu profiling 已經比較分散平均,看不太出明顯熱點,另外從 top 數據來看 350m 的記憶體用量有些高,改用 heap profiler 觀察看看是否有 memory leak 的狀況

在程式內加入 Mem Profiling 程式

func main() {
    SetMemProfileRate(1) //(請參考附件 profilingtool.go)
    // ignored
    GC()    
    DumpHeap()
    PrintGCSummary()
}
$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
(略)
Concurrency Level:      100
Time taken for tests:   25.33 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    7896.70 [#/sec] (mean)
Time per request:       12.664 [ms] (mean)
Time per request:       0.127 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     78966.96 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   76.91   6   1866

Percentage of the requests served within a certain time (ms)
 50%     6
 66%     8
 75%     10
 80%     11
 90%     14
 95%     16
 98%     18
 99%     21
 100%    1866 (longest request)

NumGC:13 Pause:731.86ms Pause(Avg):604.31ms Overhead:27.92% Alloc:46.39M Sys:696.17M Alloc(Rate):33.91M/s Histogram:1.85s 1.85s 1.85s

這裡用設定的 sample rate 比較小,搜集的資料比較完整,相對也比較多效能消耗,這裡的平均停頓時間就高了許多(604.31ms vs 40.91ms)

brandon@brandon-laptop:~/workspaces/go/src/bench/gb$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
29236 brandon   20   0  487m  48m 2944 S  48.1  0.3   0:00.49 gb
29236 brandon   20   0  595m 156m 2944 S 102.1  1.0   0:01.52 gb
29236 brandon   20   0  808m 233m 2956 S 123.9  1.5   0:02.77 gb
29236 brandon   20   0  916m 278m 2956 S 137.8  1.8   0:04.16 gb
29236 brandon   20   0  960m 322m 2956 S 137.7  2.0   0:05.55 gb
29236 brandon   20   0 1004m 366m 2956 S 135.7  2.3   0:06.92 gb
29236 brandon   20   0 1047m 410m 2956 S 136.7  2.6   0:08.30 gb
29236 brandon   20   0 1072m 434m 2956 S 153.2  2.7   0:09.85 gb
29236 brandon   20   0 1072m 434m 2956 S 171.4  2.7   0:11.58 gb
29236 brandon   20   0 1072m 434m 2956 S 141.8  2.7   0:13.01 gb
29236 brandon   20   0 1072m 435m 2956 S 137.7  2.8   0:14.40 gb
29236 brandon   20   0 1072m 436m 2956 S 132.6  2.8   0:15.74 gb
29236 brandon   20   0 1072m 436m 2956 S 135.7  2.8   0:17.11 gb
29236 brandon   20   0 1072m 436m 2956 S 135.3  2.8   0:18.48 gb
29236 brandon   20   0 1072m 436m 2956 S 168.4  2.8   0:20.18 gb
29236 brandon   20   0 1072m 436m 2956 S 153.7  2.8   0:21.73 gb
29236 brandon   20   0 1073m 438m 2956 S 126.6  2.8   0:23.01 gb
29236 brandon   20   0 1073m 438m 2956 S 130.8  2.8   0:24.33 gb
29236 brandon   20   0 1073m 438m 2956 S 132.8  2.8   0:25.67 gb
29236 brandon   20   0 1073m 438m 2956 S 136.6  2.8   0:27.05 gb
29236 brandon   20   0 1073m 438m 2956 S 156.5  2.8   0:28.63 gb
29236 brandon   20   0 1073m 438m 2956 S 143.8  2.8   0:30.08 gb
29236 brandon   20   0 1073m 439m 2956 S 126.5  2.8   0:31.36 gb
29236 brandon   20   0 1076m 441m 2956 S 134.6  2.8   0:32.72 gb
29236 brandon   20   0 1076m 441m 2956 S 140.6  2.8   0:34.14 gb
29236 brandon   20   0 1076m 441m 2956 S 153.5  2.8   0:35.69 gb
29236 brandon   20   0 1076m 441m 2956 S 139.6  2.8   0:37.10 gb
29236 brandon   20   0 1076m 441m 2956 S 131.7  2.8   0:38.43 gb
$ go tool pprof ./gb heap-29236-1.pprof
(pprof) top10
Total: 42.4 MB
    33.6  79.1%  79.1%     35.6  83.9% time.NewTimer
     2.0   4.8%  83.9%      2.0   4.8% addtimer
     1.5   3.6%  87.5%      1.5   3.6% main.NewBenchmark
     1.5   3.6%  91.1%      2.9   6.7% main.(*Benchmark).Run
     1.2   2.8%  93.9%      1.2   2.9% main.NewHttpWorker
     0.6   1.5%  95.4%      0.6   1.5% newdefer
     0.6   1.4%  96.8%      0.6   1.4% bufio.NewReaderSize
     0.4   1.0%  97.8%      0.4   1.0% bufio.NewWriterSize
     0.3   0.7%  98.5%      1.6   3.8% net/http.(*Transport).dialConn
     0.1   0.3%  98.8%      0.1   0.3% runtime.malg

程式要結束前還有 42.4 MB 沒被回收,有 memory leak

(pprof) web time.NewTimer

SVG File

從圖中來看 timer.NewTimer 是 main.(*HttpWorker).Run 裡面的 time.After(d timer.Duration) 建立的

(pprof) list main.\(\*HttpWorker\).Run
Total: 42.4 MB
ROUTINE ====================== main.(*HttpWorker).Run in /home/brandon/workspaces/go/src/bench/gb/http.go
   0.0   35.6 Total MB (flat / cumulative)
     .      .   29: 
     .      .   30: var (
     .      .   31:     invalidContnetSize = errors.New("invalid content size")
     .      .   32: )
     .      .   33: 
---
     .      .   34: func (h *HttpWorker) Run() {
     .      .   35:     <-h.start
     .      .   36: 
     .      .   37:     for job := range h.jobs {
     .      .   38: 
   0.0    0.0   39:         executionResult := make(chan *Record)
     .      .   40: 
     .    0.0   41:         go h.send(job, executionResult)
     .      .   42: 
     .      .   43:         select {
     .      .   44:         case record := <-executionResult:
     .      .   45:             h.collector <- record
     .      .   46: 
     .   35.6   47:         case <-time.After(time.Duration(MAX_RESPONSE_TIMEOUT) * time.Second):
     .      .   48:             h.collector <- &Record{Error: errors.New("execution timeout")}
     .      .   49: 
     .      .   50:         case <-h.stop:
     .      .   51:             return
     .      .   52:         }
     .      .   53:     }
     .      .   54: 
     .      .   55: }
---
     .      .   56: 
     .      .   57: func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
     .      .   58: 
     .      .   59:     record := &Record{}
     .      .   60:     sw := &StopWatch{}

select 裡包 time.After(Duration)這種 timeout control用法,在官方文件裡有很多地方都有使用,追查過後才知道,這方法使用的 Timer 等時間過期就會被 GC 回收了,但這裡的狀況是需要對每組 http request/response 各別做 timeout control,會大量使用到 Timer 若要放給 GC 回收會有點浪費記憶體,也會額外增加 GC 掃描的時間

修正方法

func (h *HttpWorker) Run() {
    <-h.start

    for job := range h.jobs {

        timeout := time.NewTimer(time.Duration(MAX_RESPONSE_TIMEOUT) * time.Second)
        executionResult := make(chan *Record)

        go h.send(job, executionResult)

        select {
        case record := <-executionResult:
            h.collector <- record

        case <-timeout.C:
            h.collector <- &Record{Error: errors.New("execution timeout")}

        case <-h.stop:
            timeout.Stop()  
            return
        }
        timeout.Stop()      
    }

}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat 
$ go tool pprof ./gb heap-6438-1.pprof
(pprof) top10
Total: 8.6 MB
     1.5  17.8%  17.8%      3.0  35.3% main.(*Benchmark).Run
     1.5  17.8%  35.6%      1.5  17.8% main.(*Monitor).Run
     1.5  17.8%  53.4%      1.5  17.8% main.NewBenchmark
     1.5  17.5%  70.9%      1.5  17.5% net/textproto.MIMEHeader.Set
     1.0  11.7%  82.5%      1.0  11.7% bufio.NewReaderSize
     1.0  11.7%  94.2%      1.0  11.7% bufio.NewWriterSize
     0.5   5.8% 100.0%      0.5   5.8% newdefer
     0.0   0.0% 100.0%      0.5   5.8% bufio.(*Writer).Flush
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewReader
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewWriter

time.NewTimer 從清單中消失了,接下來還有 8.6 MB 的空間在程式結束前沒被回收,下一篇繼續來追蹤

golang - Profiling Go Programs 記錄(Part 1)

利用了 profiling 工具解決了一些 Go-HttpBench 的 bug 並做了些效能優化

以 ./gb -c 100 -n 200000 -k http://localhost/10k.dat 來比較版本 0.1.00.1.6 的結果(based on Go1.1beta2)

        Mem(RES)   Requests/s   Alloc(Rate)   GC-Overhead   Pause(Avg)   Pause(Max)
0.1.0       354m     12119.99     397.00M/s        25.40%      56.72ms     153.25ms
0.1.6        23m     16976.76      67.48M/s         2.50%       5.12ms      10.07ms
delta(%)    ↓93%         ↑40%          ↓83%        ↓22.9%         ↓91%         ↓93%

Response time   0.1.0   0.1.6
 50%                4       4
 66%                6       5
 75%                7       6
 80%                8       7
 90%               10       9
 95%               12      11
 98%               57      13
 99%               70      15
 100%             179     133

優化過程主要做了下列四項改變,以下將包含總結分五個章節來發佈

  1. 重複使用 Buffer
  2. 程式化停止 Timer
  3. 正常結束 Goroutine
  4. 快取計算結果
  5. 結論

1. 重複使用 Buffer

先在程式內加入 CPU Profiling 程式,分析有沒有明顯的效能瓶頸 (用的指令請參考附件 profilingtool.go)

func main() {
    StartCPUProfile()
    // ignored
    StopCPUProfile()
    PrintGCSummary()
}
$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat

This is GoHttpBench, Version 0.1.0, https://github.com/parkghost/gohttpbench
Author: Brandon Chen, Email: parkghost@gmail.com
Licensed under the Apache License, Version 2.0

Benchmarking localhost (be patient)
Completed 20000 requests
Completed 40000 requests
Completed 60000 requests
Completed 80000 requests
Completed 100000 requests
Completed 120000 requests
Completed 140000 requests
Completed 160000 requests
Completed 180000 requests
Completed 200000 requests
Finished 200000 requests

Server Software:        nginx/1.2.1
Server Hostname:        localhost
Server Port:            80

Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   19.00 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    10526.49 [#/sec] (mean)
Time per request:       9.500 [ms] (mean)
Time per request:       0.095 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     105264.94 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   12.79   5   164

Percentage of the requests served within a certain time (ms)
 50%     5
 66%     7
 75%     8
 80%     9
 90%     12
 95%     16
 98%     60
 99%     77
 100%    164 (longest request)

NumGC:77 Pause:41.17ms Pause(Avg):61.92ms Overhead:24.21% Alloc:48.57M Sys:614.81M Alloc(Rate):346.57M/s Histogram:108.41ms 124.72ms 124.72ms 

用 top 搜集 gb 相關系統資源使用狀況

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27047 brandon   20   0  597m  10m 2840 S   4.9  0.1   0:00.05 gb
27047 brandon   20   0 1054m 326m 2848 S 123.9  2.1   0:01.30 gb
27047 brandon   20   0 1343m 349m 2852 S 114.8  2.2   0:02.46 gb
27047 brandon   20   0 1343m 349m 2852 S 100.0  2.2   0:03.47 gb
27047 brandon   20   0 1343m 350m 2852 S 125.1  2.2   0:04.74 gb
27047 brandon   20   0 1343m 350m 2852 S 134.4  2.2   0:06.10 gb
27047 brandon   20   0 1343m 350m 2852 S 132.8  2.2   0:07.44 gb
27047 brandon   20   0 1343m 350m 2852 S 124.7  2.2   0:08.70 gb
27047 brandon   20   0 1343m 351m 2852 S  95.9  2.2   0:09.67 gb
27047 brandon   20   0 1487m 351m 2852 S  99.0  2.2   0:10.67 gb
27047 brandon   20   0 1487m 351m 2852 S 109.6  2.2   0:11.78 gb
27047 brandon   20   0 1487m 351m 2852 S  85.2  2.2   0:12.64 gb
27047 brandon   20   0 1488m 353m 2852 S 104.5  2.2   0:13.70 gb
27047 brandon   20   0 1488m 353m 2852 S 128.1  2.2   0:15.00 gb
27047 brandon   20   0 1488m 353m 2852 S 129.6  2.2   0:16.31 gb
27047 brandon   20   0 1488m 353m 2852 S 129.3  2.2   0:17.62 gb
27047 brandon   20   0 1488m 354m 2852 S 130.6  2.2   0:18.94 gb
27047 brandon   20   0 1488m 354m 2852 S 130.7  2.2   0:20.26 gb
27047 brandon   20   0 1634m 356m 2852 S 127.5  2.3   0:21.55 gb
27047 brandon   20   0 1634m 356m 2852 S 131.5  2.3   0:22.88 gb

用內建的 profiling 工具來找程式熱點

$ go tool pprof ./gb cpu-27047.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 1569 samples
     183  11.7%  11.7%      356  22.7% scanblock
      74   4.7%  16.4%       74   4.7% runtime.memmove
      60   3.8%  20.2%       60   3.8% runtime.casp
      59   3.8%  24.0%       59   3.8% runtime.xchg
      51   3.3%  27.2%       72   4.6% sweepspan
      50   3.2%  30.4%       50   3.2% runtime.memclr
      45   2.9%  33.3%       56   3.6% markonly
      39   2.5%  35.8%       66   4.2% flushptrbuf
      34   2.2%  37.9%       55   3.5% runtime.markallocated
      32   2.0%  40.0%       32   2.0% net/textproto.canonicalMIMEHeaderKey

scanblock method 佔用了 11% 的 cpu 資源,來查看看

(pprof) web scanblock
Total: 1569 samples
After focusing on 'scanblock': 356 samples of 1569 (22.7%)
Dropping nodes with <= 1 samples; edges with <= 0 abs(samples)
Loading web page file:////tmp/pprof27099.1.svg
Created new window in existing browser session.
(pprof)       

SVG File

按 scanblock 上方粗線追朔,看起來是 makeslice 建立許多 slice 使的 gc 很忙, 延粗線再往上找到 main package 的 main.(*HttpWorker).send 可能是問題來源

(pprof) list main.*.send
Total: 1569 samples
ROUTINE ====================== main.(*HttpWorker).send in /home/brandon/workspaces/go/src/bench/gb/http.go
     7    798 Total samples (flat / cumulative)
     .      .   49:         }
     .      .   50:     }
     .      .   51: 
     .      .   52: }
     .      .   53: 
---
     .      .   54: func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
     .      .   55: 
     .      4   56:     record := &Record{}
     .      .   57:     sw := &StopWatch{}
     .      .   58:     sw.Start()
     .      .   59: 
     .      .   60:     var contentSize int
     .      .   61: 
     .      .   62:     defer func() {
     .      .   63:         if r := recover(); r != nil {
     .      .   64:             record.Error = &ExceptionError{errors.New(fmt.Sprint(r))}
     .      .   65:         } else {
     .      .   66:             record.contentSize = contentSize
     .      .   67:             record.responseTime = sw.Elapsed
     .      .   68:         }
     .      .   69: 
     .      .   70:         if record.Error != nil {
     .      .   71:             TraceException(record.Error)
     .      .   72:         }
     .      .   73: 
     .      .   74:         executionResult <- record
     1      1   75:     }()
     .      .   76: 
     1    188   77:     resp, err := h.client.Do(request)
     .      .   78:     if err != nil {
     .     21   79:         record.Error = &ConnectError{err}
     .      .   80:         return
     .      .   81:     } else {
     .      3   82:         defer resp.Body.Close()
     .      .   83: 
     .      .   84:         if resp.StatusCode < 200 || resp.StatusCode > 300 {
     .      .   85:             record.Error = &ResponseError{err}
     .      .   86:             return
     .      .   87:         }
     .      .   88: 
     1    575   89:         body, err := ioutil.ReadAll(resp.Body)
     .      .   90: 
     .      .   91:         if err != nil {
     .      .   92:             record.Error = &ReceiveError{err}
     .      .   93:             return
     .      .   94:         }
     .      .   95:         contentSize = len(body)
     .      .   96: 
     4      4   97:         if contentSize != h.config.contentSize {
     .      .   98:             record.Error = &LengthError{invalidContnetSize}
     .      .   99:             return
     .      .  100:         }
     .      .  101: 
     .      .  102:     }
     .      .  103: 
     .      .  104:     sw.Stop()
     .      .  105: 
     .      2  106: }
---
     .      .  107: 
     .      .  108: func detectHost(config *Config) error {
     .      .  109:     var err error
     .      .  110:     defer func() {
     .      .  111:         if r := recover(); r != nil {
(pprof) 

ioutil.ReadAll(resp.Body) 內部佔用了最長的時間,可是沒看到有 make([]) 之類的語句

(pprof) list ioutil.ReadAll
(略)

(pprof) list ioutil.readAll
Total: 1569 samples
ROUTINE ====================== io/ioutil.readAll in /usr/lib/go/src/pkg/io/ioutil/ioutil.go
     4    574 Total samples (flat / cumulative)
     .      .   12:     "sort"
     .      .   13: )
     .      .   14: 
     .      .   15: // readAll reads from r until an error or EOF and returns the data it read
     .      .   16: // from the internal buffer allocated with a specified capacity.
---
     1      1   17: func readAll(r io.Reader, capacity int64) (b []byte, err error) {
     1     41   18:     buf := bytes.NewBuffer(make([]byte, 0, capacity))
     .      .   19:     // If the buffer overflows, we will get bytes.ErrTooLarge.
     .      .   20:     // Return that as an error. Any other panic remains.
     .      .   21:     defer func() {
     .      .   22:         e := recover()
     .      .   23:         if e == nil {
     .      .   24:             return
     .      .   25:         }
     .      .   26:         if panicErr, ok := e.(error); ok && panicErr == bytes.ErrTooLarge {
     .      .   27:             err = panicErr
     .      .   28:         } else {
     .      .   29:             panic(e)
     .      .   30:         }
     1      2   31:     }()
     1    527   32:     _, err = buf.ReadFrom(r)
     .      1   33:     return buf.Bytes(), err
     .      2   34: }
---
     .      .   35: 
     .      .   36: // ReadAll reads from r until an error or EOF and returns the data it read.
     .      .   37: // A successful call returns err == nil, not err == EOF. Because ReadAll is
     .      .   38: // defined to read from src until EOF, it does not treat an EOF from Read
     .      .   39: // as an error to be reported.
(pprof) list bytes.*ReadFrom
Total: 1569 samples
ROUTINE ====================== bytes.(*Buffer).ReadFrom in /usr/lib/go/src/pkg/bytes/buffer.go
     5    526 Total samples (flat / cumulative)
     .      .  145: 
     .      .  146: // ReadFrom reads data from r until EOF and appends it to the buffer, growing
     .      .  147: // the buffer as needed. The return value n is the number of bytes read. Any
     .      .  148: // error except io.EOF encountered during the read is also returned. If the
     .      .  149: // buffer becomes too large, ReadFrom will panic with ErrTooLarge.
---
     .      .  150: func (b *Buffer) ReadFrom(r io.Reader) (n int64, err error) {
     .      .  151:     b.lastRead = opInvalid
     .      .  152:     // If buffer is empty, reset to recover space.
     .      .  153:     if b.off >= len(b.buf) {
     1      2  154:         b.Truncate(0)
     .      .  155:     }
     .      .  156:     for {
     .      .  157:         if free := cap(b.buf) - len(b.buf); free < MinRead {
     .      .  158:             // not enough space at end
     .      .  159:             newBuf := b.buf
     .      .  160:             if b.off+free < MinRead {
     .      .  161:                 // not enough space using beginning of buffer;
     .      .  162:                 // double buffer capacity
     2    379  163:                 newBuf = makeSlice(2*cap(b.buf) + MinRead)
     .      .  164:             }
     .     13  165:             copy(newBuf, b.buf[b.off:])
     .      .  166:             b.buf = newBuf[:len(b.buf)-b.off]
     .      .  167:             b.off = 0
     .      .  168:         }
     1    131  169:         m, e := r.Read(b.buf[len(b.buf):cap(b.buf)])
     .      .  170:         b.buf = b.buf[0 : len(b.buf)+m]
     .      .  171:         n += int64(m)
     1      1  172:         if e == io.EOF {
     .      .  173:             break
     .      .  174:         }
     .      .  175:         if e != nil {
     .      .  176:             return n, e
     .      .  177:         }
     .      .  178:     }
     .      .  179:     return n, nil // err is EOF, so return nil explicitly
---
     .      .  180: }
     .      .  181: 
     .      .  182: // makeSlice allocates a slice of size n. If the allocation fails, it panics
     .      .  183: // with ErrTooLarge.
     .      .  184: func makeSlice(n int) []byte {

ioutil.readAll 的 L:18 可以分析問題是 main.(*HttpWorker).send 在每回呼叫 ioutil.ReadAll 都會動態產生一塊 Buffer 若在讀取資料的時候遇到 buffer size 不足,就再 makeSlice 產生更大的 buffer(bytes.ReadFrom:163)

為了避免一直創建 buffer 做了以下修改

type HttpWorker struct {
        // ignored
    readBuf   *bytes.Buffer
}

func NewHttpWorker(...ignored...) *HttpWorker {
    buf := bytes.NewBuffer(make([]byte, 0, config.contentSize+bytes.MinRead))
    return &HttpWorker{...ignored..., buf}
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    defer h.readBuf.Reset()

    var err error
    contentSize, err = h.readBuf.ReadFrom(resp.Body)

    // ignored
}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
(略)
Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   11.70 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    17094.59 [#/sec] (mean)
Time per request:       5.850 [ms] (mean)
Time per request:       0.058 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     170945.87 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   5.16    4   118

Percentage of the requests served within a certain time (ms)
 50%     4
 66%     5
 75%     6
 80%     7
 90%     9
 95%     10
 98%     12
 99%     14
 100%    118 (longest request)

NumGC:12 Pause:58.63ms Pause(Avg):40.91ms Overhead:3.96% Alloc:84.97M Sys:611.67M Alloc(Rate):77.23M/s Histogram:99.82ms 99.82ms 99.82ms
$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27989 brandon   20   0  753m 176m 2844 S  57.9  1.1   0:00.59 gb
27989 brandon   20   0 1012m 234m 2852 S 116.8  1.5   0:01.78 gb
27989 brandon   20   0 1058m 280m 2852 S 107.6  1.8   0:02.87 gb
27989 brandon   20   0 1105m 327m 2852 S 107.7  2.1   0:03.96 gb
27989 brandon   20   0 1194m 346m 2852 S 116.9  2.2   0:05.14 gb
27989 brandon   20   0 1195m 347m 2852 S 107.0  2.2   0:06.22 gb
27989 brandon   20   0 1195m 347m 2852 S 103.9  2.2   0:07.27 gb
27989 brandon   20   0 1275m 348m 2852 S 116.6  2.2   0:08.45 gb
27989 brandon   20   0 1277m 350m 2852 S 111.7  2.2   0:09.58 gb
27989 brandon   20   0 1277m 350m 2852 S 110.8  2.2   0:10.70 gb
27989 brandon   20   0 1349m 351m 2852 S 113.6  2.2   0:11.85 gb
27989 brandon   20   0 1359m 353m 2852 S 115.9  2.2   0:13.02 gb
$ go tool pprof ./gb cpu-27989.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 967 samples
      61   6.3%   6.3%       61   6.3% runtime.xchg
      44   4.6%  10.9%       44   4.6% runtime.memmove
      34   3.5%  14.4%       34   3.5% net/textproto.canonicalMIMEHeaderKey
      29   3.0%  17.4%       31   3.2% runtime.settype_flush
      27   2.8%  20.2%      227  23.5% runtime.mallocgc
      26   2.7%  22.9%       26   2.7% runtime.casp
      25   2.6%  25.4%       87   9.0% runtime.MCache_Alloc
      25   2.6%  28.0%       46   4.8% runtime.markallocated
      23   2.4%  30.4%       23   2.4% bytes.IndexByte
      23   2.4%  32.8%       92   9.5% selectgo

跟改變前相比 Alloc(Rate):346.57M/s 降到 Alloc(Rate):77.23M/s CPU 利用率也略有下降,相關 gc 停頓數值下降,throughput 也有較大的提升, 原先 top10 裡佔據 11.7% scanblock 也消失了

UPDATE

更省記憶體的改良辦法

const MaxBufferSize    = 8192

type HTTPWorker struct {
    //ignored
    discard   io.ReaderFrom
}

func NewHttpWorker(...ignored...) *HttpWorker {
    var buf []byte
    contentSize := context.GetInt(FieldContentSize)
    if contentSize < MaxBufferSize {
        buf = make([]byte, contentSize)
    } else {
        buf = make([]byte, MaxBufferSize)
    }

    return &HTTPWorker{
        //ignored
        &Discard{buf},
    }
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    contentSize, err = h.discard.ReadFrom(resp.Body)

    // ignored
}

type Discard struct {
    blackHole []byte
}

func (d *Discard) ReadFrom(r io.Reader) (n int64, err error) {
    readSize := 0
    for {
        readSize, err = r.Read(d.blackHole)
        n += int64(readSize)
        if err != nil {
            if err == io.EOF {
                return n, nil
            }
            return
        }
    }
}

附件 profiling 工具

Apr 3, 2013

ab 漸進壓力測試工具

最近在學Go,也用Go寫一個 ab-like 的玩具(Go-HttpBench),為了要比較兩者效能另外寫了 bash script 用來轉換 ab 報表為行列數據,功能跟 autobench 相似,用階段性漸增的壓力來測試 http server,搜集到的數據可再用 gnuplot 或用 excel 繪制圖型

Example:

brandon@brandon-laptop:~/workspaces/go/gb$ ./ab+.sh -k http://localhost/
Start benchmark
#/usr/bin/ab -c CONCURRENCY=(100 to 10000 step 100) -n NUM=(100*CONCURRENCY) -k http://localhost/
#start time,concurrency,complete requests,failed requests,tps,min,mean,stddev,median,max,90%,95%,99%
Wed Apr  3 11:52:29 CST 2013,100,10000,0,31272.48,0,3,5.4,0,32,11,16,22
Wed Apr  3 11:52:29 CST 2013,200,20000,0,32050.72,0,6,14.2,0,113,23,40,64
Wed Apr  3 11:52:30 CST 2013,300,30000,0,33527.94,0,5,11.6,0,207,16,28,53
Wed Apr  3 11:52:31 CST 2013,400,40000,0,32396.87,0,12,25.0,0,189,48,70,105
Wed Apr  3 11:52:32 CST 2013,500,50000,0,32492.75,0,15,35.9,0,391,51,109,166
Wed Apr  3 11:52:34 CST 2013,600,60000,0,33188.85,0,18,38.8,0,448,73,110,166
Wed Apr  3 11:52:36 CST 2013,700,70000,0,33436.38,0,21,46.9,0,483,71,134,221
Wed Apr  3 11:52:38 CST 2013,800,80000,0,33370.64,0,24,54.5,0,1015,74,145,264
Wed Apr  3 11:52:40 CST 2013,900,90000,0,34093.27,0,26,66.4,0,1150,78,133,260
Wed Apr  3 11:52:43 CST 2013,1000,100000,0,34448.17,0,27,68.4,2,1251,85,136,266
Wed Apr  3 11:52:46 CST 2013,1100,110000,0,34072.87,0,32,78.1,0,1129,101,170,368
Wed Apr  3 11:52:49 CST 2013,1200,120000,0,33332.79,0,36,86.2,0,1032,144,228,386
Wed Apr  3 11:52:53 CST 2013,1300,130000,0,34580.66,0,37,96.0,5,3349,116,201,346
Wed Apr  3 11:52:57 CST 2013,1400,140000,0,33776.24,0,41,96.3,0,689,153,262,475
Wed Apr  3 11:53:01 CST 2013,1500,150000,0,31418.26,0,47,114.7,0,1031,164,295,533
Wed Apr  3 11:53:06 CST 2013,1600,160000,0,33285.65,0,48,111.8,0,629,217,340,492
Wed Apr  3 11:53:11 CST 2013,1700,170000,0,33441.14,0,50,121.5,0,825,211,356,559
Wed Apr  3 11:53:16 CST 2013,1800,180000,0,31452.71,0,52,166.0,0,4253,163,294,496
Wed Apr  3 11:53:22 CST 2013,1900,190000,0,33781.89,0,56,139.8,0,1018,234,425,627
Wed Apr  3 11:53:28 CST 2013,2000,200000,0,33766.12,0,59,137.8,1,1335,265,398,624
...


ab+.sh 使用方式跟 ab 一樣,除了 -c 跟 -n 是由 script 動態來調整外,其它 ab 參數可以從命令列尾端代入,另外測試前要修改壓力參數來給定測試壓力範圍
  • LOW_CONCURRENCY 初始的 concurreny 數
  • HIGH_CONCURRENCY 最大的 concurreny 數
  • RATE_CONCURRENCY 每回測試 concurreny 的累加值
  • NUM_CALL 單個 concurrent user 要發送的 request 次數
執行結果會顯示在畫面上並輸出至當前目錄下的 csv 檔 e.g. localhost_20130403_1152.csv

ab+.sh
#!/bin/bash
# ab+
#
# os and network tuning
# ulimit -n 100000
# sudo echo "2048 64512" > /proc/sys/net/ipv4/ip_local_port_range
# sudo echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
# sudo echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
# sudo echo "10" > /proc/sys/net/ipv4/tcp_fin_timeout
# sudo echo "65535" > /proc/sys/net/core/somaxconn
# sudo echo "65535" > /proc/sys/net/ipv4/tcp_max_syn_backlog  

AB_PATH=/usr/bin/ab
SS_PATH=/bin/ss

LOW_CONCURRENCY=100
HIGH_CONCURRENCY=1000
CONCURRENCY_STEP=100
NUM_CALL=100

HOST=$(echo "$*" | perl -ne '/http.*?:\/\/([^\/?:#]*)/; print $1')
PREFIX=$HOST$(date +"_%Y%m%d_%H%M")
OUTPUT_FILE=$PREFIX.csv
MAX_OPEN_FILES=$(ulimit -n)

trap "exit;" SIGINT SIGTERM

if [ "$MAX_OPEN_FILES" -le $HIGH_CONCURRENCY ]; then
  echo "Warning: open file limit < HIGH_CONCURRENCY" 
  exit 1
fi

echo "Starting benchmark"

mkdir -p temp/"$PREFIX"

echo "#$AB_PATH -c CONCURRENCY=($LOW_CONCURRENCY to $HIGH_CONCURRENCY step $CONCURRENCY_STEP) -n NUM=($NUM_CALL*CONCURRENCY) $*" | tee "$OUTPUT_FILE"
echo "#start time,concurrency,complete requests,failed requests,tps,min,mean,stddev,median,max,90%,95%,99%" | tee -a "$OUTPUT_FILE"
for (( concurrency=LOW_CONCURRENCY; concurrency<=HIGH_CONCURRENCY; concurrency+=CONCURRENCY_STEP ))
do
  tempFile="temp/$PREFIX/c$concurrency-n$((NUM_CALL*concurrency)).log"
  echo "$AB_PATH -c $concurrency -n $((NUM_CALL*concurrency)) $*" > "$tempFile"
  
  startTime=$(date)
  $AB_PATH -c $concurrency -n $((NUM_CALL*concurrency)) "$@" >> "$tempFile" 2>&1

  if [ $? -ne 0 ]; then
    echo "Error: please check $tempFile"
    exit 1
  fi

  stats=$(egrep "Complete requests:|Failed requests:|Requests per second:|Total:" "$tempFile" | perl -pe 's/[^\d\.]+/ /g; s/^\s+//g;' | perl -pe 's/\s+$//g; s/\s+/,/g')
  histogram=$(egrep "90%|95%|99%" "$tempFile" | cut -c6- | perl -pe 's/[^\d\.]+/ /g; s/^\s+//g;' | perl -pe 's/\s+$//g; s/\s+/,/g')

  echo "$startTime,$concurrency,$stats,$histogram" | tee -a "$OUTPUT_FILE"

  tcpConnections=$($SS_PATH -s | egrep 'TCP:' | perl -ne '/(\d+)/; print $1')
  while [ $((tcpConnections+concurrency+CONCURRENCY_STEP)) -ge "$MAX_OPEN_FILES" ]
  do
    sleep 1
    tcpConnections=$($SS_PATH -s | egrep 'TCP:' | perl -ne '/(\d+)/; print $1')
  done
done

rm -rf temp

echo "Finished benchmark, see $OUTPUT_FILE"

May 10, 2012

Finagle 概述

Finagle 是基於JVM的網路框架/堆疊,開發者能在之上使用Java,Scala或任何基於JVM程式語言來開發非同步運算的RPC Client/Server,其專案是 Twitter 總結了實務上遇到的問題,所設計的分散式系統方案,並在2011的7月開放源始碼,目前功能如下表


Protocols

  • HTTP
  • Streaming HTTP (Comet)
  • Thrift
  • Memcached/Kestrel
  • More to come!

Server Support

  • Backpressure (to defend against abusive clients)
  • Service Registration (e.g., via Zookeeper)
  • Native OpenSSL bindings

Client Support

  • Connection Pooling
  • Load Balancing
  • Failure Detection
  • Failover/Retry
  • Distributed Tracing (a la Dapper)
  • Service Discovery (e.g., viaZookeeper)
  • Rich Statistics
  • Native OpenSSL bindings
  • Sharding


Apr 7, 2012

ScalaSTM 試用

STM(software transactional memory)是種 concurrency control 機制,相對於用 locking 進行同步化,更安全與簡單,以下使用帳戶轉帳的例子實作比較locking與STM

帳號轉帳案例
多組帳號同時以隨機金額進行轉帳,轉帳過程要保證沒有 race condition 與避免 deadlock