仙人掌世界: performance

Showing posts with label performance. Show all posts

Apr 19, 2013

golang - Profiling Go Programs 記錄(Part 5)

結論

這裡用的 Profiling 方法跟在 Java 沒什麼太大差別，仍是使用工具找出程式與資源間的關系, 會需要看點 Go 函式庫源始碼, 還好 Go net/http 函式庫的程式碼大部份都不會很長還滿好懂，過程也沒遇到什麼奇怪的狀況。

內建的 Profiling 工具某方面比 Java 內建的 JVisualVM 還要強大，能顯示關系圖，把數據與程式碼並列，像 block profiling 功能還是在 Solaris Studio 才有的進階選項, 雖然功能相較其它成熟語言的分析工具來的少，但一般狀況夠用了。

GC 目前沒找到專門的工具, 目前只能印出 gclog 來看數據，當 GC 頻率高的時候 log 就不好讀了，於是另外做了兩個小工具來協助分析。 profilingtool.go log2csv

其它一些 profiling 過程遇到零碎問題與解決方沒寫在文章裡，直接舉列在下方清單

Gotcha

defer func 堆疊在 function return 時才會釋放, 若包在迴圈使用沒釋放的話，profiling 時候會看到堆了很多 newdefer 跟 deferproc (由於忘了在哪裡踩到的，就沒寫在文章裡)

Standard Library

net 底下的 read/write 都是跑 asynchronous 的，所以 cpu profiling 展開 gosched0 後會是各別 goroutine 的執行片段，片段之間的連帶關係還是要靠自已去抓出來的，工具沒那麼聰明
Go1.1 有 timer.Reset 可以減少 Timer 重建
預設的 http package 裡沒有提供 timeout api，要把 http.Client 裡的 http.RoundTripper 整組換掉才能做到各階段連線與讀寫傳輸的 timeout 控制
- net/http: Default{Client,Transport} need better timeout support when used with connection pooling/redirects
- Go 1.1 可以用 timer + Transport.CancelRequest自行控制

Runtime

GC & Goroutine scheduling 不是 free 的, pprof 是抓漏好幫手
目前 GC 實作還是陽春了點，能調效 latency 的手段有限
- Go1.1runtime.SetGCPercent
- Go1.1runtime.FreeOSMemory
- 用 cgo 自已管記憶體分配?
GC 可以過透過 GOGC=off(or SetGCPercent(-1)) 配合 mem profiler 分析完整的 heap 配置狀況
channel send/recv 跟 select 的 contention 會造成瓶頸，不過還沒遇到這種狀況，表面的知道
- cpu profiling 可以觀察有這類操作在消耗 cpu 資源，進而分析要不要調整 channel 的 buffer size，或是更換同步方式與演算法
  - runtime.xchg / selectgo
  - runtime.chanrecv
  - runtime.chansend
- block profiling 可以看 goroutine 被 block 的 delay 時間

Tool

內建的 go test 有提供 benchmark 的測試方法, 也能參數化加入 profiler, 有寫了一些但還是覺得整合的壓力測試不太會寫...

Other Issues

預設的 Http over TLS CPU吃很兇,有查到問題後就沒再試下去
- net/http: ssl performance problem
- 可以接OpenSSL來解決Apcera Case Study: The selection of the Go language
- 在 server side 要用的話，也有 ssl-offload 的 software/hardware 可以擋在前端

TODO

多核環境(>8)觀察 contention 與 GC 效能

Reference:

Profiling Go Programs

Apr 18, 2013

golang - Profiling Go Programs 記錄(Part 4)

4. 快取計算結果

第2與3項只修正 bug 但對效能沒有直接改善，以目前的觀測結果，程式向系統申請的記憶體用量還有點高(350M左右)，接著來檢查 GC 與 heap 狀況看哪些物件申請了這些記憶體

用 GOGCTRACE=1 讓 GO runtime 把 gclog 印出來，將印出的訊息傳給 log2csv 轉成 csv 檔案

$ GOGCTRACE=1 ./gb -c 100 -n 200000 -k http://localhost/10k.dat 2>&1 | log2csv -t -o gc.csv

unixtime,numgc,nproc,mark,sweep,cleanup,heap0,heap1,obj0,obj1,nmalloc,nfree,nhandoff,nhandoffcnt,nsteal,nstealcnt,nprocyield,nosyield,nsleep
1366278874.698135,1,1,0,0,0,0,0,24,26,27,1,0,0,0,0,0,0,0
1366278874.701761,2,2,1,0,0,2,2,608,518,642,124,0,0,5,31,22,4,0
1366278874.707143,3,2,4,0,0,4,4,993,985,1118,133,0,0,2,41,22,15,6
1366278874.725975,4,2,6,0,0,9,9,36182,36148,36317,169,6,187,2,78,70,24,17
1366278874.767119,5,2,13,0,0,18,18,113518,113518,113688,170,3,87,2,301,42,23,21
1366278874.837911,6,2,16,1,0,37,37,268246,268246,268417,171,4,93,2,48,54,20,11
1366278874.982132,7,2,35,2,0,75,75,577694,577694,577866,172,6,287,2,306,42,23,31
1366278875.267363,8,2,67,5,1,150,150,1196615,1196615,1196788,173,5,182,2,2063,32,23,11
1366278877.850924,9,2,64,35,8,301,154,2212070,1183694,2284625,1100931,8,502,13,16101,16,14,2
1366278879.561168,10,2,63,23,5,309,147,2233288,1119515,3411483,2291968,2,47,4,10961,32,16,13
1366278881.097839,11,2,70,40,2,294,139,2121182,1054221,4486980,3432759,2,17,4,17879,32,5,0
1366278882.543087,12,2,57,24,7,278,131,2000935,993438,5503542,4510104,1,2,3,9999,19,12,12
1366278883.873273,13,2,48,25,2,262,124,1888388,935583,6464524,5528941,3,152,2,10759,26,22,2
(略)

gc.csv 再用 gnuplot 將 csv 檔案裡的數據繪制成圖表 gnuplot-heap.p

$ gnuplot gnuplot-heap.p

看起來是程式開始申請了過量的記憶體，GC 清空後 runtime 也沒把多餘的空間再還給作業系統，再分析 gc.csv 後找到大量申請時間約在第6-10次gc, 於程式啟動後第5秒的時候達到最大值

為了要找出開始大量申請空間的物件，下面的程式在第1秒與第7秒各 dump 一次 heap

func main() {
    SetMemProfileRate(1)
    go func() {
        time.Sleep(time.Duration(1) * time.Second)
        DumpHeap()
        time.Sleep(time.Duration(6) * time.Second)
        DumpHeap()
    }()
    // ignored
    PrintGCSummary()
}

$ ./gb -c 100 -n 200000 http://localhost/10k.dat
(略)
NumGC:56 Pause:113.57ms Pause(Avg):425.33ms Overhead:30.54% Alloc:22.53M Sys:662.14M Alloc(Rate):43.24M/s Histogram:996.58ms 1.07s 1.07s

在 go tool pprof 加上 --base 顯示兩個 profile 相減後的數據

$ go tool pprof ./gb --base heap-19221-1.pprof heap-19221-2.pprof 
(pprof) top
Total: 74.3 MB
    35.6  47.9%  47.9%     35.6  47.9% net/textproto.MIMEHeader.Set
    26.3  35.4%  83.3%     37.1  50.0% net/http.NewRequest
    10.8  14.6%  97.9%     10.8  14.6% net/url.parse
     1.5   2.1% 100.0%      1.5   2.1% net/textproto.MIMEHeader.Add
     0.0   0.0% 100.0%      0.0   0.0% resizefintab
     0.0   0.0% 100.0%      0.0   0.0% itab
     0.0   0.0% 100.0%      0.0   0.0% os.NewFile
     0.0   0.0% 100.0%      0.0   0.0% concatstring
(pprof) top -cum
Total: 74.3 MB
     0.0   0.0%   0.0%     74.3 100.0% gosched0
     0.0   0.0%   0.0%     74.3 100.0% main.(*Benchmark).Run
     0.0   0.0%   0.0%     74.3 100.0% main.NewHttpRequest
     0.0   0.0%   0.0%     74.3 100.0% main.main
     0.0   0.0%   0.0%     74.3 100.0% main.startBenchmark
     0.0   0.0%   0.0%     74.3 100.0% runtime.main
    26.3  35.4%  35.4%     37.1  50.0% net/http.NewRequest
     0.0   0.0%  35.4%     35.6  47.9% net/http.Header.Set
    35.6  47.9%  83.3%     35.6  47.9% net/textproto.MIMEHeader.Set
     0.0   0.0%  83.3%     10.8  14.6% net/url.Parse

比較兩個 heap 過後，總共增加了 74.3 MB，從top -cum顯示這 74.3MB 100.0% 是 main.(*Benchmark).Run 底下產生的

(pprof) web main.\(\*Benchmark\).Run

SVG File

(pprof) list main.\(\*Benchmark\).Run
Total: 74.3 MB
ROUTINE ====================== main.(*Benchmark).Run in /home/brandon/workspaces/go/src/bench/gb/benchmark.go
   0.0   74.3 Total MB (flat / cumulative)
     .      .   27:     monitor := NewMonitor(config, collector, start, stop)
     .      .   28: 
     .      .   29:     return &Benchmark{config, collector, monitor, start, stop}
     .      .   30: }
     .      .   31: 
---
     .      .   32: func (b *Benchmark) Run() {
     .      .   33: 
     .      .   34:     go b.monitor.Run()
     .      .   35: 
     .      .   36:     jobs := make(chan *http.Request, b.config.requests)
     .      .   37: 
     .      .   38:     for i := 0; i < b.config.concurrency; i++ {
     .      .   39:         go NewHttpWorker(b.config, jobs, b.collector, b.start, b.stop).Run()
     .      .   40:     }
     .      .   41: 
     .      .   42:     for i := 0; i < b.config.requests; i++ {
     .   74.3   43:         newRequest, _ := NewHttpRequest(b.config)
     .      .   44:         jobs <- newRequest
     .      .   45:     }
     .      .   46:     close(jobs)
     .      .   47: 
     .      .   48:     close(b.start)
     .      .   49:     <-b.stop
     .      .   50: }
---
(pprof)

main.(*Benchmark).Run 的 L:43 NewHttpRequest 增加了 74.3 MB 的記憶體，從程式行為來分析，問題是L：42的迴圈在一開始就產生了所有 NewHttpRequest，另一個是 NewHttpRequest 是個昂貴的操作為了建立 http.NewRequest 衍生使用了 textproto.MIMEHeader.Set 與 net/url.parse

優化方法：複製 httpRequest 結果避免同樣的計算與空間浪費

func (b *Benchmark) Run() {

    go b.monitor.Run()

    jobs := make(chan *http.Request, b.config.requests)

    for i := 0; i < b.config.concurrency; i++ {
        go NewHttpWorker(b.config, jobs, b.collector, b.start, b.stop).Run()
    }
    base, _ := NewHttpRequest(b.config)
    for i := 0; i < b.config.requests; i++ {
        newRequest := CopyHttpRequest(b.config, base)
        jobs <- newRequest
    }
    close(jobs)

    close(b.start)
    <-b.stop
}

func CopyHttpRequest(config *Config, request *http.Request) *http.Request {

    if config.method == "POST" || config.method == "PUT" {
        newRequest := *request
        if newRequest.Body != nil {
            newRequest.Body = ioutil.NopCloser(bytes.NewReader(config.bodyContent))
        }
        return &newRequest
    } else {
        return request
    }

}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
(略)
Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   12.14 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    16471.31 [#/sec] (mean)
Time per request:       6.071 [ms] (mean)
Time per request:       0.061 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     164713.14 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   3.50    4   38

Percentage of the requests served within a certain time (ms)
 50%     4
 66%     5
 75%     7
 80%     7
 90%     10
 95%     12
 98%     14
 99%     17
 100%    38 (longest request)

NumGC:49 Pause:6.15ms Pause(Avg):7.52ms Overhead:3.00% Alloc:18.19M Sys:284.25M Alloc(Rate):64.70M/s Histogram:10.50ms 13.11ms 13.11ms

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 9529 brandon   20   0  732m  20m 2860 S  29.7  0.1   0:00.30 gb
 9529 brandon   20   0  942m  23m 2864 S 103.6  0.1   0:01.35 gb
 9529 brandon   20   0 1016m  24m 2864 S 105.1  0.2   0:02.41 gb
 9529 brandon   20   0 1016m  24m 2864 S 103.5  0.2   0:03.46 gb
 9529 brandon   20   0 1088m  25m 2864 S 104.1  0.2   0:04.51 gb
 9529 brandon   20   0 1088m  25m 2864 S 104.0  0.2   0:05.56 gb
 9529 brandon   20   0 1088m  26m 2864 S 102.2  0.2   0:06.59 gb
 9529 brandon   20   0 1088m  26m 2864 S 103.8  0.2   0:07.64 gb
 9529 brandon   20   0 1160m  26m 2864 S 104.2  0.2   0:08.69 gb
 9529 brandon   20   0 1160m  26m 2864 S 105.1  0.2   0:09.75 gb
 9529 brandon   20   0 1160m  27m 2864 S 103.2  0.2   0:10.79 gb
 9529 brandon   20   0 1160m  27m 2864 S 103.1  0.2   0:11.83 gb

向系統申請記憶體終於在這次測試裡降下來(25MB vs 350MB)，平均 GC 停頓時間也降低(7.52ms vs 40.91ms)

gc2.csv

golang - Profiling Go Programs 記錄(Part 3)

3. 正常結束 Goroutine

$ go tool pprof ./gb heap-6438-1.pprof
(pprof) top10
Total: 8.6 MB
     1.5  17.8%  17.8%      3.0  35.3% main.(*Benchmark).Run
     1.5  17.8%  35.6%      1.5  17.8% main.(*Monitor).Run
     1.5  17.8%  53.4%      1.5  17.8% main.NewBenchmark
     1.5  17.5%  70.9%      1.5  17.5% net/textproto.MIMEHeader.Set
     1.0  11.7%  82.5%      1.0  11.7% bufio.NewReaderSize
     1.0  11.7%  94.2%      1.0  11.7% bufio.NewWriterSize
     0.5   5.8% 100.0%      0.5   5.8% newdefer
     0.0   0.0% 100.0%      0.5   5.8% bufio.(*Writer).Flush
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewReader
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewWriter

程式結束前仍有 8.6 MB 的 object 沒有被 GC 回收，但從搜集到的 heap 資訊看不出來有任何存在的 object reference 改從 goroutine 的 callstack 來追查是否為這些物件的 GC root

func main() {
    SetMemProfileRate(1) //(請參考第一篇附件 profilingtool.go)
    // ignored
    GC()    
    DumpHeap()
    PrintGCSummary()
    panic("show me the callstack")
}

在 Go 裡面只要不去攔截 panic 訊息就會把所有的 goroutine callstack 都印出來，所以這裡降低 gorutine 的數量並且把輸出訊息導到 err.log 避免 terminal buffer 不足導致訊息被清掉

$ ./gb -c 2 -n 10 -k http://localhost/10k.dat 2>&1 | tee err.log
(略)

goroutine 1 [running]:
main.main()
    /home/brandon/workspaces/go/src/bench/gb/gb.go:66 +0x129

(略）
goroutine 9 [chan receive]:
main.(*HttpWorker).Run(0xc200098e00)
    /home/brandon/workspaces/go/src/bench/gb/http.go:37 +0x57
created by main.(*Benchmark).Run
    /home/brandon/workspaces/go/src/bench/gb/benchmark.go:39 +0xbb

goroutine 10 [chan receive]:
main.(*HttpWorker).Run(0xc200098e40)
    /home/brandon/workspaces/go/src/bench/gb/http.go:37 +0x57
created by main.(*Benchmark).Run
    /home/brandon/workspaces/go/src/bench/gb/benchmark.go:39 +0xbb
(略）

從 callstacks 可以看到 goroutine 9 / 10 的狀態都是 [chan receive] 並且停在 http.go:37

func (h *HttpWorker) Run() {
    <-h.start

    for job := range h.jobs { // L:37
        //ignored
        go h.send(job)
        //ignored
    }
}

狀況是 worker 在等 jobs channel 的 job，而 Benchmark 也已經結束不再傳送 job 給 jobs channel

修正方法

func (b *Benchmark) Run() {

    go b.monitor.Run()

    jobs := make(chan *http.Request, b.config.requests)

    for i := 0; i < b.config.concurrency; i++ {
        go NewHttpWorker(b.config, jobs, b.collector, b.start, b.stop).Run()
    }

    for i := 0; i < b.config.requests; i++ {
        request, _ := NewHttpRequest(b.config)
        jobs <- request
    }
    close(jobs) // 已經不會再建立新 job 就可以關了, 別讓接收 job 的 httpworker 讓空的 jobs channel 給卡住

    close(b.start)
    <-b.stop
}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
$ go tool pprof ./gb heap-32397-1.pprof
(pprof) top
Total: 0.2 MB
     0.0  30.6%  30.6%      0.0  30.6% runtime.cnew
     0.0  14.9%  45.5%      0.0  15.4% main.NewHttpWorker
     0.0  12.2%  57.8%      0.0  12.2% newdefer
     0.0   7.5%  65.3%      0.0   7.5% bufio.NewReaderSize
     0.0   7.5%  72.7%      0.0   7.5% bufio.NewWriterSize
     0.0   4.9%  77.6%      0.0   4.9% io/ioutil.blackHole
     0.0   4.8%  82.4%      0.0  24.8% net/http.(*Transport).dialConn
     0.0   3.5%  85.9%      0.0   3.5% runtime.malg
     0.0   1.2%  87.1%      0.0   1.2% net/textproto.(*Reader).ReadMIMEHeader
     0.0   1.1%  88.2%      0.0   1.6% flag.(*FlagSet).Var

程式離開前的記憶體從 8.6 MB 降到 0.2 MB

golang - Profiling Go Programs 記錄(Part 2)

2. 程式化停止 Timer

$ go tool pprof ./gb cpu-27989.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 967 samples
      61   6.3%   6.3%       61   6.3% runtime.xchg
      44   4.6%  10.9%       44   4.6% runtime.memmove
      34   3.5%  14.4%       34   3.5% net/textproto.canonicalMIMEHeaderKey
      29   3.0%  17.4%       31   3.2% runtime.settype_flush
      27   2.8%  20.2%      227  23.5% runtime.mallocgc
      26   2.7%  22.9%       26   2.7% runtime.casp
      25   2.6%  25.4%       87   9.0% runtime.MCache_Alloc
      25   2.6%  28.0%       46   4.8% runtime.markallocated
      23   2.4%  30.4%       23   2.4% bytes.IndexByte
      23   2.4%  32.8%       92   9.5% selectgo

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27989 brandon   20   0  753m 176m 2844 S  57.9  1.1   0:00.59 gb
27989 brandon   20   0 1012m 234m 2852 S 116.8  1.5   0:01.78 gb
27989 brandon   20   0 1058m 280m 2852 S 107.6  1.8   0:02.87 gb
27989 brandon   20   0 1105m 327m 2852 S 107.7  2.1   0:03.96 gb
27989 brandon   20   0 1194m 346m 2852 S 116.9  2.2   0:05.14 gb
27989 brandon   20   0 1195m 347m 2852 S 107.0  2.2   0:06.22 gb
27989 brandon   20   0 1195m 347m 2852 S 103.9  2.2   0:07.27 gb
27989 brandon   20   0 1275m 348m 2852 S 116.6  2.2   0:08.45 gb
27989 brandon   20   0 1277m 350m 2852 S 111.7  2.2   0:09.58 gb
27989 brandon   20   0 1277m 350m 2852 S 110.8  2.2   0:10.70 gb
27989 brandon   20   0 1349m 351m 2852 S 113.6  2.2   0:11.85 gb
27989 brandon   20   0 1359m 353m 2852 S 115.9  2.2   0:13.02 gb

這時的 cpu profiling 已經比較分散平均，看不太出明顯熱點，另外從 top 數據來看 350m 的記憶體用量有些高，改用 heap profiler 觀察看看是否有 memory leak 的狀況

在程式內加入 Mem Profiling 程式

func main() {
    SetMemProfileRate(1) //(請參考附件 profilingtool.go)
    // ignored
    GC()    
    DumpHeap()
    PrintGCSummary()
}

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
（略）
Concurrency Level:      100
Time taken for tests:   25.33 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    7896.70 [#/sec] (mean)
Time per request:       12.664 [ms] (mean)
Time per request:       0.127 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     78966.96 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   76.91   6   1866

Percentage of the requests served within a certain time (ms)
 50%     6
 66%     8
 75%     10
 80%     11
 90%     14
 95%     16
 98%     18
 99%     21
 100%    1866 (longest request)

NumGC:13 Pause:731.86ms Pause(Avg):604.31ms Overhead:27.92% Alloc:46.39M Sys:696.17M Alloc(Rate):33.91M/s Histogram:1.85s 1.85s 1.85s

這裡用設定的 sample rate 比較小，搜集的資料比較完整，相對也比較多效能消耗，這裡的平均停頓時間就高了許多(604.31ms vs 40.91ms)

brandon@brandon-laptop:~/workspaces/go/src/bench/gb$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
29236 brandon   20   0  487m  48m 2944 S  48.1  0.3   0:00.49 gb
29236 brandon   20   0  595m 156m 2944 S 102.1  1.0   0:01.52 gb
29236 brandon   20   0  808m 233m 2956 S 123.9  1.5   0:02.77 gb
29236 brandon   20   0  916m 278m 2956 S 137.8  1.8   0:04.16 gb
29236 brandon   20   0  960m 322m 2956 S 137.7  2.0   0:05.55 gb
29236 brandon   20   0 1004m 366m 2956 S 135.7  2.3   0:06.92 gb
29236 brandon   20   0 1047m 410m 2956 S 136.7  2.6   0:08.30 gb
29236 brandon   20   0 1072m 434m 2956 S 153.2  2.7   0:09.85 gb
29236 brandon   20   0 1072m 434m 2956 S 171.4  2.7   0:11.58 gb
29236 brandon   20   0 1072m 434m 2956 S 141.8  2.7   0:13.01 gb
29236 brandon   20   0 1072m 435m 2956 S 137.7  2.8   0:14.40 gb
29236 brandon   20   0 1072m 436m 2956 S 132.6  2.8   0:15.74 gb
29236 brandon   20   0 1072m 436m 2956 S 135.7  2.8   0:17.11 gb
29236 brandon   20   0 1072m 436m 2956 S 135.3  2.8   0:18.48 gb
29236 brandon   20   0 1072m 436m 2956 S 168.4  2.8   0:20.18 gb
29236 brandon   20   0 1072m 436m 2956 S 153.7  2.8   0:21.73 gb
29236 brandon   20   0 1073m 438m 2956 S 126.6  2.8   0:23.01 gb
29236 brandon   20   0 1073m 438m 2956 S 130.8  2.8   0:24.33 gb
29236 brandon   20   0 1073m 438m 2956 S 132.8  2.8   0:25.67 gb
29236 brandon   20   0 1073m 438m 2956 S 136.6  2.8   0:27.05 gb
29236 brandon   20   0 1073m 438m 2956 S 156.5  2.8   0:28.63 gb
29236 brandon   20   0 1073m 438m 2956 S 143.8  2.8   0:30.08 gb
29236 brandon   20   0 1073m 439m 2956 S 126.5  2.8   0:31.36 gb
29236 brandon   20   0 1076m 441m 2956 S 134.6  2.8   0:32.72 gb
29236 brandon   20   0 1076m 441m 2956 S 140.6  2.8   0:34.14 gb
29236 brandon   20   0 1076m 441m 2956 S 153.5  2.8   0:35.69 gb
29236 brandon   20   0 1076m 441m 2956 S 139.6  2.8   0:37.10 gb
29236 brandon   20   0 1076m 441m 2956 S 131.7  2.8   0:38.43 gb

$ go tool pprof ./gb heap-29236-1.pprof
(pprof) top10
Total: 42.4 MB
    33.6  79.1%  79.1%     35.6  83.9% time.NewTimer
     2.0   4.8%  83.9%      2.0   4.8% addtimer
     1.5   3.6%  87.5%      1.5   3.6% main.NewBenchmark
     1.5   3.6%  91.1%      2.9   6.7% main.(*Benchmark).Run
     1.2   2.8%  93.9%      1.2   2.9% main.NewHttpWorker
     0.6   1.5%  95.4%      0.6   1.5% newdefer
     0.6   1.4%  96.8%      0.6   1.4% bufio.NewReaderSize
     0.4   1.0%  97.8%      0.4   1.0% bufio.NewWriterSize
     0.3   0.7%  98.5%      1.6   3.8% net/http.(*Transport).dialConn
     0.1   0.3%  98.8%      0.1   0.3% runtime.malg

程式要結束前還有 42.4 MB 沒被回收，有 memory leak

(pprof) web time.NewTimer

SVG File

從圖中來看 timer.NewTimer 是 main.(*HttpWorker).Run 裡面的 time.After(d timer.Duration) 建立的

(pprof) list main.\(\*HttpWorker\).Run
Total: 42.4 MB
ROUTINE ====================== main.(*HttpWorker).Run in /home/brandon/workspaces/go/src/bench/gb/http.go
   0.0   35.6 Total MB (flat / cumulative)
     .      .   29: 
     .      .   30: var (
     .      .   31:     invalidContnetSize = errors.New("invalid content size")
     .      .   32: )
     .      .   33: 
---
     .      .   34: func (h *HttpWorker) Run() {
     .      .   35:     <-h.start
     .      .   36: 
     .      .   37:     for job := range h.jobs {
     .      .   38: 
   0.0    0.0   39:         executionResult := make(chan *Record)
     .      .   40: 
     .    0.0   41:         go h.send(job, executionResult)
     .      .   42: 
     .      .   43:         select {
     .      .   44:         case record := <-executionResult:
     .      .   45:             h.collector <- record
     .      .   46: 
     .   35.6   47:         case <-time.After(time.Duration(MAX_RESPONSE_TIMEOUT) * time.Second):
     .      .   48:             h.collector <- &Record{Error: errors.New("execution timeout")}
     .      .   49: 
     .      .   50:         case <-h.stop:
     .      .   51:             return
     .      .   52:         }
     .      .   53:     }
     .      .   54: 
     .      .   55: }
---
     .      .   56: 
     .      .   57: func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
     .      .   58: 
     .      .   59:     record := &Record{}
     .      .   60:     sw := &StopWatch{}

select 裡包 time.After(Duration)這種 timeout control用法，在官方文件裡有很多地方都有使用，追查過後才知道，這方法使用的 Timer 等時間過期就會被 GC 回收了，但這裡的狀況是需要對每組 http request/response 各別做 timeout control，會大量使用到 Timer 若要放給 GC 回收會有點浪費記憶體，也會額外增加 GC 掃描的時間

修正方法

func (h *HttpWorker) Run() {
    <-h.start

    for job := range h.jobs {

        timeout := time.NewTimer(time.Duration(MAX_RESPONSE_TIMEOUT) * time.Second)
        executionResult := make(chan *Record)

        go h.send(job, executionResult)

        select {
        case record := <-executionResult:
            h.collector <- record

        case <-timeout.C:
            h.collector <- &Record{Error: errors.New("execution timeout")}

        case <-h.stop:
            timeout.Stop()  
            return
        }
        timeout.Stop()      
    }

}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat 
$ go tool pprof ./gb heap-6438-1.pprof
(pprof) top10
Total: 8.6 MB
     1.5  17.8%  17.8%      3.0  35.3% main.(*Benchmark).Run
     1.5  17.8%  35.6%      1.5  17.8% main.(*Monitor).Run
     1.5  17.8%  53.4%      1.5  17.8% main.NewBenchmark
     1.5  17.5%  70.9%      1.5  17.5% net/textproto.MIMEHeader.Set
     1.0  11.7%  82.5%      1.0  11.7% bufio.NewReaderSize
     1.0  11.7%  94.2%      1.0  11.7% bufio.NewWriterSize
     0.5   5.8% 100.0%      0.5   5.8% newdefer
     0.0   0.0% 100.0%      0.5   5.8% bufio.(*Writer).Flush
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewReader
     0.0   0.0% 100.0%      1.0  11.7% bufio.NewWriter

time.NewTimer 從清單中消失了，接下來還有 8.6 MB 的空間在程式結束前沒被回收，下一篇繼續來追蹤

golang - Profiling Go Programs 記錄(Part 1)

利用了 profiling 工具解決了一些 Go-HttpBench 的 bug 並做了些效能優化

以 ./gb -c 100 -n 200000 -k http://localhost/10k.dat 來比較版本 0.1.0 與 0.1.6 的結果(based on Go1.1beta2)

        Mem(RES)   Requests/s   Alloc(Rate)   GC-Overhead   Pause(Avg)   Pause(Max)
0.1.0       354m     12119.99     397.00M/s        25.40%      56.72ms     153.25ms
0.1.6        23m     16976.76      67.48M/s         2.50%       5.12ms      10.07ms
delta(%)    ↓93%         ↑40%          ↓83%        ↓22.9%         ↓91%         ↓93%

Response time   0.1.0   0.1.6
 50%                4       4
 66%                6       5
 75%                7       6
 80%                8       7
 90%               10       9
 95%               12      11
 98%               57      13
 99%               70      15
 100%             179     133

優化過程主要做了下列四項改變，以下將包含總結分五個章節來發佈

1. 重複使用 Buffer

先在程式內加入 CPU Profiling 程式，分析有沒有明顯的效能瓶頸 (用的指令請參考附件 profilingtool.go)

func main() {
    StartCPUProfile()
    // ignored
    StopCPUProfile()
    PrintGCSummary()
}

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat

This is GoHttpBench, Version 0.1.0, https://github.com/parkghost/gohttpbench
Author: Brandon Chen, Email: parkghost@gmail.com
Licensed under the Apache License, Version 2.0

Benchmarking localhost (be patient)
Completed 20000 requests
Completed 40000 requests
Completed 60000 requests
Completed 80000 requests
Completed 100000 requests
Completed 120000 requests
Completed 140000 requests
Completed 160000 requests
Completed 180000 requests
Completed 200000 requests
Finished 200000 requests

Server Software:        nginx/1.2.1
Server Hostname:        localhost
Server Port:            80

Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   19.00 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    10526.49 [#/sec] (mean)
Time per request:       9.500 [ms] (mean)
Time per request:       0.095 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     105264.94 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   12.79   5   164

Percentage of the requests served within a certain time (ms)
 50%     5
 66%     7
 75%     8
 80%     9
 90%     12
 95%     16
 98%     60
 99%     77
 100%    164 (longest request)

NumGC:77 Pause:41.17ms Pause(Avg):61.92ms Overhead:24.21% Alloc:48.57M Sys:614.81M Alloc(Rate):346.57M/s Histogram:108.41ms 124.72ms 124.72ms

用 top 搜集 gb 相關系統資源使用狀況

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27047 brandon   20   0  597m  10m 2840 S   4.9  0.1   0:00.05 gb
27047 brandon   20   0 1054m 326m 2848 S 123.9  2.1   0:01.30 gb
27047 brandon   20   0 1343m 349m 2852 S 114.8  2.2   0:02.46 gb
27047 brandon   20   0 1343m 349m 2852 S 100.0  2.2   0:03.47 gb
27047 brandon   20   0 1343m 350m 2852 S 125.1  2.2   0:04.74 gb
27047 brandon   20   0 1343m 350m 2852 S 134.4  2.2   0:06.10 gb
27047 brandon   20   0 1343m 350m 2852 S 132.8  2.2   0:07.44 gb
27047 brandon   20   0 1343m 350m 2852 S 124.7  2.2   0:08.70 gb
27047 brandon   20   0 1343m 351m 2852 S  95.9  2.2   0:09.67 gb
27047 brandon   20   0 1487m 351m 2852 S  99.0  2.2   0:10.67 gb
27047 brandon   20   0 1487m 351m 2852 S 109.6  2.2   0:11.78 gb
27047 brandon   20   0 1487m 351m 2852 S  85.2  2.2   0:12.64 gb
27047 brandon   20   0 1488m 353m 2852 S 104.5  2.2   0:13.70 gb
27047 brandon   20   0 1488m 353m 2852 S 128.1  2.2   0:15.00 gb
27047 brandon   20   0 1488m 353m 2852 S 129.6  2.2   0:16.31 gb
27047 brandon   20   0 1488m 353m 2852 S 129.3  2.2   0:17.62 gb
27047 brandon   20   0 1488m 354m 2852 S 130.6  2.2   0:18.94 gb
27047 brandon   20   0 1488m 354m 2852 S 130.7  2.2   0:20.26 gb
27047 brandon   20   0 1634m 356m 2852 S 127.5  2.3   0:21.55 gb
27047 brandon   20   0 1634m 356m 2852 S 131.5  2.3   0:22.88 gb

用內建的 profiling 工具來找程式熱點

$ go tool pprof ./gb cpu-27047.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 1569 samples
     183  11.7%  11.7%      356  22.7% scanblock
      74   4.7%  16.4%       74   4.7% runtime.memmove
      60   3.8%  20.2%       60   3.8% runtime.casp
      59   3.8%  24.0%       59   3.8% runtime.xchg
      51   3.3%  27.2%       72   4.6% sweepspan
      50   3.2%  30.4%       50   3.2% runtime.memclr
      45   2.9%  33.3%       56   3.6% markonly
      39   2.5%  35.8%       66   4.2% flushptrbuf
      34   2.2%  37.9%       55   3.5% runtime.markallocated
      32   2.0%  40.0%       32   2.0% net/textproto.canonicalMIMEHeaderKey

scanblock method 佔用了 11% 的 cpu 資源，來查看看

(pprof) web scanblock
Total: 1569 samples
After focusing on 'scanblock': 356 samples of 1569 (22.7%)
Dropping nodes with <= 1 samples; edges with <= 0 abs(samples)
Loading web page file:////tmp/pprof27099.1.svg
Created new window in existing browser session.
(pprof)

SVG File

按 scanblock 上方粗線追朔，看起來是 makeslice 建立許多 slice 使的 gc 很忙，延粗線再往上找到 main package 的 main.(*HttpWorker).send 可能是問題來源

(pprof) list main.*.send
Total: 1569 samples
ROUTINE ====================== main.(*HttpWorker).send in /home/brandon/workspaces/go/src/bench/gb/http.go
     7    798 Total samples (flat / cumulative)
     .      .   49:         }
     .      .   50:     }
     .      .   51: 
     .      .   52: }
     .      .   53: 
---
     .      .   54: func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
     .      .   55: 
     .      4   56:     record := &Record{}
     .      .   57:     sw := &StopWatch{}
     .      .   58:     sw.Start()
     .      .   59: 
     .      .   60:     var contentSize int
     .      .   61: 
     .      .   62:     defer func() {
     .      .   63:         if r := recover(); r != nil {
     .      .   64:             record.Error = &ExceptionError{errors.New(fmt.Sprint(r))}
     .      .   65:         } else {
     .      .   66:             record.contentSize = contentSize
     .      .   67:             record.responseTime = sw.Elapsed
     .      .   68:         }
     .      .   69: 
     .      .   70:         if record.Error != nil {
     .      .   71:             TraceException(record.Error)
     .      .   72:         }
     .      .   73: 
     .      .   74:         executionResult <- record
     1      1   75:     }()
     .      .   76: 
     1    188   77:     resp, err := h.client.Do(request)
     .      .   78:     if err != nil {
     .     21   79:         record.Error = &ConnectError{err}
     .      .   80:         return
     .      .   81:     } else {
     .      3   82:         defer resp.Body.Close()
     .      .   83: 
     .      .   84:         if resp.StatusCode < 200 || resp.StatusCode > 300 {
     .      .   85:             record.Error = &ResponseError{err}
     .      .   86:             return
     .      .   87:         }
     .      .   88: 
     1    575   89:         body, err := ioutil.ReadAll(resp.Body)
     .      .   90: 
     .      .   91:         if err != nil {
     .      .   92:             record.Error = &ReceiveError{err}
     .      .   93:             return
     .      .   94:         }
     .      .   95:         contentSize = len(body)
     .      .   96: 
     4      4   97:         if contentSize != h.config.contentSize {
     .      .   98:             record.Error = &LengthError{invalidContnetSize}
     .      .   99:             return
     .      .  100:         }
     .      .  101: 
     .      .  102:     }
     .      .  103: 
     .      .  104:     sw.Stop()
     .      .  105: 
     .      2  106: }
---
     .      .  107: 
     .      .  108: func detectHost(config *Config) error {
     .      .  109:     var err error
     .      .  110:     defer func() {
     .      .  111:         if r := recover(); r != nil {
(pprof)

ioutil.ReadAll(resp.Body) 內部佔用了最長的時間，可是沒看到有 make([]) 之類的語句

(pprof) list ioutil.ReadAll
（略)

(pprof) list ioutil.readAll
Total: 1569 samples
ROUTINE ====================== io/ioutil.readAll in /usr/lib/go/src/pkg/io/ioutil/ioutil.go
     4    574 Total samples (flat / cumulative)
     .      .   12:     "sort"
     .      .   13: )
     .      .   14: 
     .      .   15: // readAll reads from r until an error or EOF and returns the data it read
     .      .   16: // from the internal buffer allocated with a specified capacity.
---
     1      1   17: func readAll(r io.Reader, capacity int64) (b []byte, err error) {
     1     41   18:     buf := bytes.NewBuffer(make([]byte, 0, capacity))
     .      .   19:     // If the buffer overflows, we will get bytes.ErrTooLarge.
     .      .   20:     // Return that as an error. Any other panic remains.
     .      .   21:     defer func() {
     .      .   22:         e := recover()
     .      .   23:         if e == nil {
     .      .   24:             return
     .      .   25:         }
     .      .   26:         if panicErr, ok := e.(error); ok && panicErr == bytes.ErrTooLarge {
     .      .   27:             err = panicErr
     .      .   28:         } else {
     .      .   29:             panic(e)
     .      .   30:         }
     1      2   31:     }()
     1    527   32:     _, err = buf.ReadFrom(r)
     .      1   33:     return buf.Bytes(), err
     .      2   34: }
---
     .      .   35: 
     .      .   36: // ReadAll reads from r until an error or EOF and returns the data it read.
     .      .   37: // A successful call returns err == nil, not err == EOF. Because ReadAll is
     .      .   38: // defined to read from src until EOF, it does not treat an EOF from Read
     .      .   39: // as an error to be reported.
(pprof) list bytes.*ReadFrom
Total: 1569 samples
ROUTINE ====================== bytes.(*Buffer).ReadFrom in /usr/lib/go/src/pkg/bytes/buffer.go
     5    526 Total samples (flat / cumulative)
     .      .  145: 
     .      .  146: // ReadFrom reads data from r until EOF and appends it to the buffer, growing
     .      .  147: // the buffer as needed. The return value n is the number of bytes read. Any
     .      .  148: // error except io.EOF encountered during the read is also returned. If the
     .      .  149: // buffer becomes too large, ReadFrom will panic with ErrTooLarge.
---
     .      .  150: func (b *Buffer) ReadFrom(r io.Reader) (n int64, err error) {
     .      .  151:     b.lastRead = opInvalid
     .      .  152:     // If buffer is empty, reset to recover space.
     .      .  153:     if b.off >= len(b.buf) {
     1      2  154:         b.Truncate(0)
     .      .  155:     }
     .      .  156:     for {
     .      .  157:         if free := cap(b.buf) - len(b.buf); free < MinRead {
     .      .  158:             // not enough space at end
     .      .  159:             newBuf := b.buf
     .      .  160:             if b.off+free < MinRead {
     .      .  161:                 // not enough space using beginning of buffer;
     .      .  162:                 // double buffer capacity
     2    379  163:                 newBuf = makeSlice(2*cap(b.buf) + MinRead)
     .      .  164:             }
     .     13  165:             copy(newBuf, b.buf[b.off:])
     .      .  166:             b.buf = newBuf[:len(b.buf)-b.off]
     .      .  167:             b.off = 0
     .      .  168:         }
     1    131  169:         m, e := r.Read(b.buf[len(b.buf):cap(b.buf)])
     .      .  170:         b.buf = b.buf[0 : len(b.buf)+m]
     .      .  171:         n += int64(m)
     1      1  172:         if e == io.EOF {
     .      .  173:             break
     .      .  174:         }
     .      .  175:         if e != nil {
     .      .  176:             return n, e
     .      .  177:         }
     .      .  178:     }
     .      .  179:     return n, nil // err is EOF, so return nil explicitly
---
     .      .  180: }
     .      .  181: 
     .      .  182: // makeSlice allocates a slice of size n. If the allocation fails, it panics
     .      .  183: // with ErrTooLarge.
     .      .  184: func makeSlice(n int) []byte {

ioutil.readAll 的 L:18 可以分析問題是 main.(*HttpWorker).send 在每回呼叫 ioutil.ReadAll 都會動態產生一塊 Buffer 若在讀取資料的時候遇到 buffer size 不足，就再 makeSlice 產生更大的 buffer（bytes.ReadFrom:163）

為了避免一直創建 buffer 做了以下修改

type HttpWorker struct {
        // ignored
    readBuf   *bytes.Buffer
}

func NewHttpWorker(...ignored...) *HttpWorker {
    buf := bytes.NewBuffer(make([]byte, 0, config.contentSize+bytes.MinRead))
    return &HttpWorker{...ignored..., buf}
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    defer h.readBuf.Reset()

    var err error
    contentSize, err = h.readBuf.ReadFrom(resp.Body)

    // ignored
}

執行結果

$ ./gb -c 100 -n 200000 -k http://localhost/10k.dat
(略)
Document Path:          /10k.dat
Document Length:        10240 bytes

Concurrency Level:      100
Time taken for tests:   11.70 seconds
Complete requests:      200000
Failed requests:        0
HTML transferred:       2048000000 bytes
Requests per second:    17094.59 [#/sec] (mean)
Time per request:       5.850 [ms] (mean)
Time per request:       0.058 [ms] (mean, across all concurrent requests)
HTML Transfer rate:     170945.87 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd] median  max
Total:        0         0   5.16    4   118

Percentage of the requests served within a certain time (ms)
 50%     4
 66%     5
 75%     6
 80%     7
 90%     9
 95%     10
 98%     12
 99%     14
 100%    118 (longest request)

NumGC:12 Pause:58.63ms Pause(Avg):40.91ms Overhead:3.96% Alloc:84.97M Sys:611.67M Alloc(Rate):77.23M/s Histogram:99.82ms 99.82ms 99.82ms

$ echo "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND";  top -d 1 -b | grep gb
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
27989 brandon   20   0  753m 176m 2844 S  57.9  1.1   0:00.59 gb
27989 brandon   20   0 1012m 234m 2852 S 116.8  1.5   0:01.78 gb
27989 brandon   20   0 1058m 280m 2852 S 107.6  1.8   0:02.87 gb
27989 brandon   20   0 1105m 327m 2852 S 107.7  2.1   0:03.96 gb
27989 brandon   20   0 1194m 346m 2852 S 116.9  2.2   0:05.14 gb
27989 brandon   20   0 1195m 347m 2852 S 107.0  2.2   0:06.22 gb
27989 brandon   20   0 1195m 347m 2852 S 103.9  2.2   0:07.27 gb
27989 brandon   20   0 1275m 348m 2852 S 116.6  2.2   0:08.45 gb
27989 brandon   20   0 1277m 350m 2852 S 111.7  2.2   0:09.58 gb
27989 brandon   20   0 1277m 350m 2852 S 110.8  2.2   0:10.70 gb
27989 brandon   20   0 1349m 351m 2852 S 113.6  2.2   0:11.85 gb
27989 brandon   20   0 1359m 353m 2852 S 115.9  2.2   0:13.02 gb

$ go tool pprof ./gb cpu-27989.pprof 
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 967 samples
      61   6.3%   6.3%       61   6.3% runtime.xchg
      44   4.6%  10.9%       44   4.6% runtime.memmove
      34   3.5%  14.4%       34   3.5% net/textproto.canonicalMIMEHeaderKey
      29   3.0%  17.4%       31   3.2% runtime.settype_flush
      27   2.8%  20.2%      227  23.5% runtime.mallocgc
      26   2.7%  22.9%       26   2.7% runtime.casp
      25   2.6%  25.4%       87   9.0% runtime.MCache_Alloc
      25   2.6%  28.0%       46   4.8% runtime.markallocated
      23   2.4%  30.4%       23   2.4% bytes.IndexByte
      23   2.4%  32.8%       92   9.5% selectgo

跟改變前相比 Alloc(Rate):346.57M/s 降到 Alloc(Rate):77.23M/s CPU 利用率也略有下降，相關 gc 停頓數值下降，throughput 也有較大的提升, 原先 top10 裡佔據 11.7% scanblock 也消失了

UPDATE

更省記憶體的改良辦法

const MaxBufferSize    = 8192

type HTTPWorker struct {
    //ignored
    discard   io.ReaderFrom
}

func NewHttpWorker(...ignored...) *HttpWorker {
    var buf []byte
    contentSize := context.GetInt(FieldContentSize)
    if contentSize < MaxBufferSize {
        buf = make([]byte, contentSize)
    } else {
        buf = make([]byte, MaxBufferSize)
    }

    return &HTTPWorker{
        //ignored
        &Discard{buf},
    }
}

func (h *HttpWorker) send(request *http.Request, executionResult chan<- *Record) {
    // ignored

    resp, _ := h.client.Do(request)

    defer resp.Body.Close()
    contentSize, err = h.discard.ReadFrom(resp.Body)

    // ignored
}

type Discard struct {
    blackHole []byte
}

func (d *Discard) ReadFrom(r io.Reader) (n int64, err error) {
    readSize := 0
    for {
        readSize, err = r.Read(d.blackHole)
        n += int64(readSize)
        if err != nil {
            if err == io.EOF {
                return n, nil
            }
            return
        }
    }
}

附件 profiling 工具

Apr 3, 2013

ab 漸進壓力測試工具

最近在學Go，也用Go寫一個 ab-like 的玩具(Go-HttpBench)，為了要比較兩者效能另外寫了 bash script 用來轉換 ab 報表為行列數據，功能跟 autobench 相似，用階段性漸增的壓力來測試 http server，搜集到的數據可再用 gnuplot 或用 excel 繪制圖型

Example:


brandon@brandon-laptop:~/workspaces/go/gb$ ./ab+.sh -k http://localhost/
Start benchmark
#/usr/bin/ab -c CONCURRENCY=(100 to 10000 step 100) -n NUM=(100*CONCURRENCY) -k http://localhost/
#start time,concurrency,complete requests,failed requests,tps,min,mean,stddev,median,max,90%,95%,99%
Wed Apr  3 11:52:29 CST 2013,100,10000,0,31272.48,0,3,5.4,0,32,11,16,22
Wed Apr  3 11:52:29 CST 2013,200,20000,0,32050.72,0,6,14.2,0,113,23,40,64
Wed Apr  3 11:52:30 CST 2013,300,30000,0,33527.94,0,5,11.6,0,207,16,28,53
Wed Apr  3 11:52:31 CST 2013,400,40000,0,32396.87,0,12,25.0,0,189,48,70,105
Wed Apr  3 11:52:32 CST 2013,500,50000,0,32492.75,0,15,35.9,0,391,51,109,166
Wed Apr  3 11:52:34 CST 2013,600,60000,0,33188.85,0,18,38.8,0,448,73,110,166
Wed Apr  3 11:52:36 CST 2013,700,70000,0,33436.38,0,21,46.9,0,483,71,134,221
Wed Apr  3 11:52:38 CST 2013,800,80000,0,33370.64,0,24,54.5,0,1015,74,145,264
Wed Apr  3 11:52:40 CST 2013,900,90000,0,34093.27,0,26,66.4,0,1150,78,133,260
Wed Apr  3 11:52:43 CST 2013,1000,100000,0,34448.17,0,27,68.4,2,1251,85,136,266
Wed Apr  3 11:52:46 CST 2013,1100,110000,0,34072.87,0,32,78.1,0,1129,101,170,368
Wed Apr  3 11:52:49 CST 2013,1200,120000,0,33332.79,0,36,86.2,0,1032,144,228,386
Wed Apr  3 11:52:53 CST 2013,1300,130000,0,34580.66,0,37,96.0,5,3349,116,201,346
Wed Apr  3 11:52:57 CST 2013,1400,140000,0,33776.24,0,41,96.3,0,689,153,262,475
Wed Apr  3 11:53:01 CST 2013,1500,150000,0,31418.26,0,47,114.7,0,1031,164,295,533
Wed Apr  3 11:53:06 CST 2013,1600,160000,0,33285.65,0,48,111.8,0,629,217,340,492
Wed Apr  3 11:53:11 CST 2013,1700,170000,0,33441.14,0,50,121.5,0,825,211,356,559
Wed Apr  3 11:53:16 CST 2013,1800,180000,0,31452.71,0,52,166.0,0,4253,163,294,496
Wed Apr  3 11:53:22 CST 2013,1900,190000,0,33781.89,0,56,139.8,0,1018,234,425,627
Wed Apr  3 11:53:28 CST 2013,2000,200000,0,33766.12,0,59,137.8,1,1335,265,398,624
...

ab+.sh 使用方式跟 ab 一樣，除了 -c 跟 -n 是由 script 動態來調整外，其它 ab 參數可以從命令列尾端代入，另外測試前要修改壓力參數來給定測試壓力範圍

LOW_CONCURRENCY 初始的 concurreny 數
HIGH_CONCURRENCY 最大的 concurreny 數
RATE_CONCURRENCY 每回測試 concurreny 的累加值
NUM_CALL 單個 concurrent user 要發送的 request 次數

執行結果會顯示在畫面上並輸出至當前目錄下的 csv 檔 e.g. localhost_20130403_1152.csv

ab+.sh

#!/bin/bash
# ab+
#
# os and network tuning
# ulimit -n 100000
# sudo echo "2048 64512" > /proc/sys/net/ipv4/ip_local_port_range
# sudo echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
# sudo echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
# sudo echo "10" > /proc/sys/net/ipv4/tcp_fin_timeout
# sudo echo "65535" > /proc/sys/net/core/somaxconn
# sudo echo "65535" > /proc/sys/net/ipv4/tcp_max_syn_backlog  

AB_PATH=/usr/bin/ab
SS_PATH=/bin/ss

LOW_CONCURRENCY=100
HIGH_CONCURRENCY=1000
CONCURRENCY_STEP=100
NUM_CALL=100

HOST=$(echo "$*" | perl -ne '/http.*?:\/\/([^\/?:#]*)/; print $1')
PREFIX=$HOST$(date +"_%Y%m%d_%H%M")
OUTPUT_FILE=$PREFIX.csv
MAX_OPEN_FILES=$(ulimit -n)

trap "exit;" SIGINT SIGTERM

if [ "$MAX_OPEN_FILES" -le $HIGH_CONCURRENCY ]; then
  echo "Warning: open file limit < HIGH_CONCURRENCY" 
  exit 1
fi

echo "Starting benchmark"

mkdir -p temp/"$PREFIX"

echo "#$AB_PATH -c CONCURRENCY=($LOW_CONCURRENCY to $HIGH_CONCURRENCY step $CONCURRENCY_STEP) -n NUM=($NUM_CALL*CONCURRENCY) $*" | tee "$OUTPUT_FILE"
echo "#start time,concurrency,complete requests,failed requests,tps,min,mean,stddev,median,max,90%,95%,99%" | tee -a "$OUTPUT_FILE"
for (( concurrency=LOW_CONCURRENCY; concurrency<=HIGH_CONCURRENCY; concurrency+=CONCURRENCY_STEP ))
do
  tempFile="temp/$PREFIX/c$concurrency-n$((NUM_CALL*concurrency)).log"
  echo "$AB_PATH -c $concurrency -n $((NUM_CALL*concurrency)) $*" > "$tempFile"
  
  startTime=$(date)
  $AB_PATH -c $concurrency -n $((NUM_CALL*concurrency)) "$@" >> "$tempFile" 2>&1

  if [ $? -ne 0 ]; then
    echo "Error: please check $tempFile"
    exit 1
  fi

  stats=$(egrep "Complete requests:|Failed requests:|Requests per second:|Total:" "$tempFile" | perl -pe 's/[^\d\.]+/ /g; s/^\s+//g;' | perl -pe 's/\s+$//g; s/\s+/,/g')
  histogram=$(egrep "90%|95%|99%" "$tempFile" | cut -c6- | perl -pe 's/[^\d\.]+/ /g; s/^\s+//g;' | perl -pe 's/\s+$//g; s/\s+/,/g')

  echo "$startTime,$concurrency,$stats,$histogram" | tee -a "$OUTPUT_FILE"

  tcpConnections=$($SS_PATH -s | egrep 'TCP:' | perl -ne '/(\d+)/; print $1')
  while [ $((tcpConnections+concurrency+CONCURRENCY_STEP)) -ge "$MAX_OPEN_FILES" ]
  do
    sleep 1
    tcpConnections=$($SS_PATH -s | egrep 'TCP:' | perl -ne '/(\d+)/; print $1')
  done
done

rm -rf temp

echo "Finished benchmark, see $OUTPUT_FILE"

Jan 22, 2012

[Java Performance]Chapter 3:JVM Overview

這篇談JVM的概觀與各元件用途與描述HotSpot VM在各元件上的優化技術

HotSpot VM High Level Architecture

JVM 有三個主要元件：

VM Runtime
JIT Compiler
Memory manager

[Java Performance]Chapter 2:Operating System Performance Monitoring

這篇談到如何在作業系統層級搜集數據(CPU、Memory、Network I/O、Disk I/O)，從數據中分析可能的效能問題

名詞定義：

Performance monitoring：使用非侵入式的方法從作業系統或應用程式搜集、觀察數據，如使用作業系統提供的工具(e.g. vmstat)搜集效能數據，但不會影響應用程式的行為與效能，這方法適用於大部份的環境(production / development/ testing ...)使用
Performance profiling：使用侵入式方法搜集較特定目標數據，如某method的total cpu time/輸入參數，但可能會調整到應用程式行為或者降低效能，這方法比較適用於development/ testing 環境
Performance tuning：通常是 monitoring與profiling之後，發覺問題進而修改程式碼 / 設定參數，增進應用程式效能

[Java Performance]Chapter 1:Strategies, Approaches, and Methodologies

在軟體開發流程中該在何時評估效能？

傳統上瀑布式開發(Analysis > Design > Code > Test)會在測試階段進行效能測式，當效能結果不符合要求，調整代價可能很高，因此在擬定需求時就需要將效能指標定義出來，由需求驅動開發過程中各階段的效能測試

期望的throughput為多少？

期望latency為多少？

要能負載多少的 concurrent users 或者 concurrent tasks？

在最大數量的 concurrent users 或者 concurrent tasks負載下，可接受的latency 或者throughput為多少？

最糟糕狀況下可接受的latency為多少？

garbage collection 引入的延遲否容許？

假設某些use case可能造成高度的效能風險，那應該在Analysis階段之前製作prototype驗證效能的可行性

仙人掌世界

Apr 19, 2013

golang - Profiling Go Programs 記錄(Part 5)

結論

Gotcha

Standard Library

Runtime

Tool

Other Issues

TODO

Reference:

Apr 18, 2013

golang - Profiling Go Programs 記錄(Part 4)

4. 快取計算結果

golang - Profiling Go Programs 記錄(Part 3)

3. 正常結束 Goroutine

golang - Profiling Go Programs 記錄(Part 2)

2. 程式化停止 Timer

golang - Profiling Go Programs 記錄(Part 1)

1. 重複使用 Buffer

UPDATE

Apr 3, 2013

ab 漸進壓力測試工具

Jan 22, 2012

[Java Performance]Chapter 3:JVM Overview

Dec 20, 2011

[Java Performance]Chapter 2:Operating System Performance Monitoring

Dec 15, 2011

[Java Performance]Chapter 1:Strategies, Approaches, and Methodologies