仙人掌世界: [Java Performance] Chapter 2: Operating System Performance Monitoring

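This post covers how to collect data at the operating-system level (CPU, memory, network I/O, disk I/O) and how to analyze that data for possible performance problems.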

Definitions:

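Performance monitoring: collecting and observing data from the operating system or the application in a non-intrusive way, for example gathering performance data with tools the operating system provides (e.g. vmstat), without affecting the application's behavior or performance. This approach suits most environments (production, development, testing, ...).
Performance profiling: collecting data about a narrower target in an intrusive way, such as the total CPU time or the input arguments of a particular method. Because it may alter application behavior or reduce performance, it is better suited to development and testing environments.
Performance tuning: usually follows monitoring and profiling. Once a problem has been identified, the code or configuration parameters are changed to improve application performance.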

CPU Utilization
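Whether an application is making full use of the CPU can be determined by analyzing the operating system's CPU statistics. Operating systems generally split CPU utilization into user CPU utilization and system (kernel) CPU utilization. High system CPU utilization suggests possible contention for shared resources (e.g. lock contention) or heavy interaction with I/O devices; lowering it can improve application performance and scalability.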

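When the CPU cannot find the instruction or data it needs in a register or cache and has to wait for it to arrive from memory, the situation is called a stall, and it can cost several hundred CPU clock cycles. Reducing stalls improves CPU efficiency, for example by reducing context switches so that fewer CPU cache misses occur.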

Data collection tools:

Windows	Linux
Performance Monitor (GUI), typeperf	gnome-system-monitor (GUI), xosview (GUI), vmstat, mpstat, sar

typeperf

vmstat(us/sy)
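
To make the user/system split concrete, here is a minimal sketch that derives the two percentages from two samples of the aggregate "cpu" line in Linux's /proc/stat. The class and method names (CpuSplit, parseCpuLine, userSystemPercent) are made up for this illustration, and counting irq and softirq time as system time is a choice made here, so the numbers may not match a given tool exactly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class CpuSplit {

    /** Parses the aggregate "cpu ..." line of /proc/stat into its tick counters. */
    static long[] parseCpuLine(String line) {
        String[] fields = line.trim().split("\\s+");
        long[] ticks = new long[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            ticks[i - 1] = Long.parseLong(fields[i]);
        }
        return ticks;
    }

    /** Returns {user %, system %} for the interval between two samples. */
    static double[] userSystemPercent(long[] before, long[] after) {
        // Fields 0..7: user nice system idle iowait irq softirq steal.
        // Guest time (later fields) is already counted in user/nice, so it is skipped.
        int n = Math.min(8, Math.min(before.length, after.length));
        long[] delta = new long[8];
        long total = 0;
        for (int i = 0; i < n; i++) {
            delta[i] = after[i] - before[i];
            total += delta[i];
        }
        if (total <= 0) {
            return new double[] {0.0, 0.0};
        }
        long user = delta[0] + delta[1];               // user + nice
        long system = delta[2] + delta[5] + delta[6];  // system + irq + softirq (assumption)
        return new double[] {100.0 * user / total, 100.0 * system / total};
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // The first line of /proc/stat is the aggregate over all CPUs (Linux only).
        long[] before = parseCpuLine(Files.readAllLines(Paths.get("/proc/stat")).get(0));
        Thread.sleep(1000);
        long[] after = parseCpuLine(Files.readAllLines(Paths.get("/proc/stat")).get(0));
        double[] pct = userSystemPercent(before, after);
        System.out.printf("us=%.1f%% sy=%.1f%%%n", pct[0], pct[1]);
    }
}
```

A system percentage that stays high relative to the user percentage is the signal described above that is worth investigating.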

CPU Scheduler Run Queue
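A single CPU can work on only one task at a time. When no CPU is available, the operating system places new work on the CPU scheduler's run queue and takes it off the queue once a CPU becomes free. If too much work piles up (run queue depth / number of virtual processors > 4), response times gradually degrade. At that point the system needs attention or tuning, such as adding CPUs and spreading the work across them, or changing algorithms and data structures.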

Data collection tools:

Windows	Linux
Performance Monitor (GUI), typeperf	vmstat, mpstat, load average

typeperf

vmstat(r)

uptime (load average is another way to observe this)
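
The run queue rule of thumb can be sketched in a few lines. The class name RunQueueCheck and its methods are invented for this illustration. The main method uses the system load average as a rough stand-in for run queue depth, which is only an approximation: on Linux the load average also counts tasks in uninterruptible sleep, so a value read from vmstat's r column is the more direct input.

```java
import java.lang.management.ManagementFactory;

final class RunQueueCheck {

    // Threshold from the text: more than 4 queued tasks per virtual processor.
    static final double THRESHOLD_PER_PROCESSOR = 4.0;

    /** Run queue depth divided by the number of virtual processors. */
    static double depthPerProcessor(double runQueueDepth, int virtualProcessors) {
        if (virtualProcessors <= 0) {
            throw new IllegalArgumentException("virtualProcessors must be positive");
        }
        return runQueueDepth / virtualProcessors;
    }

    /** True when the ratio exceeds the threshold and the system deserves a closer look. */
    static boolean isOverloaded(double runQueueDepth, int virtualProcessors) {
        return depthPerProcessor(runQueueDepth, virtualProcessors) > THRESHOLD_PER_PROCESSOR;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Returns a negative value when the platform does not provide a load average.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        if (load < 0) {
            System.out.println("load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cpus=%d overloaded=%b%n",
                load, cpus, isOverloaded(load, cpus));
    }
}
```

A single reading over the threshold means little; the ratio matters when it stays high over an extended period.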

Memory Utilization
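When an application requests memory and the operating system decides that physical memory is insufficient, it uses disk (i.e. swap space backing virtual memory) in place of memory so that the application does not crash. Disk I/O, however, is orders of magnitude slower than physical memory (milliseconds vs. nanoseconds). The JVM's garbage collector scans the objects in the Java heap during a collection, so if those objects have been moved to disk the collection is held back by disk I/O and the GC's "stop the world" pause grows longer. This can be detected by watching swapping (page in / page out) activity.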

Data collection tools:

Windows	Linux
Performance Monitor (GUI), typeperf	vmstat, top

typeperf(5 second intervals)

vmstat(swpd/free/si/so)
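
As a rough illustration of watching for swap activity from code, the sketch below compares two samples of the pswpin and pswpout counters in Linux's /proc/vmstat. The class name SwapActivity and its methods are made up for this example, and it assumes a kernel that exposes those two counters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class SwapActivity {

    /** Returns the named counter from "name value" lines, or -1 if it is absent. */
    static long counter(List<String> lines, String name) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(name)) {
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** True if any pages were swapped in or out between the two samples. */
    static boolean swappedBetween(List<String> before, List<String> after) {
        long swappedIn = counter(after, "pswpin") - counter(before, "pswpin");
        long swappedOut = counter(after, "pswpout") - counter(before, "pswpout");
        return swappedIn > 0 || swappedOut > 0;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Both counters are cumulative since boot, so two samples are needed (Linux only).
        List<String> before = Files.readAllLines(Paths.get("/proc/vmstat"));
        Thread.sleep(5000);
        List<String> after = Files.readAllLines(Paths.get("/proc/vmstat"));
        System.out.println("swapping during interval: " + swappedBetween(before, after));
    }
}
```

Sustained swapping while a JVM is running is the case to worry about, since it is the condition under which garbage collection pauses stretch out.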

Monitoring Lock Contention
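Many Java applications fail to scale because of lock contention (e.g. a large number of threads queuing for a locked shared object), typically around synchronized methods or synchronized blocks. It can be detected by watching voluntary context switches, which are expensive (roughly 80,000 clock cycles each): if voluntary context switches consume 3% or more of the available CPU clock cycles, lock contention is likely.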

Available data collection tools:

Windows	Linux
Intel VTune, AMD CodeAnalyst	pidstat

pidstat(cswch:voluntary context switches)

Suppose the pidstat results were collected on a 3.0 GHz dual-core CPU and average about 3,500 voluntary context switches per second.
Each voluntary context switch costs about 80,000 clock cycles.
Voluntary context switch cost per virtual processor: 3,500 / 2 * 80,000 = 140,000,000 clock cycles per second.
Clock cycles available per virtual processor: 3,000,000,000 per second.
Average share of CPU clock cycles consumed: 140,000,000 / 3,000,000,000 = 4.7%.
4.7% > 3%, so lock contention is likely.
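
The same estimate can be written as a small helper so it can be reused with other measurements. The class name ContextSwitchCost and its parameter names are invented for this sketch; the 80,000 cycles per switch and the 3% threshold are the rule-of-thumb figures quoted above, not measured values.

```java
final class ContextSwitchCost {

    /**
     * Estimates the percentage of one virtual processor's clock cycles
     * spent on voluntary context switches.
     */
    static double percentOfCycles(double switchesPerSecond,
                                  int virtualProcessors,
                                  long cyclesPerSwitch,
                                  double clockHz) {
        // Spread the switches evenly over the virtual processors (a simplification).
        double cyclesLostPerProcessor = switchesPerSecond / virtualProcessors * cyclesPerSwitch;
        return 100.0 * cyclesLostPerProcessor / clockHz;
    }

    /** Applies the 3% rule of thumb from the text. */
    static boolean suggestsLockContention(double percent) {
        return percent >= 3.0;
    }

    public static void main(String[] args) {
        // Figures from the worked example: 3,500 switches/s, 2 cores, 80,000 cycles, 3.0 GHz.
        double percent = percentOfCycles(3500, 2, 80_000L, 3.0e9);
        System.out.println("percent of cycles: " + percent);
        System.out.println("possible lock contention: " + suggestsLockContention(percent));
    }
}
```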

Monitoring Involuntary Context Switches
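An involuntary context switch happens when a thread uses up its time slice or is preempted by a higher-priority thread. A high rate of involuntary context switches indicates that too many threads are waiting to run at the same time. The same condition also shows up as high run queue depth, high system CPU utilization, and a high number of migrations.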

(The data collection tools are the same as for voluntary context switches.)

Monitoring Thread Migrations
On multicore systems, ready-to-run threads can be migrated between cores, which lowers performance. Binding the Java process to a processor set avoids this; on Linux, taskset can be used to specify the processor set.
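
One way to apply the taskset binding is to prefix the JVM launch command with it. The sketch below only builds that command line. The class name PinnedLauncher, the CPU list "0,1", and app.jar are placeholders chosen for illustration; the -c option tells taskset to read its argument as a list of CPU numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PinnedLauncher {

    /** Prefixes a command with "taskset -c <cpuList>" so it runs only on those CPUs. */
    static List<String> tasksetCommand(String cpuList, List<String> command) {
        List<String> full = new ArrayList<>();
        full.add("taskset");
        full.add("-c");
        full.add(cpuList);
        full.addAll(command);
        return full;
    }

    public static void main(String[] args) {
        // "app.jar" is a hypothetical application; substitute the real launch command.
        List<String> cmd = tasksetCommand("0,1", Arrays.asList("java", "-jar", "app.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder to start the pinned process, or the printed line can be run directly from a shell on Linux.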

Network I/O Utilization
Limited network bandwidth or network I/O performance can cap an application's performance and scalability. Replacing firmware or hardware, or improving the network environment, can ease the limit.

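Reading and writing small amounts of data very frequently consumes a lot of system CPU (a high number of system calls). Using nonblocking Java NIO instead of blocking java.net.Socket, and reducing the number of reads and writes each thread performs through buffering, can improve application performance.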
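
The cost of many small writes can be illustrated without a real network. In the sketch below, CountingStream is a made-up stand-in for a socket's output stream that only counts how many write calls reach it; with a real socket each of those calls would typically become a system call. It demonstrates the buffering effect only and is not an NIO example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class WriteCallCounter {

    /** Stand-in for a socket stream: counts write calls instead of sending bytes. */
    static final class CountingStream extends OutputStream {
        int calls;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    /** Writes the given number of small messages and returns how many calls hit the sink. */
    static int callsFor(int messages, int messageSize, boolean buffered) throws IOException {
        CountingStream sink = new CountingStream();
        OutputStream out = buffered ? new BufferedOutputStream(sink, 8192) : sink;
        byte[] message = new byte[messageSize];
        for (int i = 0; i < messages; i++) {
            out.write(message);
        }
        out.flush(); // push whatever is still sitting in the buffer
        return sink.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + callsFor(1000, 16, false));
        System.out.println("buffered calls:   " + callsFor(1000, 16, true));
    }
}
```

Batching small messages into larger writes trades a little latency for far fewer calls into the kernel, which is the effect the paragraph above is after.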

Data collection tools: