Go Standard Library net/http vs. fasthttp: A Scenario Analysis of Server-Side Performance

2022-06-15 14:02:08

1. Background

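After writing the classic "hello, world" program, Go beginners are often eager to try out Go's powerful standard library, for example by writing a fully functional web server in just a few lines of code, like the example below: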

// From https://tip.golang.org/pkg/net/http/#example_ListenAndServe
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    // helloHandler writes a fixed greeting for every request.
    helloHandler := func(w http.ResponseWriter, req *http.Request) {
        io.WriteString(w, "Hello, world!\n")
    }
    http.HandleFunc("/hello", helloHandler)
    // ListenAndServe blocks and only returns on error.
    log.Fatal(http.ListenAndServe(":8080", nil))
}

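Go's net/http package is a fairly balanced, general-purpose implementation that covers more than 90% of the scenarios most gophers encounter. It has the following advantages:

It is a standard library package, so no third-party dependency is required;
It complies well with the HTTP specification;
It delivers relatively high performance without any tuning;
It supports HTTP proxies;
It supports HTTPS;
It supports HTTP/2 seamlessly.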

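However, precisely because net/http is a "balanced" general-purpose implementation, its performance may fall short in domains with strict performance requirements, and it leaves little room for tuning. In such cases we turn to third-party HTTP server frameworks.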

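Among third-party HTTP server frameworks, fasthttp, a framework that lives up to its name, is one of the most frequently mentioned and adopted. The fasthttp project claims to be ten times faster than net/http (based on go test benchmark results).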

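fasthttp applies many performance-optimization best practices, especially around reusing in-memory objects: it makes heavy use of sync.Pool to reduce pressure on the Go GC.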
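To make the object-reuse idea concrete, here is a minimal sketch of the sync.Pool pattern using only the standard library. It is not taken from fasthttp's source; the names bufPool and renderGreeting are hypothetical and exist only for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so that each call
// does not have to allocate a fresh buffer (hypothetical name).
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// renderGreeting borrows a buffer from the pool, builds the response in
// it, and returns the buffer to the pool before returning.
func renderGreeting(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("Hello, ")
	buf.WriteString(name)
	buf.WriteString("!")
	return buf.String() // String copies the bytes, so reusing buf later is safe
}

func main() {
	fmt.Println(renderGreeting("Go"))
	fmt.Println(renderGreeting("fasthttp"))
}
```

Under sustained load most calls get a buffer that was used before, so fewer short-lived objects reach the garbage collector. The pool may still drop idle objects at any time, which is why New must always be able to create a fresh one.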

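So how much faster than net/http is fasthttp in a real environment? I happen to have two reasonably powerful servers at hand, so in this article we will see how the two actually perform in this real environment.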

2. Performance Testing

We implement two almost "zero-business-logic" programs under test, one with net/http and one with fasthttp:

nethttp:

// github.com/bigwhite/experiments/blob/master/http-benchmark/nethttp/main.go
package main

import (
    _ "expvar" // registers /debug/vars on http.DefaultServeMux
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/ on http.DefaultServeMux
    "runtime"
    "time"
)

func main() {
    // Log the number of goroutines once per second.
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()

    http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, Go!"))
    }))

    log.Fatal(http.ListenAndServe(":8080", nil))
}

fasthttp:

// github.com/bigwhite/experiments/blob/master/http-benchmark/fasthttp/main.go
package main

import (
    _ "expvar"
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"

    "github.com/valyala/fasthttp"
)

type HelloGoHandler struct {
}

func fastHTTPHandler(ctx *fasthttp.RequestCtx) {
    fmt.Fprintln(ctx, "Hello, Go!")
}

func main() {
    // Serve expvar and pprof on a separate port through net/http.
    go func() {
        log.Println(http.ListenAndServe(":6060", nil))
    }()
    // Log the number of goroutines once per second.
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()
    s := &fasthttp.Server{
        Handler: fastHTTPHandler,
    }
    // Report the error instead of silently exiting if the server fails.
    log.Fatal(s.ListenAndServe(":8081"))
}

For the client that puts load on the targets, we use hey, an HTTP load-testing tool. To make it easy to adjust the load level, we wrap hey in the following shell script (Linux only):

# github.com/bigwhite/experiments/blob/master/http-benchmark/client/http_client_load.sh
# Usage example: ./http_client_load.sh 3 10000 10 GET http://10.10.195.181:8080
echo "$0 task_num count_per_hey conn_per_hey method url"
task_num=$1
count_per_hey=$2
conn_per_hey=$3
method=$4
url=$5
start=$(date +%s%N)
# Start task_num hey processes in parallel, one per background job.
for ((i=1; i<=task_num; i++)); do {
    tm=$(date +%T.%N)
    echo "$tm: task $i start"
    hey -n "$count_per_hey" -c "$conn_per_hey" -m "$method" "$url" > "hey_$i.log"
    tm=$(date +%T.%N)
    echo "$tm: task $i done"
} & done
# Wait for all background tasks before computing the totals.
wait
end=$(date +%s%N)
count=$(( task_num * count_per_hey ))
runtime_ns=$(( end - start ))
runtime=$(echo "scale=2; $runtime_ns / 1000000000" | bc)
echo "runtime: $runtime"
speed=$(echo "scale=2; $count / $runtime" | bc)
echo "speed: $speed"

A sample run of the script looks like this:

bash http_client_load.sh 8 1000000 200 GET http://10.10.195.134:8080
http_client_load.sh task_num count_per_hey conn_per_hey method url
16:58:09.146948690: task 1 start
16:58:09.147235080: task 2 start
16:58:09.147290430: task 3 start
16:58:09.147740230: task 4 start
16:58:09.147896010: task 5 start
16:58:09.148314900: task 6 start
16:58:09.148446030: task 7 start
16:58:09.148930840: task 8 start
16:58:45.001080740: task 3 done
16:58:45.241903500: task 8 done
16:58:45.261501940: task 1 done
16:58:50.032383770: task 4 done
16:58:50.985076450: task 7 done
16:58:51.269099430: task 5 done
16:58:52.008164010: task 6 done
16:58:52.166402430: task 2 done
runtime: 43.02
speed: 185960.01

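Judging from the arguments, the script started 8 tasks in parallel (each task runs one hey process). Each task opened 200 concurrent connections to http://10.10.195.134:8080 and sent 1,000,000 HTTP GET requests over them.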
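The speed figure in the sample output can be cross-checked from the arguments and the printed runtime. The following sketch simply redoes the script's arithmetic in Go; the function name throughput is hypothetical, and the inputs are the values from the sample run above.

```go
package main

import "fmt"

// throughput returns requests per second for taskNum parallel hey
// processes that each send countPerHey requests, given the wall-clock
// runtime of the whole run in seconds.
func throughput(taskNum, countPerHey int, runtimeSec float64) float64 {
	return float64(taskNum*countPerHey) / runtimeSec
}

func main() {
	const taskNum, countPerHey, connPerHey = 8, 1000000, 200

	fmt.Println("total connections:", taskNum*connPerHey) // 8 * 200
	fmt.Println("total requests:", taskNum*countPerHey)   // 8 * 1,000,000
	fmt.Printf("requests/s: %.0f\n", throughput(taskNum, countPerHey, 43.02))
}
```

With 8 tasks the target sees 1,600 connections in total, and 8,000,000 requests in 43.02 seconds works out to roughly 186,000 requests per second, which agrees with the speed line printed by the script.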

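We use two servers, one hosting the target programs and one hosting the load-testing script:

Server hosting the target programs: 10.10.195.181 (physical machine, Intel x86-64 CPU, 40 logical CPUs, 128 GB RAM, CentOS 7.6)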

$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core) 

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.000
CPU max MHz:           2201.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

Server hosting the load tool: 10.10.195.133 (physical machine, Kunpeng arm64 CPU, 96 cores, 80 GB RAM, CentOS 7.9)

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (AltArch)

# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              49152K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

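I used dstat (dstat -tcdngym) to monitor resource usage on the host running the targets, CPU load in particular. I used expvarmon to monitor memstats; since there is no business logic, memory usage was very low. I used go tool pprof to view the ranking of resource consumption inside the target programs.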

The table below was compiled from multiple test runs:

Figure: test data

3. Brief Analysis of the Results

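Because of the specific scenario, the accuracy of the test tool and script, and the load-testing environment, the results above have their limitations, but they do reflect the real performance trend of the targets. Under the same load, fasthttp did not deliver 10 times the performance of net/http; in this particular scenario it did not even reach twice the performance of net/http. In the test cases where CPU usage on the target host was close to 70%, fasthttp outperformed net/http by only about 30% to 70%.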

So why did fasthttp fall short of expectations? To answer that, we need to look at how net/http and fasthttp each work. Let's start with a diagram of how net/http works:

Figure: how net/http works

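The principle behind the http package on the server side is simple: after accepting a connection (conn), it hands the conn to a worker goroutine, which stays alive until the conn's lifetime ends, that is, until the connection is closed.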
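The goroutine-per-connection model described above can be sketched with the net package alone. This is a simplified illustration, not net/http's actual code: it speaks a toy line-based echo protocol instead of HTTP, and the names serve, handleConn, and demo are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
)

// serve accepts connections and starts one goroutine per connection.
func serve(ln net.Listener) {
	for {
		c, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go handleConn(c) // this goroutine lives as long as the connection
	}
}

// handleConn echoes every line it reads and returns only when the peer
// closes the connection or a read fails.
func handleConn(c net.Conn) {
	defer c.Close()
	r := bufio.NewReader(c)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return
		}
		fmt.Fprintf(c, "echo: %s", line)
	}
}

// demo starts the server on a free local port, sends two lines over a
// single connection, and returns both replies.
func demo() ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()
	go serve(ln)

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer c.Close()

	r := bufio.NewReader(c)
	var replies []string
	for _, msg := range []string{"first", "second"} {
		if _, err := fmt.Fprintf(c, "%s\n", msg); err != nil {
			return nil, err
		}
		reply, err := r.ReadString('\n')
		if err != nil {
			return nil, err
		}
		replies = append(replies, reply)
	}
	return replies, nil
}

func main() {
	replies, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	for _, reply := range replies {
		fmt.Print(reply)
	}
}
```

Both requests travel over the same connection and are answered by the same goroutine, so the number of serving goroutines tracks the number of open connections rather than the number of requests.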

Below is a diagram of how fasthttp works:

Figure: how fasthttp works

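fasthttp, by contrast, is designed to reuse goroutines as much as possible instead of creating a new one every time. After the fasthttp Server accepts a conn, it tries to take a channel from the ready slice in the workerpool; each channel corresponds one-to-one to a worker goroutine. Once a channel is taken, the accepted conn is written to it, and the worker goroutine on the other end of the channel handles reads and writes on that conn. When it has finished with the conn, the worker goroutine does not exit; it puts its channel back into the workerpool's ready slice and waits to be picked again.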
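The mechanism just described can be sketched in a few dozen lines. This is a simplified model written for this article, not fasthttp's real workerPool: the conn type is a stand-in for a network connection, idle-worker cleanup and concurrency limits are omitted, and workersStarted is a hypothetical helper that counts how many worker goroutines were created.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for an accepted connection (hypothetical type).
type conn struct {
	closed <-chan struct{} // closed when the peer closes the connection
	done   chan<- struct{} // signalled once the worker is finished with it
}

// workerPool keeps the channels of idle workers in the ready slice.
type workerPool struct {
	mu      sync.Mutex
	ready   []chan conn // one channel per idle worker goroutine
	started int         // worker goroutines created so far
}

// getCh returns the channel of an idle worker and starts a new worker
// only when none is idle.
func (p *workerPool) getCh() chan conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.ready); n > 0 {
		ch := p.ready[n-1]
		p.ready = p.ready[:n-1]
		return ch
	}
	p.started++
	ch := make(chan conn, 1)
	go p.worker(ch)
	return ch
}

// worker serves one connection at a time. When the connection is closed,
// the worker does not exit; it puts its channel back into ready.
func (p *workerPool) worker(ch chan conn) {
	for c := range ch {
		<-c.closed // "serve" the connection until it is closed
		p.mu.Lock()
		p.ready = append(p.ready, ch)
		p.mu.Unlock()
		c.done <- struct{}{}
	}
}

// workersStarted serves n connections and reports how many worker
// goroutines were created. With keepAlive false, each connection is
// closed before the next one is accepted; with keepAlive true, all n
// connections stay open at the same time.
func workersStarted(n int, keepAlive bool) int {
	p := &workerPool{}
	done := make(chan struct{}, n)
	var open []chan struct{}
	for i := 0; i < n; i++ {
		closed := make(chan struct{})
		p.getCh() <- conn{closed: closed, done: done}
		if keepAlive {
			open = append(open, closed)
			continue
		}
		close(closed)
		<-done
	}
	for _, closed := range open {
		close(closed)
		<-done
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.started
}

func main() {
	fmt.Println("short connections:", workersStarted(5, false))
	fmt.Println("keep-alive connections:", workersStarted(5, true))
}
```

In this model five short connections served one after another need a single worker, while five connections that stay open at the same time need five workers, because a worker cannot return to the ready slice until its connection is closed.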

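fasthttp's goroutine reuse strategy is well-intentioned, but it has little effect in this test scenario. The results show that, under the same client concurrency and load, net/http and fasthttp use almost the same number of goroutines. This is caused by the test model: in our test, the hey in each task opens a fixed number of persistent (keep-alive) connections to the target and then sends "saturating" requests on each connection. A goroutine in the fasthttp workerpool that receives a conn can only be put back after communication on that conn ends, and the conn is not closed until the test finishes. This scenario therefore makes fasthttp "degenerate" into the net/http model and inherit its "defect": once the number of goroutines grows large, the overhead of Go runtime scheduling becomes non-negligible and can even exceed the share of resources spent on business processing. Below are fasthttp's CPU profiles with 200, 8000, and 16000 persistent connections: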

200 persistent connections:

(pprof) top -cum
Showing nodes accounting for 88.17s, 55.35% of 159.30s total
Dropped 150 nodes (cum <= 0.80s)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
     0.46s  0.29%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*Server).serveConn
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.04s 0.025%  0.31%     89.46s 56.16%  internal/poll.ignoringEINTRIO (inline)
    87.38s 54.85% 55.17%     89.27s 56.04%  syscall.Syscall
     0.12s 0.075% 55.24%     60.39s 37.91%  bufio.(*Writer).Flush
         0     0% 55.24%     60.22s 37.80%  net.(*conn).Write
     0.08s  0.05% 55.29%     60.21s 37.80%  net.(*netFD).Write
     0.09s 0.056% 55.35%     60.12s 37.74%  internal/poll.(*FD).Write
         0     0% 55.35%     59.86s 37.58%  syscall.Write (inline)
(pprof) 

8000 persistent connections:

(pprof) top -cum
Showing nodes accounting for 108.51s, 54.46% of 199.23s total
Dropped 204 nodes (cum <= 1s)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.69s  0.35%  0.35%    119.05s 59.76%  github.com/valyala/fasthttp.(*Server).serveConn
     0.04s  0.02%  0.37%    104.22s 52.31%  internal/poll.ignoringEINTRIO (inline)
   101.58s 50.99% 51.35%    103.95s 52.18%  syscall.Syscall
     0.10s  0.05% 51.40%     79.95s 40.13%  runtime.mcall
     0.06s  0.03% 51.43%     79.85s 40.08%  runtime.park_m
     0.23s  0.12% 51.55%     79.30s 39.80%  runtime.schedule
     5.67s  2.85% 54.39%     77.47s 38.88%  runtime.findrunnable
     0.14s  0.07% 54.46%     68.96s 34.61%  bufio.(*Writer).Flush

16000 persistent connections:

(pprof) top -cum
Showing nodes accounting for 239.60s, 87.07% of 275.17s total
Dropped 190 nodes (cum <= 1.38s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     0.04s 0.015% 0.015%    153.38s 55.74%  runtime.mcall
     0.01s 0.0036% 0.018%    153.34s 55.73%  runtime.park_m
     0.12s 0.044% 0.062%       153s 55.60%  runtime.schedule
     0.66s  0.24%   0.3%    152.66s 55.48%  runtime.findrunnable
     0.15s 0.055%  0.36%    127.53s 46.35%  runtime.netpoll
   127.04s 46.17% 46.52%    127.04s 46.17%  runtime.epollwait
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.41s  0.15% 46.67%    120.18s 43.67%  github.com/valyala/fasthttp.(*Server).serveConn
   111.17s 40.40% 87.07%    111.99s 40.70%  syscall.Syscall
(pprof)

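Comparing these profiles, we see that as the number of persistent connections grows (that is, as the number of goroutines in the workerpool grows), the share taken by Go runtime scheduling rises steadily. At 16000 connections, the runtime scheduling functions already occupy the top four places.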

4. Optimization Approaches

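From the results above, we see that fasthttp's model is not well suited to scenarios where connections, once established, carry sustained "saturating" requests. It is better suited to short-lived connections, or to persistent connections without sustained saturating requests; only in those scenarios can its goroutine reuse model show its strengths.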

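Yet even after "degenerating" into the net/http model, fasthttp still performs somewhat better than net/http. Why? The gain comes mainly from fasthttp's optimization tricks at the memory-allocation level, such as heavy use of sync.Pool and avoiding conversions between []byte and string.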
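One common way to avoid []byte and string conversions is to build output with append-style functions that work on a caller-supplied byte slice. The sketch below shows the general technique with the standard library; appendStatusLine is a hypothetical helper and is not part of fasthttp's API.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendStatusLine appends an HTTP/1.1 status line to dst and returns
// the extended slice. It works on []byte throughout, so no intermediate
// string is built along the way.
func appendStatusLine(dst []byte, code int, reason []byte) []byte {
	dst = append(dst, "HTTP/1.1 "...)
	dst = strconv.AppendInt(dst, int64(code), 10)
	dst = append(dst, ' ')
	dst = append(dst, reason...)
	return append(dst, "\r\n"...)
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, reused for every response
	reason := []byte("OK")
	for i := 0; i < 3; i++ {
		// buf[:0] keeps the backing array and only resets the length.
		buf = appendStatusLine(buf[:0], 200, reason)
		fmt.Printf("%q\n", buf)
	}
}
```

Because the same backing array is reused on every iteration, the loop body itself does not need a new buffer per response, which is the kind of saving that adds up at hundreds of thousands of requests per second.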

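So, under sustained "saturating" requests, how can the number of goroutines in the fasthttp workerpool be kept from growing linearly with the number of conns? fasthttp does not offer an official answer, but one path worth considering is to use the OS's I/O multiplexing (epoll on Linux), the same mechanism that the Go runtime netpoll uses. With multiplexing, each goroutine in the workerpool can handle multiple connections at the same time, so we can choose the workerpool size according to the scale of the business instead of letting the number of goroutines grow almost without bound as it does now. Of course, introducing epoll at the user level may also increase the share of system calls and raise response latency. Whether this path is feasible depends on the concrete implementation and on test results.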

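Note: Concurrency in fasthttp.Server can be used to limit the number of goroutines in the workerpool that process connections concurrently. However, since each goroutine handles only one connection, if Concurrency is set too low, subsequent connections may be refused by fasthttp. For this reason fasthttp's default Concurrency is: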

const DefaultConcurrency = 256 * 1024
