由于找不到具体问题在哪,没办法去对应的社区提问,因此我就只能发这里了,希望有划水路过的大佬帮忙看看,能帮忙解决的话就当我请大佬吃个海底捞。
背景:go 语言写的 http 服务,程序偶然出现无响应( Stop The World 那种无响应),进而导致 Tcp Socket 句柄文件不释放,进而服务器文件句柄数疯涨。故障复现方法未知,可能一周左右出现一次,也可能一个月都不出现。线上 2 个机房,均有发生,发生时都是单机,没有大量出现,重启服务后恢复
PS:目前配置了 ping 接口探活+自动摘流重启。大概最近半年就出现了十次左右,都自动重启恢复了,但是本着探索的精神,还是想找到问题根源
环境:centos7.2+golang1.18.3
调试过程:
1 、监控发现服务异常,文件句柄异常,第一反应是代码问题,没有执行 close 函数导致。但是仔细一想,我 tm 这是 go 项目啊,又不是 c 语言项目,go 项目用的 net.http 包启动的 http 服务,又不是我手动管理的 tcp ,哪特么来的 close 啊。
2 、怀疑是项目其他地方有什么 api 调用,没有关闭连接,排查了很久,并没有发现问题。
3 、发现程序出问题这段时间,连日志都不输出了,程序 debug 接口也挂了(和业务接口不是同一个端口,debug 接口本来用于监控程序 gc 、协程数量等信息的)。开始怀疑不是 close 的问题。
4 、在程序主逻辑启动前,加了一段测试代码,代码如下
go func() {
for {
time.Sleep(1 * time.Minute)
fmt.Println("Hello World!")
}
}()
讲道理,这应该不受任何逻辑影响,始终都该输出吧。加上这段代码,重新发布项目,等了半个月,终于又出问题了。上服务器一看,出问题的时候,这个 HelloWorld 也不输出了。
5 、由于程序完全没响应,任何逻辑都不执行,因此从线上把故障机器摘流,然后使用 dlv 注入进程进行调试。结果如下:
( 1 )、协程总数异常,此时有 5w+个协程,正常情况应该不超过 500 个协程
( 2 )、大量协程被死锁,都是卡在同一个地方
Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]
( 3 )、查看死锁协程的调用栈,结果与业务代码无关
0 0x000000000043a556 in runtime.gopark
at /usr/local/go1.18.3/src/runtime/proc.go:362
1 0x000000000044b0b3 in runtime.goparkunlock
at /usr/local/go1.18.3/src/runtime/proc.go:367
2 0x000000000044b0b3 in runtime.semacquire1
at /usr/local/go1.18.3/src/runtime/sema.go:144
3 0x00000000004665a5 in sync.runtime_SemacquireMutex
at /usr/local/go1.18.3/src/runtime/sema.go:71
4 0x0000000000473785 in sync.(*Mutex).lockSlow
at /usr/local/go1.18.3/src/sync/mutex.go:162
5 0x00000000004740bf in sync.(*Mutex).Lock
at /usr/local/go1.18.3/src/sync/mutex.go:81
6 0x00000000004740bf in sync.(*Pool).pinSlow
at /usr/local/go1.18.3/src/sync/pool.go:213
7 0x000000000047400e in sync.(*Pool).pin
at /usr/local/go1.18.3/src/sync/pool.go:206
8 0x0000000000473d45 in sync.(*Pool).Get
at /usr/local/go1.18.3/src/sync/pool.go:128
9 0x0000000000711e29 in net/http.newTextprotoReader
at /usr/local/go1.18.3/src/net/http/request.go:994
10 0x0000000000711fd9 in net/http.readRequest
at /usr/local/go1.18.3/src/net/http/request.go:1024
11 0x00000000007181ea in net/http.(*conn).readRequest
at /usr/local/go1.18.3/src/net/http/server.go:988
12 0x000000000071c58b in net/http.(*conn).serve
at /usr/local/go1.18.3/src/net/http/server.go:1891
13 0x0000000000721cce in net/http.(*Server).Serve.func3
at /usr/local/go1.18.3/src/net/http/server.go:3071
14 0x000000000046a641 in runtime.goexit
at /usr/local/go1.18.3/src/runtime/asm_amd64.s:1571
6 、到这来我已经不知道该怎么继续排查了,服务器网络 io 、磁盘 io 、cpu 、内存等待均无明显异常。
7 、找运维在网关上面录制了流量,故障前一分钟都没发现异常流量,应该不存在被攻击的问题
8 、哪位大佬能给分析下呢,顺便给小弟讲解讲解,为啥我那个 HelloWorld 日志也不输出了?
下面附上完整的 dlv grs 结果(部分包名、路径打码了):
(dlv) grs
Goroutine 1 - User: /home/xxxxxx/golibs/rpcserver/v2@v2.0.0/rpcserver/server.go:392 xxxx/golibs/rpcserver/v2/rpcserver.(*Server).waitHttp (0xd0c588) [select 452427h39m57.846100461s]
Goroutine 2 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [force gc (idle) 452385h49m20.252083363s]
Goroutine 3 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [finalizer wait 452427h39m57.869422736s]
Goroutine 4 - User: /home/xxxxxx/devops/statsd@v0.0.13/client_golang/statsdlib/push.go:39 xxxxxx/devops/statsd/client_golang/statsdlib.statsdPushLoop (0x87b17e) [chan receive]
Goroutine 9 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 10 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 15 - User: /home/xxxxxx/xxxxxx/gobiz/logger@v1.3.4/file.go:309 xxxxxx/gobiz/logger.(*fileLog).writeLoop (0x80a545) [select 452427h39m57.846215549s]
Goroutine 16 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
Goroutine 18 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC sweep wait]
Goroutine 19 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC scavenge wait]
Goroutine 27 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 28 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 29 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 30 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.86959434s]
Goroutine 31 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.869609575s]
Goroutine 34 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.683989625s]
Goroutine 35 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684006698s]
Goroutine 36 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684023352s]
Goroutine 37 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684037091s]
Goroutine 38 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684051841s]
Goroutine 39 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684066197s]
Goroutine 40 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684082165s]
Goroutine 41 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684095326s]
Goroutine 42 - User: /home/xxxxxx/xxxxxx/xxxx/golibs/metrics@v0.2.3/metrics.go:280 xxxxxx/xxxx/golibs/metrics.(*Sender).asyncSendLoop.func1 (0x880a5b) [select 452391h19m51.199497773s]
Goroutine 43 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68413238s]
Goroutine 45 - User: /usr/local/go1.18.3/src/runtime/time.go:194 time.Sleep (0x46742e) [sleep]
Goroutine 46 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.8518662s]
Goroutine 48 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.863062571s]
Goroutine 49 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.846560172s]
Goroutine 67 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 68 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 69 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
Goroutine 72 - User: /home/xxxxxx/github.com/getsentry/sentry-go@v0.10.0/transport.go:378 github.com/getsentry/sentry-go.(*HTTPTransport).worker (0x8010ec) [chan receive 452391h19m51.973327187s]
Goroutine 73 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
Goroutine 75 - User: /usr/local/go1.18.3/src/database/sql/sql.go:1226 database/sql.(*DB).connectionOpener (0xa3d38d) [select 452427h39m57.863183594s]
Goroutine 84 - User: /usr/local/go1.18.3/src/runtime/sigqueue.go:151 os/signal.signal_recv (0x466bef) (thread 1238)
Goroutine 98 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
Goroutine 99 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait 452427h39m57.846699823s]
Goroutine 100 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684348233s]
Goroutine 101 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
Goroutine 102 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [select 452427h39m57.846735781s]
* Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]
Goroutine 19168258 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684396903s]
Goroutine 19913951 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684409102s]
Goroutine 19958391 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684421539s]
Goroutine 20081541 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684433615s]
Goroutine 20116495 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684446142s]
Goroutine 20172239 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68446185s]
......(后面都是重复的,就省略了)
这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。
V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。
V2EX is a community of developers, designers and creative people.