Worker进程时不时busy,strace频繁出现rt_sigreturn

Wakada

问题描述

1、使用GatewayWorker版本是v3.0.13,PHP7.4;
2、目前有7台Gateway服务器,每台规格是32C32G,每台Gateway是96进程数,7台Worker服务器,规格是16C16G,Worker是32进程数;
3、高峰期时,所有Worker进程都出现busy,那会没有把Log截取下来,后面就直接升级机器配置;
4、升级后,status -d,偶尔会出现busy,大概4秒后就消失,strace -p 结果如下,这类busy是因为有终端开了 status -d导致吗?
截图

报错信息

14:50:34.311724 close(1022)             = 0 <0.000054>
14:50:34.311908 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <0.167065>
14:50:34.479180 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:34.479303 write(6, "\f", 1)       = 1 <0.000042>
14:50:34.479420 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000053>
14:50:34.479629 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000244>
14:50:35.480005 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:35.480083 write(6, "\f", 1)       = 1 <0.000035>
14:50:35.480205 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000017>
14:50:35.480328 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000339>
14:50:36.480777 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:36.480845 write(6, "\f", 1)       = 1 <0.000022>
14:50:36.480938 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000011>
14:50:36.481058 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000419>
14:50:37.481553 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:37.481605 write(6, "\f", 1)       = 1 <0.000006>
14:50:37.481647 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000004>
14:50:37.481753 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000506>
14:50:38.482340 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:38.482411 write(6, "\f", 1)       = 1 <0.000010>
14:50:38.482464 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000018>
14:50:38.482618 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000276>
14:50:39.482990 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:39.483063 write(6, "\f", 1)       = 1 <0.000011>
14:50:39.483136 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000005>
14:50:39.484465 munmap(0x7f5961600000, 2097152) = 0 <0.000033>
14:50:39.484553 munmap(0x7f5961400000, 2097152) = 0 <0.000014>
14:50:39.484586 munmap(0x7f5961200000, 2097152) = 0 <0.000013>
14:50:39.484627 munmap(0x7f5961000000, 2097152) = 0 <0.000044>
14:50:39.484740 munmap(0x7f5960e00000, 2097152) = 0 <0.000034>
14:50:39.484835 munmap(0x7f5960c00000, 2097152) = 0 <0.000057>
14:50:39.484986 munmap(0x7f5960a00000, 2097152) = 0 <0.000033>
14:50:39.485071 munmap(0x7f5960800000, 2097152) = 0 <0.000037>
14:50:39.485171 munmap(0x7f5960600000, 2097152) = 0 <0.000036>
14:50:39.485266 munmap(0x7f5960400000, 2097152) = 0 <0.000024>
14:50:39.485372 munmap(0x7f5960200000, 2097152) = 0 <0.000021>
14:50:39.485423 close(1023)             = 0 <0.000040>
14:50:39.485520 close(1024)             = 0 <0.000021>
14:50:39.513202 close(1352)             = 0 <0.000032>
14:50:39.513270 close(1353)             = 0 <0.000035>
14:50:39.513525 sendto(8, "G\0\0\0\26UPDATE `user` SET `logout_t"..., 75, MSG_DONTWAIT, NULL, 0) = 75 <0.000026>
14:50:39.513617 poll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, 86400000) = 1 ([{fd=8, revents=POLLIN}]) <0.000164>
14:50:39.513840 recvfrom(8, "\f\0\0\1\0l\1\0\0\0\0\2\0\0\0\0\27\0\0\2\3def\0\0\0\1?\0\f?"..., 32768, MSG_DONTWAIT, NULL, NULL) = 79 <0.000043>
14:50:39.513994 sendto(8, "\5\0\0\0\31k\1\0\0", 9, MSG_DONTWAIT, NULL, 0) = 9 <0.000019>
14:50:39.514105 sendto(8, "#\0\0\0\27l\1\0\0\0\1\0\0\0\0\1\375\0\375\0\n1746255039\7"..., 39, MSG_DONTWAIT, NULL, 0) = 39 <0.000048>
14:50:39.514248 poll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, 86400000) = 1 ([{fd=8, revents=POLLIN}]) <0.000888>
14:50:39.515209 recvfrom(8, "\7\0\0\1\0\1\0\2\0\0\0", 32768, MSG_DONTWAIT, NULL, NULL) = 11 <0.000023>
14:50:39.515335 alarm(0)                = 24 <0.000016>
14:50:39.515413 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}, {events=EPOLLIN, data={u32=231, u64=231}}, {events=EPOLLIN, data={u32=233, u64=233}}, {events=EPOLLIN, data={u32=332, u64=332}}, {events=EPOLLIN, data={u32=330, u64=330}}, {events=EPOLLIN, data={u32=331, u64=331}}, {events=EPOLLIN, data={u32=333, u64=333}}, {events=EPOLLIN, data={u32=334, u64=334}}, {events=EPOLLIN, data={u32=335, u64=335}}, {events=EPOLLIN, data={u32=337, u64=337}}, {events=EPOLLIN, data={u32=336, u64=336}}, {events=EPOLLIN, data={u32=338, u64=338}}, {events=EPOLLIN, data={u32=339, u64=339}}, {events=EPOLLIN, data={u32=340, u64=340}}, {events=EPOLLIN, data={u32=341, u64=341}}, {events=EPOLLIN, data={u32=342, u64=342}}, {events=EPOLLIN, data={u32=345, u64=345}}, {events=EPOLLIN, data={u32=343, u64=343}}, {events=EPOLLIN, data={u32=344, u64=344}}, {events=EPOLLIN, data={u32=346, u64=346}}, {events=EPOLLIN, data={u32=348, u64=348}}, {events=EPOLLIN, data={u32=347, u64=347}}, {events=EPOLLIN, data={u32=350, u64=350}}, {events=EPOLLIN, data={u32=349, u64=349}}, {events=EPOLLIN, data={u32=351, u64=351}}, {events=EPOLLIN, data={u32=352, u64=352}}, {events=EPOLLIN, data={u32=353, u64=353}}, {events=EPOLLIN, data={u32=355, u64=355}}, {events=EPOLLIN, data={u32=354, u64=354}}, {events=EPOLLIN, data={u32=356, u64=356}}, {events=EPOLLIN, data={u32=358, u64=358}}, {events=EPOLLIN, data={u32=357, u64=357}}, ...], 1024, 626) = 196 <0.000128>
14:50:39.515663 read(5, "\f\f\f\f\f\f", 1024) = 6 <0.000022>
14:50:39.515733 read(5, 0x7f597c37a340, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000030>
14:50:39.515830 recvfrom(231, "\0\0\0L\3\n\7\17\340u\311\n\7\t\305\213J\0\0\1\206\1\313\210\0\0\0\32a:1:"..., 65535, 0, NULL, NULL) = 152 <0.000018>
14:50:39.515950 alarm(30)               = 0 <0.000015>
14:50:39.516009 write(1, "[2025-05-03 14:50:39]{\"event\":\"w"..., 44) = 44 <0.000018>
176 4 1
4个回答

Wakada

后面停用status -d ,使用 strace -p -o > xx.log 的方式观察,记录停顿时间,同时查看 status,发现是busy,几次基本都是这个片段select卡住,请问这timeout是什么情况

15:30:15.776606 close(1022)             = 0 <0.000028>
15:30:15.776759 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.630117>
15:30:17.407135 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
15:30:17.407220 write(6, "\f", 1)       = 1 <0.000020>
15:30:17.407287 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000012>
15:30:17.407410 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = 0 (Timeout) <5.005080>
15:30:22.413499 close(1024)             = 0 <0.000059>
15:30:22.413645 close(1025)             = 0 <0.000044>
  • 暂无评论
Wakada

观察后,补充个现象:

目前是7台Gateway,每台96进程,每台Worker服务的单进程需要打开672连接,连上每台gateway的端口;
观察期间的strace(截取片段),发现有周期性的再次打开新的672个连接(对应strace日志的端口数量1353,数量接近),而后又close掉这些刚刚打开的新链接,这期间就会产生busy;

17:07:14.133450 fcntl(1352, F_GETFL)    = 0x2 (flags O_RDWR) <0.000013>
17:07:14.133493 fcntl(1352, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000023>
17:07:14.133581 connect(1352, {sa_family=AF_INET, sin_port=htons(30139), sin_addr=inet_addr("10.7.113.155")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000024>
17:07:14.133655 poll([{fd=1352, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 3000) = 1 ([{fd=1352, revents=POLLOUT}]) <0.000023>
17:07:14.133742 getsockopt(1352, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000024>
17:07:14.133813 fcntl(1352, F_SETFL, O_RDWR) = 0 <0.000021>
17:07:14.133898 sendto(1352, "\0\0\0#\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0077979"..., 35, 0, NULL, 0) = 35 <0.000071>
17:07:14.134023 socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 1353 <0.000030>
17:07:14.134090 fcntl(1353, F_GETFL)    = 0x2 (flags O_RDWR) <0.000020>
17:07:14.134155 fcntl(1353, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000017>
17:07:14.134202 connect(1353, {sa_family=AF_INET, sin_port=htons(30155), sin_addr=inet_addr("10.7.113.155")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000042>
17:07:14.134293 poll([{fd=1353, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 3000) = 1 ([{fd=1353, revents=POLLOUT}]) <0.000027>
17:07:14.134365 getsockopt(1353, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000014>
17:07:14.134443 fcntl(1353, F_SETFL, O_RDWR) = 0 <0.000019>

跳过..

17:07:19.695366 close(1347)             = 0 <0.000030>
17:07:19.695442 close(1348)             = 0 <0.000021>
17:07:19.695489 close(1349)             = 0 <0.000032>
17:07:19.695549 close(1350)             = 0 <0.000031>
17:07:19.695611 close(1351)             = 0 <0.000025>
17:07:19.695658 close(1352)             = 0 <0.000022>
17:07:19.695717 close(1353)             = 0 <0.000042>
  • 暂无评论
Wakada

问题大致可以定位到这,正常是681u,不清楚是内部机制还是什么问题,导致进程打开新的链接(超过定义的数量)
截图

  • 暂无评论

你们整个集群gateway进程太多了,导致集群内部出现大量内部通讯,影响了性能。
安装event扩展并优化linux内核后,每个gateway进程能轻松处理支持1-3万的连接。所以要预估好整个系统支持的并发量,设置合理的gateway进程数。比如10万连接,整个集群(注意是整个集群)的geteway进程设置为4-6个gateway进程就好了,多了反而影响性能。

我们有一个物联网系统,大概用了3台4核4G服务器支撑了10万的在线,整个系统负载还极低。

所以你们系统用7台32核跑Gateway进程太浪费了,每台96 Gateway进程太离谱了。
你们可以通过以下策略设置服务器数Gateway进程数,假设前期预期最高4万在线,就使用一台2核2G服务器设置2个Gateway进程数,以后每增加4万在线就增加一台2核2G 2gateway进程的gateway服务器。这样可以做到随时扩容,性能又好。总体原则,gateway进程够用就行,不要开多了。

Worker服务器和进程数看业务情况,一般配置成CPU*4即可,例如16核配置成64个进程。

  • Wakada 8天前

    追踪Worker的busy进程经常会发生:
    10:24:35.955020 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = 0 (Timeout)

    也是因为gateway太多导致么?

  • walkor 8天前

    有可能

  • walkor 8天前

    另外你的php版本也有问题,7.4.1,小版本号是1,这说明是php7.4的第一个版本。不要用小版本号太小的PHP版本,小版本号越低,PHP自身的bug越多。用成熟的版本,例如7.4.33

  • Wakada 6天前

    感谢大佬,调整gateway集群和进程数后正常了

  • 邬綵唔惪 6天前

    老大,一个gateway进程维护2万个链接,如果是消息交互比较频繁的应用的话,这种是直接升级服务器配置,按一核一个gateway,还是直接增加一台2核2G 2gateway进程的gateway服务器呢?

  • walkor 6天前

    理论上都可以。
    如果升级服务器配置需要重启gateway,那么数万个客户端一起重连可能会导致系统过载,所以动态加服务器好一些。

×
🔝