----------------------------------------------GLOBAL STATUS----------------------------------------------------
Workerman version:3.5.31 PHP version:7.4.16
start time:2021-08-17 14:07:15 run 336 days 21 hours
load average: 0.02, 0, 0 event-loop:\Workerman\Events\Event
3 workers 7 processes
worker_name exit_status exit_count
Register 0 0
BusinessWorker 0 16
BusinessWorker 64000 5
Gateway 0 0
----------------------------------------------PROCESS STATUS---------------------------------------------------
pid memory listening worker_name connections send_fail timers total_request qps status
1957 6M text://172.24.171.109:1236 Register 6 0 0 7762771 0 [idle]
1962 17.77M websocket://0.0.0.0:9502 Gateway 30 913913 5 117133951 0 [idle]
1963 17.77M websocket://0.0.0.0:9502 Gateway 21 920443 5 117666307 0 [idle]
8701 N/A none BusinessWorker N/A N/A N/A N/A N/A [busy]
9574 10M none BusinessWorker 3 0 1 23086407 0 [idle]
9576 10M none BusinessWorker 3 0 1 23071418 0 [idle]
10215 10M none BusinessWorker 3 0 1 1024 0 [idle]
----------------------------------------------PROCESS STATUS---------------------------------------------------
Summary 70M - - 66 1834356 13 288721878 0 [Summary]
8701 N/A none BusinessWorker N/A N/A N/A N/A N/A [busy]
strace -ttp 8701
strace: Process 8701 attached
11:30:30.102970 restart_syscall(<... resuming interrupted restart_syscall ...>
) = 0
11:30:51.310705 poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
11:30:51.310797 poll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, 60000^Cstrace: Process 8701 detached
<detached ...>
The process trace stays blocked like the above, with no response at all.
Normally it should look like the following instead:
11:41:44.166569 recvfrom(12, "+OK\r\n", 8192, MSG_DONTWAIT, NULL, NULL) = 5
11:41:44.166655 poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
11:41:44.166744 sendto(12, "*3\r\n$4\r\nHGET\r\n$10\r\nlive_layer\r\n$"..., 45, MSG_DONTWAIT, NULL, 0) = 45
11:41:44.166830 poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
11:41:44.166908 poll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=12, revents=POLLIN}])
11:41:44.168704 recvfrom(12, "$257\r\n{\"msg_type\":\"layer_message"..., 8192, MSG_DONTWAIT, NULL, NULL) = 265
11:41:44.168806 poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
11:41:44.168887 sendto(12, "*3\r\n$4\r\nHGET\r\n$10\r\nlive_layer\r\n$"..., 45, MSG_DONTWAIT, NULL, 0) = 45
11:41:44.168989 poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
11:41:44.169069 poll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=12, revents=POLLIN}])
11:41:44.170843 recvfrom(12, "$254\r\n{\"msg_type\":\"layer_message"..., 8192, MSG_DONTWAIT, NULL, NULL) = 262
11:41:44.170977 sendto(11, "\0\0\1\271\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0j\1\0\0\0\0\0\0{\"ms"..., 441, 0, NULL, 0) = 441
11:41:44.171176 alarm(0) = 30
11:41:44.172994 recvfrom(10, "\0\0\0l\3\254\30\253m\7\320\254\30\253m\341N\0\0\0\27\1%\36\0\0\0=a:2:"..., 65535, 0, NULL, NULL) = 108
11:41:44.173124 alarm(30) = 0
11:41:44.173218 close(12) = 0
11:41:44.173300 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=362, ...}) = 0
11:41:44.173387 openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 12
11:41:44.173467 fstat(12, {st_mode=S_IFREG|0644, st_size=277, ...}) = 0
11:41:44.173544 read(12, "127.0.0.1\tlocalhost\n\n# The follo"..., 4096) = 277
11:41:44.173624 read(12, "", 4096) = 0
11:41:44.173699 close(12) = 0
11:41:44.173807 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 12
11:41:44.173889 connect(12, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 0
11:41:44.173991 poll([{fd=12, events=POLLOUT}], 1, 0) = 1 ([{fd=12, revents=POLLOUT}])
11:41:44.174072 sendto(12, ":\271\1\0\0\1\0\0\0\0\0\0\24r-bp1ob5zk2hcwfs58n"..., 61, MSG_NOSIGNAL, NULL, 0) = 61
11:41:44.174233 poll([{fd=12, events=POLLIN}], 1, 2000) = 1 ([{fd=12, revents=POLLIN}])
11:41:44.174327 ioctl(12, FIONREAD, [77]) = 0
Run lsof -np 8701 to see what resource fd 12 is.
The business impact was urgent, so I just restarted it.
This is likely the BusinessWorker process being stuck at
poll([{fd=12,
which left that BusinessWorker unable to receive messages from the Gateway for a long time. OK, next time I'll troubleshoot it following this manual: https://www.workerman.net/doc/gateway-worker/send-buffer-overflow.html
The problem has come up again.
Memory and CPU usage:
It looks like it's waiting for data back from 172.30.237.31:6379. Is that Redis self-hosted or Alibaba Cloud Redis?
Alibaba Cloud.
With Alibaba Cloud Redis, it's best to set up a timer that periodically sends a heartbeat to keep the connection alive, at an interval under 60 seconds, e.g. 55.
Alibaba Cloud Redis may clean up a Redis connection after it has been idle for about one minute, without sending a FIN packet to notify the peer, so the Redis extension has no way of knowing the connection is gone: it believes the connection is still alive, but no data ever arrives on it.
So adding a timer for Alibaba Cloud Redis means adding a scheduled task that, every 55 seconds, hits the connection and queries an arbitrary key, right?
Yes.
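A minimal sketch of that heartbeat, assuming a GatewayWorker-style Events class with phpredis; the Redis address below is the one from the trace and stands in for your own instance:

```php
<?php
use Workerman\Lib\Timer; // \Workerman\Timer in Workerman 4.x

class Events
{
    /** @var \Redis */
    protected static $redis;

    public static function onWorkerStart($worker)
    {
        self::$redis = new \Redis();
        self::$redis->connect('172.30.237.31', 6379);

        // Ping every 55 seconds, just under Alibaba Cloud's ~60s idle
        // cutoff, so the connection is never reaped silently.
        Timer::add(55, function () {
            try {
                self::$redis->ping();
            } catch (\Throwable $e) {
                // A silently-dropped connection surfaces here; rebuild it.
                self::$redis = new \Redis();
                self::$redis->connect('172.30.237.31', 6379);
            }
        });
    }
}
```

Any cheap command works as the heartbeat; PING is the conventional choice because it touches nothing but the connection itself.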
About a year or two ago, we hit a strange Redis connection problem on Alibaba Cloud: every ten-odd minutes, the Redis client library in our service reported a connection timeout to the Redis server. It took a lot of effort to discover that Alibaba Cloud drops long-idle TCP connections without sending a FIN or RST to either end. At the time our Redis server did not have the tcp_keepalive option enabled, so on the server side the connection lingered in the Linux conntrack table, while on the client side the connection pool, reusing the connection for get/set, found it broken and closed it, freeing the corresponding local port. When the Redis client later reused that local port to open a new connection to the server, the <client_ip, client_port, redis-server, 6379> four-tuple in the server-side conntrack table was still in ESTABLISHED state, so the client's TCP SYN packet was naturally dropped, and what the Redis client observed was a connection timeout.
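For the server side of the story above, a self-managed Redis can enable TCP keepalive so the kernel probes and eventually reaps such half-dead connections; the host and port below are placeholders for your own instance:

```shell
# Inspect and set TCP keepalive on a self-managed Redis (host/port are placeholders).
redis-cli -h 127.0.0.1 -p 6379 CONFIG GET tcp-keepalive
redis-cli -h 127.0.0.1 -p 6379 CONFIG SET tcp-keepalive 60
# To persist across restarts, put this in redis.conf:
#   tcp-keepalive 60
```

Note this only helps self-managed servers; with Alibaba Cloud's managed Redis you don't control the server config, which is why the client-side heartbeat timer is the recommended workaround here.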
https://zhuanlan.zhihu.com/p/52622856
Nice, we get free training out of the expert's hard-won lesson.
You little smarty 😆