[asterisk-bugs] [JIRA] (ASTERISK-25635) run_agi() while() loop loops indefinitely because of fgets() returns EAGAIN

Mon Jan 11 02:55:33 CST 2016

    [ https://issues.asterisk.org/jira/browse/ASTERISK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=228926#comment-228926 ] 

Oleksandr Natalenko commented on ASTERISK-25635:
------------------------------------------------

So, should we add

{code}
clearerr(readf);
{code}

at the end of while() block?

> run_agi() while() loop loops indefinitely because of fgets() returns EAGAIN
> ---------------------------------------------------------------------------
>
>                 Key: ASTERISK-25635
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25635
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_agi
>    Affects Versions: 13.6.0
>         Environment: CentOS 7.1.1503, KVM
>            Reporter: Oleksandr Natalenko
>
> We've migrated our corporate PBX setup from Asterisk 1.8.7.1 (hardware server) to 13.6.0 (KVM VM), and faced the following issue.
> We use 2 VM instances, one is for Asterisk itself, another is for PBX business logic. Asterisk uses FastAGI to call remote functions, PBX logic server is based on latest pystrix code with custom handlers.
> The issue is that occasionally (with no visible reason several times per several days) one or two Asterisk threads starts to consume 100% of CPU time. We've got the stacktrace of such a thread via gdb (in all cases stacktrace was the same pointing to the same source code line):
> {code}
> 0x00007f39ea02d9fd in read () from /lib64/libc.so.6
> #0  0x00007f39ea02d9fd in read () from /lib64/libc.so.6
> #1  0x00007f39e9fbe448 in __GI__IO_file_underflow () from /lib64/libc.so.6
> #2  0x00007f39e9fbf3be in __GI__IO_default_uflow () from /lib64/libc.so.6
> #3  0x00007f39e9fb3454 in __GI__IO_getline_info () from /lib64/libc.so.6
> #4  0x00007f39e9fb237f in fgets () from /lib64/libc.so.6
> #5  0x00007f3981cc53e9 in fgets (__stream=0x7f39a42d07c0, __n=2048, __s=0x7f392eba3da0 "") at /usr/include/bits/stdio2.h:263
> #6  run_agi (argv=<optimized out>, argc=<optimized out>, dead=<optimized out>, status=0x7f392eba3520, pid=<optimized out>, agi=0x7f392eba3540, request=<optimized out>, 
> chan=0x7f39902e9f98) at res_agi.c:3830
> #7  agi_exec_full (chan=<optimized out>, data=<optimized out>, enhanced=<optimized out>, dead=<optimized out>) at res_agi.c:4154
> #8  0x0000000000565159 in pbx_exec (c=c at entry=0x7f39902e9f98, app=app at entry=0x1fcfbf0, data=data at entry=0x7f392eba4b60 "agi://x.y.z.c:4573/some_script, 130469, , 2, ru") at 
> pbx.c:1720
> #9  0x000000000056fc29 in pbx_extension_helper (c=c at entry=0x7f39902e9f98, context=0x7f39902ea950 "preprocessing", exten=exten at entry=0x7f39902ea9a0 "38044xxxxxxx", 
> priority=priority at entry=24, label=label at entry=0x0, callerid=callerid at entry=0x7f39a426fd20 "38050xxxxxxx", action=action at entry=E_SPAWN, found=found at entry=0x7f392eba6be0, 
> combined_find_spawn=combined_find_spawn at entry=1, con=0x0) at pbx.c:4992
> #10 0x0000000000575d8b in ast_spawn_extension (combined_find_spawn=1, found=0x7f392eba6be0, callerid=0x7f39a426fd20 "38050xxxxxxx", priority=24, exten=0x7f39902ea9a0 "38044xxxxxxx", 
> context=<optimized out>, c=0x7f39902e9f98) at pbx.c:6162
> #11 __ast_pbx_run (c=c at entry=0x7f39902e9f98, args=args at entry=0x0) at pbx.c:6579
> #12 0x000000000057727b in pbx_thread (data=data at entry=0x7f39902e9f98) at pbx.c:6899
> #13 0x00000000005d598a in dummy_start (data=<optimized out>) at utils.c:1237
> #14 0x00007f39ead5bdf3 in start_thread () from /lib64/libpthread.so.0
> #15 0x00007f39ea03c1ad in clone () from /lib64/libc.so.6
> {code}
> The reason the thread is consuming 100% of CPU time is that while() loop stuck looping because of fgets() (res/res_agi.c:3830) constantly returns EAGAIN (we've checked that with strace). We've enabled AGI debug and got the following:
> {code}
> [Dec 16 19:50:01] VERBOSE[6767][C-00002c40] res_agi.c: AGI Rx << temp buffer  - errno Resource temporarily unavailable
> No \n received, checking again.
> {code}
> There is also another debug log snippet showing warnings regarding the call in question:
> {code}
> [Dec 16 19:01:44] VERBOSE[6767][C-00002c40] pbx.c: Executing [s at recall_www:28] AGI("IAX2/mtc-16527", "agi://x.y.z.c:4573/some_script, 050xxxxxxx") in new stack
> [Dec 16 19:01:46] WARNING[29968] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
> [Dec 16 19:01:47] WARNING[29964] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
> [Dec 16 19:01:49] WARNING[29970] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
> [Dec 16 19:02:25] WARNING[29965] chan_iax2.c: Max retries exceeded to host 194.50.85.126 on IAX2/mtc-16527 (type = 6, subclass = 11, ts=10009, seqno=2)
> {code}
> EAGAIN looping could be stopped if we restart PBX AGI server. After AGI server restart Asterisk log shows us that the call hangs up normally:
> {code}
> [Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: AGI Rx << temp buffer  - errno Resource temporarily unavailable
> ...here we restart AGI server
> [Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: <IAX2/mtc-16527>AGI Script agi://x.y.z.c:4573/some_script completed, returning 0
> [Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: <IAX2/mtc-16527>AGI Tx >> HANGUP
> [Dec 16 19:50:47] VERBOSE[6767][C-00002c40] chan_iax2.c: Hungup 'IAX2/mtc-16527'
> {code}
> We tried to find out whether there is similar bug report, and have found this one:
> ASTERISK-20061
> But the difference is that this bug report tells about AGI not working at all due to message format inconsistency (no new line). In our case everything works OK, but occasionally fails. Another difference is that each time looping occurs on different AGI script.
> Also, after looping indefinitely for some time (an hour or so) Asterisk stops processing calls at all.
> So, my questions are:
> * how could we debug this issue further?
> * is there any protection against misbehaving AGI scripts for Asterisk not to stuck in infinite loop because of malformed AGI server message (if that is the reason)?

--
This message was sent by Atlassian JIRA
(v6.2#6252)