[asterisk-bugs] [JIRA] (ASTERISK-29535) Segmentation fault in libasteriskpj.so.2
Allan Rossi Lisboa (JIRA)
noreply at issues.asterisk.org
Thu Aug 19 06:57:34 CDT 2021
[ https://issues.asterisk.org/jira/browse/ASTERISK-29535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=255984#comment-255984 ]
Allan Rossi Lisboa commented on ASTERISK-29535:
-----------------------------------------------
We've been able to narrow down the cause of this problem a bit.
This is our PJSIP configuration:
[clickproxytrunk]
type=aor
contact=sip:<my hostname>:5060
[clickproxytrunk]
type=endpoint
context=acproxycontext
rtp_symmetric=yes
force_rport=yes
rewrite_contact=yes
disallow=all
allow=ulaw
aors=clickproxytrunk
[clickproxytrunk]
type=identify
endpoint=clickproxytrunk
match=<my hostname>
This is the flow of what we identified as relevant:
1) We start the call through ARI
2) Asterisk correctly receives the POST
3) ARI's WebSocket receives the event for the created channel
4) 30 seconds pass and the WebSocket receives the channel destroyed event with cause 18, AST_CAUSE_NO_USER_RESPONSE
4.1) Some of our calls have a 25s timeout (most have 60s). Those with the 25s timeout returned a different cause but led to the same problem.
5) PJSIP starts creating the INVITE request and crashes while reading a header. Although it consistently crashes at the same header, other data is already corrupted by that point. In the core dump we debugged, the enum that holds the SIP header type contained invalid values (very large integers), and strings were not properly terminated or had incorrect lengths, so they displayed as garbage.
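For illustration, step 1 of the flow above could look roughly like this from the client side. This is only a sketch, assuming the call is started with ARI's originate operation (POST /ari/channels) and its timeout parameter. The host, port, endpoint string and application name are hypothetical placeholders, not values from this report, and the request is built but never sent.

```python
import urllib.parse
import urllib.request


def build_originate_request(base_url, endpoint, app, timeout_s=30):
    """Build (but do not send) an ARI originate request.

    Every argument value is a placeholder chosen by the caller.
    """
    query = urllib.parse.urlencode({
        "endpoint": endpoint,  # channel to dial, e.g. over the PJSIP trunk
        "app": app,            # Stasis application name (hypothetical)
        "timeout": timeout_s,  # seconds before the originate gives up
    })
    url = f"{base_url}/ari/channels?{query}"
    return urllib.request.Request(url, method="POST")


# Hypothetical values; only the trunk name matches the configuration above.
req = build_originate_request(
    "http://localhost:8088", "PJSIP/1000@clickproxytrunk", "myapp", timeout_s=25
)
print(req.get_method(), req.full_url)
```

A real client would also need to authenticate and to listen on the ARI WebSocket for the channel events described in steps 3 and 4.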
Another thing we noticed in that last step was that the "msg" being processed contained the IP address for <my hostname> from the PJSIP configuration. Our hunch was that the DNS resolution was taking too long, and that PJSIP kept running even after ARI had already reported the channel as destroyed. We therefore changed the PJSIP configuration to use <my valid public IP> instead, and the problem, which had been happening about 5 times a day, went away.
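As a sketch of that change, assuming only the contact line of the aor section was edited (whether the match line of the identify section was changed as well is not stated here), the section would become:

```
[clickproxytrunk]
type=aor
contact=sip:<my valid public IP>:5060
```

<my valid public IP> is the same redacted placeholder used in the text, not a literal value.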
Before acting on that hunch, we checked the code path that the ARI request goes through and found this comment in app_dial.c: "XXX this code is highly suspicious, as it essentially overwrites the outgoing channel without properly deleting it."
That raised some eyebrows and gave us more confidence that the timing between these actions was the cause of the problem.
tl;dr: A channel created through ARI that PJSIP takes too long to process (for instance, when DNS resolution is slow) can have its data corrupted or deleted while PJSIP is still allowed to process it, potentially causing a segfault.
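To check whether slow DNS resolution is plausible in a given environment, lookups of the trunk hostname can be timed. The following is a minimal sketch, not a reproduction of the crash: it uses the system resolver through Python's socket.getaddrinfo, which is an assumption and may not behave like the resolver PJSIP uses. The name time_lookup and its resolver parameter are hypothetical helpers introduced for this example, and "localhost" stands in for the real hostname.

```python
import socket
import time


def time_lookup(hostname, port=5060, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, addresses) for one lookup of hostname."""
    start = time.monotonic()
    try:
        results = resolver(hostname, port, type=socket.SOCK_DGRAM)
    except socket.gaierror:
        # Resolution failed; report how long the failure took.
        results = []
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = sorted({info[4][0] for info in results})
    return elapsed, addresses


# Replace "localhost" with the hostname from the aor contact line.
for _ in range(3):
    elapsed, addresses = time_lookup("localhost")
    print(f"{elapsed * 1000:.1f} ms -> {addresses}")
```

Lookups that occasionally take seconds rather than milliseconds would be consistent with the timing hypothesis above, although they would not prove it.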
(Restricted to Public group)
> Segmentation fault in libasteriskpj.so.2
> ----------------------------------------
>
> Key: ASTERISK-29535
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-29535
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: pjproject/pjsip
> Affects Versions: 18.5.1
> Environment: Linux 34104asterisk 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Daniel Bonazzi
> Attachments: core.34104asterisk-2021-07-27T09-16-30-0400-brief.txt, core.34104asterisk-2021-07-27T09-16-30-0400-full.txt, core.34104asterisk-2021-07-27T09-16-30-0400-info.txt, core.34104asterisk-2021-07-27T09-16-30-0400-locks.txt, core.34104asterisk-2021-07-27T09-16-30-0400-thread1.txt, valgrind_2021-08-05_08_09_18, valgrind_2021-08-05_08_10_15, valgrind_2021-08-05_14_48_49, valgrind_2021-08-06_13_46_56, valgrind_2021-08-06_13_52_27
>
>
> I've been facing some segmentation faults on asterisk without any apparent reason.
> This is what shows on the system logs:
> {noformat}
> Jul 27 09:16:30 34104asterisk kernel: asterisk[6556]: segfault at 0 ip 00007fcc4298678f sp 00007fcacf8ab938 error 6 in libasteriskpj.so.2[7fcc42871000+168000]
> Jul 27 09:16:30 34104asterisk asterisk[223321]: /usr/sbin/safe_asterisk: line 171: 223349 Segmentation fault (core dumped) nice -n $PRIORITY "${ASTSBINDIR}/asterisk" -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
> {noformat}
> Checking the logs of our system that connects to asterisk via ARI and checking the asterisk logs we could see some cases where it crashed after some calls to the ARI API were made in a certain order like:
> {noformat}
> POST /ari/channels/1627495070.126/snoop?app=stasis-&spy=both&whisper=none'
> POST /ari/channels/1627495070.126/moh
> DELETE /ari/channels/1627495070.126
> POST /ari/channels/1627495070.126/moh
> {noformat}
> The crashes do not happen every time those requests reach asterisk out of order, but every time it crashed I could see this pattern.