[asterisk-bugs] [JIRA] (ASTERISK-25941) chan_pjsip: Crash on an immediate SIP final response
Mark Michelson (JIRA)
noreply at issues.asterisk.org
Tue May 17 17:49:56 CDT 2016
[ https://issues.asterisk.org/jira/browse/ASTERISK-25941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=230636#comment-230636 ]
Mark Michelson commented on ASTERISK-25941:
-------------------------------------------
I figure that an update is in order here. After investigating the logs some more, I made a realization about how the threads were interacting. Specifically, you'll typically see something like this when you receive a 486 response:
{noformat}
[May 16 17:59:04] DEBUG[2845] res_pjsip/pjsip_distributor.c: No dialog serializer for response (null). Using request transaction as basis
[May 16 17:59:04] DEBUG[2845] res_pjsip/pjsip_distributor.c: Found serializer pjsip/session/sip-standard-000000de on transaction tsx0x7faf0c282538
{noformat}
However, sometimes that second message is not there. Looking into things some more, I found that when that second message was missing, it was because the thread that sent the outgoing INVITE request is not the one that we expect it to be. The result is that the thread processing the 486 response can compete with another thread that is destroying the channel. This causes the crash.
Interestingly, looking at the timing of this, it appears that you will get the INVITE sent in an unexpected thread exactly every 20 seconds. It appears the reason for this is because of PJSIP's asynchronous DNS resolution. The INVITE does not actually get sent in the thread that we think because our attempt to send the INVITE results in an async DNS request. When the DNS result returns, a callback is called on a separate thread and the INVITE is sent there. My guess regarding the 20 second rule is that there is some place where DNS results are being cached on the system. Every 20 seconds, a new DNS query has to be made.
The reason I was unable to reproduce this myself is because I was sending my INVITE requests to a direct IP address instead of involving DNS. My goal now is to determine the proper set of steps to ensure that there are no competing threads in this situation. I hope to have a patch ready tomorrow at some point.
> chan_pjsip: Crash on an immediate SIP final response
> ----------------------------------------------------
>
> Key: ASTERISK-25941
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-25941
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Resources/res_ari, Resources/res_pjsip
> Affects Versions: 13.8.0, 13.8.1, 13.8.2, 13.9.0
> Reporter: Javier Riveros
> Attachments: backtrace.txt, full_after_patch_1.zip, full_debug, gdb_after_patch_1.txt, gdb.txt, kern.log, session_end_response.diff
>
>
> *Description*
> I start experience constant asterisk crashes on our pro asterisk boxes seems like when the box is a little busy (+10 calls) asterisk is more susceptible.
> So the issue seems to be for outbound/inbound calls, i see this more frequent in outbound calls. Doing a outbound call into the stasis app and get a fast busy signal causes race condition opening the door to crash occur.
> 1) outbound call endpoint -> asterisk -> trunk.
> 2) trunk (486 Busy Here) -> asterisk (crash).
> You will see a segfault like this :
> {noformat}
> 2016-04-19 18:41:30.783 | info | kernel[-] |[436624.469629] asterisk[3737]: segfault at c0 ip 00000000004ce35b sp 00007f00ee7446d0 error 4 in asterisk[400000+2bc000]
> {noformat}
> And a coredumps like this:
> {noformat}
> #0 0x00000000004ce35b in ast_channel_dialed_causes_add (chan=0x0, cause_code=0x7f96ba940860, datalen=109) at channel_internal_api.c:1448
> ao2_cause_code = 0x0
> #1 0x00000000004b7423 in ast_channel_hangupcause_hash_set (chan=0x0, cause_code=0x7f96ba940860, datalen=109) at channel.c:4356
> causevar = '\000' <repeats 16 times>, "@\t\224\272\226\177\000\000X\264\f\354\226\177\000\000p\a\224\272\226\177\000\000\236\341J\000\000\000\000\000\240\a\224\272\226\177\000\000h\316\f\354\226\177\000\000@\b\224\272\226\177\000\000\212\347J\000\000\000\000\000m\000\000\000\000\000\000\000`\b\224\272\226\177\000\000\260\a\224\272!\000\000\000h\316\f\354\226\177\000\000\004\000\000\000\000\000\000\000!", '\000' <repeats 23 times>, "m", '\000' <repeats 39 times>, "`\b\224\272\226\177", '\000' <repeats 59 times>...
> __PRETTY_FUNCTION__ = "ast_channel_hangupcause_hash_set"
> #2 0x00007f96c35293cf in chan_pjsip_incoming_response (session=0x7f96ec0cb458, rdata=<optimized out>) at chan_pjsip.c:2319
> status = {code = 486, reason = {ptr = <optimized out>, slen = 9}}
> cause_code = 0x7f96ba940860
> data_size = 109
> __PRETTY_FUNCTION__ = "chan_pjsip_incoming_response"
> #3 0x00007f96e33cf846 in handle_incoming_response (type=<optimized out>, response_priority=AST_SIP_SESSION_AFTER_MEDIA, rdata=0x7f97049443d8,
> session=0x7f96ec0cb458) at res_pjsip_session.c:2323
> supplement = 0x7f96ec0ccab0
> status = {code = <optimized out>, reason = {ptr = 0x7f9704945740 "Busy HereV\224\004\227\177", slen = 9}}
> #4 handle_incoming (session=session at entry=0x7f96ec0cb458, rdata=0x7f97049443d8, response_priority=response_priority at entry=AST_SIP_SESSION_AFTER_MEDIA,
> type=<optimized out>) at res_pjsip_session.c:2337
> No locals.
> #5 0x00007f96e33d15e8 in session_inv_on_tsx_state_changed (inv=0x7f96ec0cb218, tsx=0x7f9718010868, e=0x7f96ba940ad0) at res_pjsip_session.c:2512
> cb = 0x0
> session = 0x7f96ec0cb458
> tdata = 0x7f96ba940ad0
> __PRETTY_FUNCTION__ = "session_inv_on_tsx_state_changed"
> #6 0x00007f972448b2d4 in mod_inv_on_tsx_state (tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip-ua/sip_inv.c:699
> dlg = 0x7f9718048e08
> inv = 0x7f96ec0cb218
> #7 0x00007f97244d0fb0 in pjsip_dlg_on_tsx_state (dlg=0x7f9718048e08, tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip/sip_dialog.c:2013
> i = 2
> #8 0x00007f97244d17c0 in mod_ua_on_tsx_state (tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip/sip_ua_layer.c:178
> dlg = 0x7f9718048e08
> #9 0x00007f97244ca70c in tsx_set_state (tsx=0x7f9718010868, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG,
> event_src=0x7f97049443d8) at ../src/pjsip/sip_transaction.c:1213
> e = {prev = 0x7f9704004ef0, next = 0x7f97040050c8, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7f97049443d8}, tsx_state = {src = {
> rdata = 0x7f97049443d8, tdata = 0x7f97049443d8, timer = 0x7f97049443d8, status = 76825560, data = 0x7f97049443d8},
> tsx = 0x7f9718010868, prev_state = 3, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7f97049443d8}, tx_error = {tdata = 0x7f97049443d8,
> tsx = 0x7f9718010868}, rx_msg = {rdata = 0x7f97049443d8}, user = {user1 = 0x7f97049443d8, user2 = 0x7f9718010868, user3 = 0x300000003,
> user4 = 0x7f97049443d8}}}
> prev_state = PJSIP_TSX_STATE_PROCEEDING
> ....
> {noformat}
> *To reproduce*
> I was able to reproduce this easy i just do the following. with two asterisk boxes one acting as a carrier/trunk "Asterisk A" and another with my basic stasis app "Asterisk B".
> 1) on "Asterisk A" i just create a simple dialplan that response 486 Busy or another 4XX code, this simulating our carrier given a fast 486 busy signal.
> {code}
> exten => _[+][1][7]X.,1,Noop("test")
> same => n,Busy()
> same => n,Hangup()
> exten => _[+][1][8]X.,1,Noop("test")
> same => n,Dial(SIP/1002) ; this endpoint isn't register
> same => n,Hangup()
> {code}
> 2) on "Asterisk B" in other to add a little of load i just stress the box a little bit until load average were like 2 "normal busy load average" {{sudo stress -d 2 --hdd-bytes 512M}}
> 3) on "Asterisk B" once the the box is busy i just make couple of outbound calls and asterisk crash, "this simulate an outbound call to a carrier that response a fast busy signal"
> There is some external ways i'm workaround this but please asterisk don't let you crash in fast SIP ERROR codes :).
> Attached are the coredumps and debug logs anything else is needed let me know.
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list