[asterisk-bugs] [JIRA] (ASTERISK-25275) A11 SIGSEGV from pjnpath check_cached_response (ast_rtcp_read -> pj_stun_session_on_rx_pkt)

Dade Brandon (JIRA) noreply at issues.asterisk.org
Fri Jan 8 12:47:33 CST 2016


    [ https://issues.asterisk.org/jira/browse/ASTERISK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=228728#comment-228728 ] 

Dade Brandon edited comment on ASTERISK-25275 at 1/8/16 12:46 PM:
------------------------------------------------------------------

You can work around this 100% in Asterisk-11 by putting "cache_res=PJ_FALSE;" in res/pjproject/pjnath/src/pjnath/stun_session.c at the top of the function pj_stun_session_send_message (i.e. right before the PJ_ASSERT_RETURN macro call, but after the declaration "pj_status_t status;").

That prevents the corruption of cached_response_list, because nothing is ever added to that list structure.  The corruption occurs because the timer that is created to remove items from the cache doesn't lock.  This will likely also solve all of your deadlocks: sometimes the corruption causes a null pointer dereference, and other times check_cached_response gets stuck in its loop, because "t" will never be == &sess->cached_response_list.
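
To illustrate the two failure modes described above, here is a minimal, self-contained sketch of a sentinel-terminated circular list. It is not pjlib or pjnath code: struct node, list_init, list_push_back and walk_bounded are hypothetical names made up for this example, and the step limit exists only so the demonstration can report a hang instead of actually hanging.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an intrusive circular list node (not pjlib's type). */
struct node {
    struct node *prev;
    struct node *next;
    int id;
};

/* An empty circular list is a head (sentinel) that points to itself. */
static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
    head->id = -1;
}

/* Link n in just before the head, i.e. at the tail of the list. */
static void list_push_back(struct node *head, struct node *n)
{
    assert(head != NULL && n != NULL);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk the list with the usual "stop when t is the head again" condition,
 * but give up after max_steps entries.
 * Returns the number of entries seen on an intact list,
 * -1 if a NULL link is reached (where an unguarded loop would segfault),
 * -2 if the walk does not get back to the head (where it would spin forever).
 */
static int walk_bounded(const struct node *head, int max_steps)
{
    int count = 0;
    const struct node *t = head->next;

    while (t != head) {
        if (t == NULL)
            return -1;
        if (count >= max_steps)
            return -2;
        count++;
        t = t->next;
    }
    return count;
}
```

With two entries a and b pushed onto an intact list, walk_bounded reports 2. If b's next link is overwritten to point back at a, the walk never reaches the head again and reports -2; if a's next link is then set to NULL, it reports -1. If nothing is ever added to the list, as with the workaround, the head always points to itself and the loop body is never entered.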

Asterisk 11 includes an older version of pjproject.  -Newer pjproject appears to have solved this via a new group lock mechanism & an added lock in the timer that evicts from the cached_response_list.  I've found it difficult to backport the group lock mechanism, although one of my co-workers has been working on it, and will probably be able to contribute that work back to the 11 trunk after the holidays- _(text deleted because of reports that newer pjproject is affected by the same bug and is resolved by the same patch, so the list corruption must be occurring for a reason other than the unlocked timer I mentioned)_.

I've reviewed the use of the cached_response_list, and it appears to be trivial, other than for protocol compliance.  Keep in mind that if you update pjproject in Asterisk-11, it has to be done in a manner which statically links the libs, the same way as Asterisk-11 already does.  That makes it pretty difficult to do.  I did it as a test before coming up with the cache_res=PJ_FALSE workaround, and had difficulty because even after fixing the non-backwards-compatible function calls in res_rtp_asterisk.c, we had new crashes (we didn't bother to debug them, since that implied an unknown amount of non-backwards-compatible changes to how Asterisk-11 would need to work with pjproject).  I am just saying this because I'm guessing there's at least a chance that you'd installed the updated pjproject as a shared lib, and Asterisk-11 was not using it.


was (Author: dade):
You can work around this 100% in Asterisk-11 by putting "cache_res=PJ_FALSE;" in res/pjproject/pjnath/src/pjnath/stun_session.c at the top of the function pj_stun_session_send_message (i.e. right before the PJ_ASSERT_RETURN macro call, but after the declaration "pj_status_t status;").

That prevents the corruption of cached_response_list, because nothing is ever added to that list structure.  The corruption occurs because the timer that is created to remove items from the cache doesn't lock.  This will likely also solve all of your deadlocks: sometimes the corruption causes a null pointer dereference, and other times check_cached_response gets stuck in its loop, because "t" will never be == &sess->cached_response_list.

Asterisk 11 includes an older version of pjproject.  Newer pjproject appears to have solved this via a new group lock mechanism & an added lock in the timer that evicts from the cached_response_list.  I've found it difficult to backport the group lock mechanism, although one of my co-workers has been working on it, and will probably be able to contribute that work back to the 11 trunk after the holidays.  I've reviewed the use of the cached_response_list, and it appears to be trivial, other than for protocol compliance.  Keep in mind that if you update pjproject in Asterisk-11, it has to be done in a manner which statically links the libs, the same way as Asterisk-11 already does.  That makes it pretty difficult to do.  I did it as a test before coming up with the cache_res=PJ_FALSE workaround, and had difficulty because even after fixing the non-backwards-compatible function calls in res_rtp_asterisk.c, we had new crashes (we didn't bother to debug them, since that implied an unknown amount of non-backwards-compatible changes to how Asterisk-11 would need to work with pjproject).  I am just saying this because I'm guessing there's at least a chance that you'd installed the updated pjproject as a shared lib, and Asterisk-11 was not using it.

> A11 SIGSEGV from pjnpath check_cached_response (ast_rtcp_read -> pj_stun_session_on_rx_pkt)
> -------------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-25275
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25275
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>    Affects Versions: 11.18.0
>         Environment: Ubuntu 14.04.2; Linux 3.13.0-24-generic SMP; Intel E3-1231 
> Openssl 1.0.1f-1ubuntu2.15 (Jun 11 2015; most recent available) 
> libsrtp0 / libsrtp0-dev 1.4.5~20130609~dfsg-1
>            Reporter: Dade Brandon
>         Attachments: 2-1-phx-crash-jul23-510PST-backtrace.txt, 2-1-phx-crash-jul23-510PST-debuglog.p.txt.gz, 6-2-phx-crash_jul_22_1043am_backtrace.txt, 6-2-phx-crash_jul_22_1043am.p.txt.gz, atlas-backtrace-july22 2015.txt, backtrace.2015-12-23_1412.txt, commit log.txt, crash_asterisk_13.tar.gz, debug5.log, fenrir-debug-aug17.zip, fenrir-debug_more-aug17c.zip, fullbt-aug17c.txt
>
>
> This may be a duplicate of my other just-created issue, ASTERISK-25274; however, since the backtrace has a different signal point, I am following previous instruction to create separate issues.
> We have the patch from ASTERISK-25103 added to trunk 11 with a few custom patches (mostly just debug messages). The following crash occurs infrequently (1-5 times per week, usually batched together and on the same server(s); based on the pattern, I imagine that there is a remote factor in whether or not the crash occurs, such as a slow peer).
> A full backtrace and debug log will be attached shortly after this issue is created; here is a snippet of the top of the backtrace, for assistance in reviewing the issue:
> {noformat}
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 check_cached_response
> #1 pj_stun_session_on_rx_pkt ()
> #2 pj_ice_sess_on_rx_pkt ()
> #3 __rtp_recvfrom
>  (instance=0xvalidptr, buf=0xvalidptr, size=8192, flags=0, sa=validptr, rtcp=1)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


