[asterisk-bugs] [JIRA] (ASTERISK-26310) Crash occurs every 24 - 48 hours with backtrace log showing fault related to pjsip hash

Wed Sep 14 13:48:01 CDT 2016

    [ https://issues.asterisk.org/jira/browse/ASTERISK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=232284#comment-232284 ] 

Gaston Mendez commented on ASTERISK-26310:
------------------------------------------

George,

Thank you for finally updating the case. Can you let me know what you think is happening? Do you know where the failure is occurring? This is my first Asterisk case, and given how much documentation we provided and how frequently the server was crashing, I was hoping for more information. The update tells us almost nothing about how the Asterisk team sees the problem. It almost seems as though you don't really know, which is okay. Again, this is my first experience here.

We definitely couldn't wait 2 weeks for an update on this, so we have already moved off Asterisk 13 to Asterisk 14-beta2. The PJSIP DNS resolver code looks really bad in the backtraces, and that is where we had already determined the crash to be happening, so we'd prefer to bypass that resolver completely and run Asterisk 14 with unbound configured. We have only been on 14 since Friday, which is not long enough to say it's good to go. We will hold on 14 for at least 2 more weeks. If it doesn't crash again, I don't believe we'll have any reason to go back to 13. If it does crash again, we will try 13.11.2 as you recommended, but based on what we have seen so far, we think 14 has the real fix for us.

> Crash occurs every 24 - 48 hours with backtrace log showing fault related to pjsip hash
> ---------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-26310
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-26310
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: pjproject/pjsip
>    Affects Versions: 13.10.0, 13.11.0
>         Environment: Asterisk 13.10.0 running on fully updated CentOS 7 Linux, 64-bit. We also have a second backtrace showing the same ../src/pj/hash.c:181 in the (gdb) bt output from a second Asterisk server running Asterisk 13.11.0-rc1, so we think we are crashing the same way across the 2 latest versions of Asterisk 13.
>            Reporter: Gaston Mendez
>            Assignee: Gaston Mendez
>            Severity: Critical
>         Attachments: asterisk_full_08-29-2016-0924a.txt, asterisk_full_08302016_0620p.txt, asterisk_full_08302016_0624p.txt, backtrace_08302016_0620p.txt, backtrace_08302016_0624p.txt, backtrace13-10-0-on-08-29-2016-0930a.txt, backtrace13-10-0.txt, backtrace13-11-0.txt, full_log_13-10-0.txt, full_log_13-11-0.txt, modules.conf.txt, pjsip.conf.txt, rtp.conf.txt, udptl.conf.txt
>
>
> We are trying to put an Asterisk 13 server into production. This is also our first time using pjsip. Once our beta reached a load of 20 active calls, we began experiencing crashes that are unpredictable, with no visible error and no commonality between them. The crash is not load dependent: we have seen it happen at low points during the day with literally 1 - 2 active calls running. The only certainty, after 2 weeks of steady everyday load in beta, is that it will crash at least every 48 hours, and more often about every 24 hours. It crashes with no visible error or complaint in the Asterisk messages or full logs, which are very clean and quiet. The coredump cites line 181 of ../src/pj/hash.c, and the only known commonality between crashes is that we have at least 2 backtraces on 2 different servers citing this same line of code in the backtrace (gdb bt), like this:
> {noformat}
> #0  find_entry (lower=0, entry_buf=0x0, hval=0x7f52cc5412cc, val=0x0, keylen=258, key=0x7f52cc541310, ht=<optimized out>, pool=0x0) at ../src/pj/hash.c:181
> {noformat}
> {noformat}
> 181		if (entry->hash==hash && entry->keylen==keylen &&
> {noformat}
> It seems we are triggering some instability in pjsip/Asterisk. We are not doing anything outside the norm of what we've done on old versions of Asterisk. Asterisk logs no errors at any time, and other than this once-a-day crash, Asterisk 13 is running very clean and performing well with no other complaint at all. We have reason to believe this is an Asterisk/pjsip bug that we have triggered. There are no exact steps to trigger it. As long as there is at least 1 active call, it seems it can happen. It has happened about once every 24-48 hours over a span of 2 weeks, so the only way to 'reproduce' it is to wait 48 hours, as we have been doing. We have multiple backtraces and are attaching 2 that show the exact same source code file and line number. As stated in the environment section, we are crashing across 2 servers; the second is an identical CentOS 7 64-bit Linux system, fully updated via yum, running Asterisk 13.11.0-rc1. We will attach everything we have from both servers, file it as a bug report, and hope we can stabilize the system ASAP.
