[asterisk-bugs] [JIRA] (ASTERISK-25638) pjsip: Deadlock between monitor thread and worker threads

Fri Jan 22 12:40:33 CST 2016

     [ https://issues.asterisk.org/jira/browse/ASTERISK-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Serov updated ASTERISK-25638:
-------------------------------------

    Attachment: 2016_01_22__17_34_01.top.txt
                2016_01_22__17_34_01.netstat.txt
                2016_01_22__17_34_01.locks.txt
                2016_01_22__17_34_01.full.tail.txt
                2016_01_22__17_34_01.backtrace-threads.txt
                2016_01_22__17_33_01.top.txt
                2016_01_22__17_33_01.netstat.txt
                2016_01_22__17_33_01.locks.txt
                2016_01_22__17_33_01.full.tail.txt
                2016_01_22__17_33_01.backtrace-threads.txt

This is a duplicate of ASTERISK-25439.
My production server hangs with this lock every day. I have collected many lock listings, backtraces, logs, and "top", "ping", and "netstat" output, some of it with pjsip debug enabled.
There are two main errors: this deadlock and a segfault in find_entry (I think they share a common cause).
I can provide archives of all these logs if that helps to get rid of this very disruptive problem.

To show the nature of the deadlock, I have attached logs from today.
There are two sets of logs, taken one minute apart. In both, the program is in the same position, is still using CPU, and all endpoints are disconnected.

> pjsip: Deadlock between monitor thread and worker threads
> ---------------------------------------------------------
>
>                 Key: ASTERISK-25638
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25638
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_pjsip
>    Affects Versions: 13.6.0, 13.7.0
>         Environment: Centos 7.1
> Pjproject 2.4.5
>            Reporter: Jani Aho
>            Assignee: Jani Aho
>         Attachments: 2016_01_22__17_33_01.backtrace-threads.txt, 2016_01_22__17_33_01.full.tail.txt, 2016_01_22__17_33_01.locks.txt, 2016_01_22__17_33_01.netstat.txt, 2016_01_22__17_33_01.top.txt, 2016_01_22__17_34_01.backtrace-threads.txt, 2016_01_22__17_34_01.full.tail.txt, 2016_01_22__17_34_01.locks.txt, 2016_01_22__17_34_01.netstat.txt, 2016_01_22__17_34_01.top.txt, backtrace-threads.txt, core-show-locks.txt
>
>
> On some of our Asterisk servers, which serve over 50k calls per day, we experience deadlocks in pjsip. These deadlocks occur anywhere from several times per day to about once a week.
> The behaviour we see is that no SIP dialogs are created. The only messages that are sent are BYE messages, but the replies to these messages aren't handled.
> After one of those occasions we enabled debugging and got backtraces of locks and threads. These are from 13.6.0, but we have had the same kinds of deadlocks on 13.7.0-rc1.
> I've tried to make sense of the backtraces and I've made some observations.
> 1. In backtrace-threads.txt, thread 644 seems to be stuck in find_entry (defined at line 133 of pjlib/src/pj/hash.c). That thread acquired mod_ua.mutex in pjsip_ua_find_dialog.
> 2. The threads waiting in _lll_lock_wait, such as thread 632, are trying to acquire the same mutex in pjsip_ua_register_dlg.
> 3. A number of other calls are being attempted, but they don't seem to get processed by the taskprocessor.
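
The pattern in observations 1 and 2 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not pjlib's code: it assumes the lookup never terminates because a bucket chain has become cyclic, which the backtraces do not confirm. The names `Node`, `find_entry`, `table_lock`, and `simulate` are hypothetical stand-ins (the lock plays the role of mod_ua.mutex), and the `stop` event exists only so the demonstration can end.

```python
import threading


class Node:
    """One entry in a hypothetical hash bucket chain."""

    def __init__(self, key):
        self.key = key
        self.next = None


def find_entry(head, key, stop):
    """Walk a bucket chain looking for `key`.

    On a cyclic chain this loop never ends by itself; `stop` is an
    escape hatch added for the demo only.
    """
    node = head
    steps = 0
    while node is not None and not stop.is_set():
        if node.key == key:
            return node, steps
        node = node.next
        steps += 1
    return None, steps


def simulate(timeout=0.2):
    table_lock = threading.Lock()   # stand-in for the shared mutex
    stop = threading.Event()
    holding = threading.Event()

    # Assumed corruption: the chain a -> b -> a loops back on itself.
    a, b = Node("a"), Node("b")
    a.next, b.next = b, a
    result = {}

    def finder():
        # Analogous to a lookup that takes the mutex and then spins.
        with table_lock:
            holding.set()
            result["found"], result["steps"] = find_entry(a, "missing", stop)

    t = threading.Thread(target=finder)
    t.start()
    holding.wait()

    # Analogous to a worker that needs the same mutex to register a dialog.
    acquired = table_lock.acquire(timeout=timeout)
    if acquired:
        table_lock.release()

    stop.set()   # end the demo; a real spinning thread has no such exit
    t.join()
    return acquired, result["found"], result["steps"]
```

In this sketch the second thread's `acquire` times out while the first thread keeps burning CPU inside the lookup, which matches the symptom of a process that still uses CPU but handles no new work.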
