[asterisk-bugs] [JIRA] (ASTERISK-25638) pjsip: Deadlock between monitor thread and worker threads

Richard Mudgett (JIRA) noreply at issues.asterisk.org
Thu Jan 7 14:01:33 CST 2016


     [ https://issues.asterisk.org/jira/browse/ASTERISK-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Mudgett updated ASTERISK-25638:
---------------------------------------

    Assignee: Jani Aho  (was: Richard Mudgett)
      Status: Waiting for Feedback  (was: Open)

The backtrace does not show a classical deadlock.  The problem might not be a deadlock at all: your system may be running so low on CPU cycles or physical memory that the machine has fallen far enough behind in processing SIP messages to appear deadlocked.

# Thread 644 would be the key thread if it somehow got blocked.  The find_entry() function currently being executed by the thread does not block.  The find_entry() function is likely walking the linked list of dialogs.  Unless the dialog table linked list is exceedingly long, or has been corrupted into an unending circular chain of entries, find_entry() won't take long.  I checked that the dialog table linked list, which find_entry() is searching, is protected from reentrancy by obtaining the mod_ua.mutex lock.  I did not see any code path that failed to protect the list.
# The PJSIP thread pool threads like 632 that are waiting for the mod_ua.mutex lock shouldn't have long to wait as find_entry() shouldn't be taking very long.
# There are 613 threads that are each trying to create an outbound dialed channel.  Each of these threads is waiting for a PJSIP thread pool thread to process its request.
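
To illustrate the first point, here is a minimal Python sketch of why a chain walk is quick on a healthy list and never finishes on a circular one.  It is not the pjlib code: the Entry class and the functions find_entry_bounded(), chain_has_cycle() and build_chain() are hypothetical stand-ins, and the step limit is an assumption added only so the example terminates.

```python
class Entry:
    """Hypothetical stand-in for one node in a chained list of dialogs."""

    def __init__(self, key, value=None):
        self.key = key
        self.value = value
        self.next = None


def find_entry_bounded(head, key, max_steps):
    """Walk the chain looking for key and return (entry, steps).

    Raises RuntimeError once the walk exceeds max_steps.  Without that
    limit, a walk over a circular chain for a missing key never ends.
    """
    steps = 0
    node = head
    while node is not None:
        if node.key == key:
            return node, steps
        node = node.next
        steps += 1
        if steps > max_steps:
            raise RuntimeError("chain exceeds max_steps; possibly circular")
    return None, steps


def chain_has_cycle(head):
    """Floyd's tortoise-and-hare check for a circular chain."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


def build_chain(keys):
    """Build a chain from keys and return (head, tail)."""
    head = tail = None
    for k in keys:
        entry = Entry(k)
        if head is None:
            head = tail = entry
        else:
            tail.next = entry
            tail = entry
    return head, tail
```

On a healthy chain the walk ends after at most one step per entry, so a thread holding the lock releases it quickly.  If the tail is linked back to the head, a lookup for a key that is not present only stops because of the artificial step limit, and every thread waiting on the lock would wait for as long as the walk runs.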

The backtrace has told me as much as it can.  I'm going to need more information on how your system is set up and configured to try to reproduce this issue.  Please attach any needed configuration files such as pjsip.conf.  Please review
https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

When the system is in the "deadlocked" state:

What does CLI "core show taskprocessors" show about the size of the in queues?  Please attach the command output as a {{.txt}} file to the issue.  The "Processed" column tells how many tasks are completed.  The "In Queue" column tells how many tasks are waiting to be processed.  If zero then nothing is pending.  The "Max Depth" column gives the longest pending task queue seen since the taskprocessor was created.

Do repeated CLI "core show taskprocessors" requests show that the SIP-control taskprocessor is processing tasks?  Repeated backtraces can also show that things are being processed.
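
As a rough aid for comparing two such outputs, here is a Python sketch that parses two saved snapshots and checks whether the "Processed" count for a taskprocessor advanced.  It assumes each data row ends with exactly the three integer columns described above (Processed, In Queue, Max Depth); the function names parse_taskprocessors() and is_making_progress() are hypothetical, and the parsing would need adjusting if your Asterisk version prints different columns.

```python
def parse_taskprocessors(text):
    """Parse saved CLI output into {name: (processed, in_queue, max_depth)}.

    Assumption: every data row ends with three integer columns.  Header,
    separator and summary lines do not match and are skipped.
    """
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue
        name, *counts = parts
        if not all(c.isdigit() for c in counts):
            continue
        rows[name.strip()] = tuple(int(c) for c in counts)
    return rows


def is_making_progress(before, after, name):
    """Return True if the named taskprocessor completed tasks between snapshots."""
    if name not in before or name not in after:
        raise KeyError(name)
    return after[name][0] > before[name][0]
```

Each snapshot can be saved with asterisk -rx "core show taskprocessors" redirected to a file, a few seconds apart.  If is_making_progress() returns False for SIP-control while its "In Queue" value stays above zero, the taskprocessor is not draining its queue.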

Does CLI "core ping taskprocessor SIP-control" respond with a ping time?  Note that the CLI will be blocked until you get a ping time response.


> pjsip: Deadlock between monitor thread and worker threads
> ---------------------------------------------------------
>
>                 Key: ASTERISK-25638
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25638
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_pjsip
>    Affects Versions: 13.6.0, 13.7.0
>         Environment: Centos 7.1
> Pjproject 2.4.5
>            Reporter: Jani Aho
>            Assignee: Jani Aho
>         Attachments: backtrace-threads.txt, core-show-locks.txt
>
>
> On some of our asterisk servers which serve over 50k calls per day, we experience deadlocks in pjsip. These deadlocks can occur anywhere from multiple times per day to maybe once a week.
> The behaviour we see is that no SIP dialogs are created. The only messages that are sent are BYE messages, but the replies to these messages aren't handled.
> After one of those occasions we enabled debugging and got backtraces of locks and threads. These are from 13.6.0, but we have had the same kinds of deadlocks on 13.7.0-rc1.
> I've tried to make sense of the backtraces and I've made some observations.
> 1. In backtrace-threads.txt Thread 644 seems to be stuck in find_entry in pjlib/src/pj/hash.c defined on line 133. That thread has acquired the mod_ua.mutex in pjsip_ua_find_dialog
> 2. The threads, like thread 632, which are waiting in _lll_lock_wait are trying to acquire the mutex as above in pjsip_ua_register_dlg.
> 3. A number of other calls are being attempted, but they don't seem to get processed by the taskprocessor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


