[asterisk-bugs] [JIRA] (ASTERISK-25000) Deadlock in ast_do_masquerade (specifically in ast_hangup on the zombie clone if it's hungup during the masquerade)

William luke (JIRA) noreply at issues.asterisk.org
Mon Apr 27 16:21:33 CDT 2015


    [ https://issues.asterisk.org/jira/browse/ASTERISK-25000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=226025#comment-226025 ] 

William luke commented on ASTERISK-25000:
-----------------------------------------

Hi,

Thanks for looking over this.

Enabling DEBUG_THREADS was one of the things I tried - unfortunately it kills the server very quickly due to the additional overhead.
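
A lighter-weight way to find out who holds a lock could be to record only the owner on acquire and inspect it from gdb afterwards. The following is a minimal sketch using plain pthreads, not anything that exists in Asterisk: tracked_lock, tracked_lock_acquire and the TRACKED_LOCK macro are hypothetical names, and it ignores recursive locking entirely.

```c
#include <pthread.h>

/* Hypothetical wrapper for illustration; not part of Asterisk. */
struct tracked_lock {
    pthread_mutex_t mutex;
    pthread_t owner;   /* meaningful only while held is non-zero */
    const char *file;  /* source file of the last successful acquire */
    int line;          /* source line of the last successful acquire */
    int held;          /* 1 while some thread holds the mutex */
};

int tracked_lock_init(struct tracked_lock *t)
{
    t->file = NULL;
    t->line = 0;
    t->held = 0;
    return pthread_mutex_init(&t->mutex, NULL);
}

/* Lock the mutex, then note which thread took it and from where. */
int tracked_lock_acquire(struct tracked_lock *t, const char *file, int line)
{
    int rc = pthread_mutex_lock(&t->mutex);

    if (rc == 0) {
        t->owner = pthread_self();
        t->file = file;
        t->line = line;
        t->held = 1;
    }
    return rc;
}

/* Clear the bookkeeping while still holding the mutex, then unlock. */
int tracked_lock_release(struct tracked_lock *t)
{
    t->held = 0;
    t->file = NULL;
    t->line = 0;
    return pthread_mutex_unlock(&t->mutex);
}

#define TRACKED_LOCK(t) tracked_lock_acquire((t), __FILE__, __LINE__)
```

The bookkeeping is only written while the mutex is held, so the cost per acquire is a handful of stores. In a hung process the file, line and owner fields of the stuck lock could then be printed from the debugger.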

I've got some additional backtraces of what I believe is the same issue occurring again. It appears that a <ZOMBIE> channel is being left locked after a masquerade. In the latest example it's the devicestate thread that deadlocks while iterating over the list of channels: it holds the channel container lock, then tries to lock the channel and stalls there.
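
To illustrate the pattern, here is a minimal sketch with plain pthreads. The demo_channel and demo_container types and the demo_find_locked_channel function are hypothetical stand-ins, not Asterisk's structures. It takes the locks in the same order as the stalled thread (container first, then each channel), but uses a trylock so that a channel whose lock was never released is reported instead of hanging the caller.

```c
#include <errno.h>
#include <pthread.h>

#define DEMO_MAX_CHANNELS 8

/* Hypothetical stand-ins; these are not Asterisk's structures. */
struct demo_channel {
    pthread_mutex_t lock;
};

struct demo_container {
    pthread_mutex_t lock;
    struct demo_channel *items[DEMO_MAX_CHANNELS];
    int count;
};

/*
 * Walk the container in the same order as the stalled thread:
 * container lock first, then each channel lock in turn.
 * pthread_mutex_trylock() is used instead of a blocking lock so a
 * channel lock that is still held is reported rather than waited on.
 * Returns the index of the first channel whose lock is busy, or -1
 * if every channel lock could be taken.
 */
int demo_find_locked_channel(struct demo_container *c)
{
    int busy = -1;

    pthread_mutex_lock(&c->lock);
    for (int i = 0; i < c->count; i++) {
        int rc = pthread_mutex_trylock(&c->items[i]->lock);

        if (rc == EBUSY) {
            busy = i;
            break;
        }
        if (rc == 0) {
            pthread_mutex_unlock(&c->items[i]->lock);
        }
    }
    pthread_mutex_unlock(&c->lock);
    return busy;
}
```

With a blocking lock in place of the trylock, the walker would wait on the busy channel while still holding the container lock, so every other thread that needs the container would queue up behind it.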

Why do we add the clonechan back to the container in ast_do_masquerade?

        if (!clone_was_zombie) {
                ao2_link(channels, clonechan);
        }

Is the intention that the null frame we queued to it will then cause it to be cleared off? If so, where would this happen, and what about other threads accessing it in between?

Thanks again!




> Deadlock in ast_do_masquerade (specifically in ast_hangup on the zombie clone if it's hungup during the masquerade)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-25000
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25000
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Core/Channels
>    Affects Versions: 11.16.0
>         Environment: CentOS 6. Dual Xeon Dell server. Under relatively heavy load (250k calls/day), with lots of AMI actions.
>            Reporter: William luke
>            Assignee: William luke
>            Severity: Critical
>         Attachments: backtrace-threads-20150422.txt, dialplan_snippet.txt, verboselog.rar
>
>
> We're seeing a deadlock where the AMI thread completely locks up. (Thread ID 19109 in backtrace attachment)
> A backtrace shows that it's while doing a dual redirect.
> When redirecting the second channel (from manager.c:3895), inside ast_do_masquerade, we decide the clone was a zombie, and then in channel.c line 7331 call ast_hangup on it.
> This ast_hangup tries to grab a channel lock (channel.c:2885) and hangs here indefinitely.
> What's peculiar is that a few lines higher up it's successfully managed to grab and then release this same channel lock.
> It would seem that as the masquerade began, this channel (clonechan) had hung up at the same moment. (See line 212288 in the verboselog attachment. The channel in question is "SIP/gl-agw-01-000f7f0e".)
> So something has happened to the state of this channel, and something has not released its channel lock.
> I'm unable to see which other thread is holding the lock.
> The issue occurred at 15:32:20 in the verboselog file. The first part of the dual redirect can be seen at line 212279.
> I executed a "core restart" via the CLI, but this hung the CLI, and I had to kill the Asterisk process.





