[asterisk-dev] Deadlock detected in asterisk-1.8.9.0 x86_64

Alex Villacís Lasso a_villacis at palosanto.com
Tue Jan 31 16:12:59 CST 2012


On 31/01/12 16:47, Richard Mudgett wrote:
>> I am having problems with a deadlock in Asterisk 1.8.9.0.
>>
>> The system is an x86_64 machine that is being used as a call center.
>> The agents log in via the AgentLogin application, and each
>> Agent/XXXX channel is assigned to one or more queues. A custom
>> separate process generates calls into the queues for the agents to
>> answer. The calls all go out through a SIP trunk, and all of the
>> agent extensions are SIP. After an hour or so, asterisk deadlocks.
>> Any attempt to run "agent show" or "agent show online" through the
>> console hangs. Also, AMI events seem to stop. However, the users
>> seem to be still connected, only they do not receive calls anymore
>> (the custom process waits forever for the Originate response). The
>> deadlock is apparently spontaneous - there is no explicit action
>> taken by the administrator that seems to induce the issue. I will
>> try to make sense of the attached traces, but I hope someone on the
>> list could provide a clue on what to look for.
>>
>> Backtraces attached to
>> https://issues.asterisk.org/jira/browse/ASTERISK-19285 I have marked
>> this bug as a regression because the patch that is supposed to fix
>> https://issues.asterisk.org/jira/browse/ASTERISK-18092 is probably
>> the one that introduced this bug.
>>
>> This is what I believe is happening:
>>
>>
>> The user is running a script that periodically invokes the AMI action
>> "Agents", which is handled by action_agents() in
>> channels/chan_agent.c:1499. This function traverses the agent list,
>> and for each one first takes a lock on struct agent_pvt *p
>> (chan_agent.c:1516), then attempts to take a lock on p->owner (a
>> channel of type Agent, I think) at chan_agent.c:1534, in order to
>> check whether this is a bridged channel. This second lock is the one
>> that was introduced by the patch that "fixes" ASTERISK-18092.
>>
>> Meanwhile, in another thread, some frames need to be written to the
>> Agent/xxxx channel, at ast_write() in main/channel.c:4767. In
>> channel.c:4774, a lock is taken on the channel (which happens to be
>> the one at p->owner), and then the tech-specific write method is
>> invoked at channel.c:5032. For Agent channels, this method is
>> agent_write() at channels/chan_agent.c:691. This method extracts
>> tech_pvt from the channel (which happens to be an agent_pvt, the one
>> picked up in the other thread at line 1516), then attempts to take a
>> lock on it. Therefore, a deadlock.
>>
>> I was about to perform what amounts to a revert of the fix for
>> ASTERISK-18092, but then I looked in the CHANGELOG and realized
>> this other issue. Also, I found an inconsistency in the handling of
>> the Agents action as compared to the commands "agent show" and
>> "agent show online". If the thread holds a lock on agent_pvt, and
>> should then take a lock on agent_pvt->owner, then agents_show() at
>> line 1702 and agents_show_online() at line 1771 must have the same
>> lock taken in order to be consistent with the "fix" for
>> action_agents(). On the other hand, I originally decided to revert
>> the change in action_agents() in order to make the function
>> consistent with agents_show() and agents_show_online() which do not
>> take the lock. So, which lock order is the correct one? Should the
>> code at chan_agent.c release the lock on agent_pvt before taking the
>> lock on the owner channel? Should the ast_channel_lock() at
>> chan_agent.c:1534 be replaced by a call to ast_channel_trylock(), as
>> used by main/channel.c ? Other ideas?
>>
> The established locking order is
> 1. channel lock
> 2. channel tech private lock
>
> Locking the other way goes against the established locking order and
> needs to be fixed by one of the following methods:
> 1. deadlock avoidance techniques
> 2. You could also look at chan_local.c:awesome_locking() and
> chan_sip.c:sip_pvt_lock_full() for a method that avoids the deadlock
> avoidance loop.
>
> Deadlock avoidance also needs to be done when attempting to get two
> channel locks (see ast_channel_lock_both()).
>
> The more locks held, the trickier it becomes to acquire all of the locks.
>
> Richard
>
>
I decided to go the standard way (as used in the rest of the
chan_agent.c code) and do deadlock avoidance. The patch was attached to
the bug report at https://issues.asterisk.org/jira/browse/ASTERISK-19285.
To be safe, I also added the same deadlock avoidance in the handlers for
"agent show" and "agent show online". Please check it out and comment on
it.
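
For readers following the thread, the deadlock avoidance pattern can be
sketched roughly as below. This is a minimal standalone illustration
using plain pthread mutexes, not the patch attached to the bug report.
The names struct channel, struct agent, agent_lock_owner(),
agent_is_bridged() and channel_write() are hypothetical stand-ins for
ast_channel, agent_pvt, and the code paths discussed above. In Asterisk
itself, the same loop is usually written with ast_channel_trylock() and
the DEADLOCK_AVOIDANCE() macro.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ast_channel. */
struct channel {
    pthread_mutex_t lock;   /* "channel lock": first in the lock order */
    int bridged;
};

/* Hypothetical stand-in for struct agent_pvt. */
struct agent {
    pthread_mutex_t lock;   /* "tech private lock": second in the lock order */
    struct channel *owner;  /* protected by agent->lock; may be NULL */
};

/*
 * Must be called with p->lock held. Because the private lock is already
 * held, the channel lock is only *tried*. On failure the private lock is
 * dropped, the thread yields, and the whole attempt is repeated, so this
 * thread never blocks on the channel while holding the private lock.
 * Returns the locked owner, or NULL if the agent has no owner.
 */
struct channel *agent_lock_owner(struct agent *p)
{
    while (p->owner != NULL) {
        if (pthread_mutex_trylock(&p->owner->lock) == 0) {
            return p->owner;        /* both locks are now held */
        }
        pthread_mutex_unlock(&p->lock);
        sched_yield();              /* let the channel lock holder progress */
        pthread_mutex_lock(&p->lock);
        /* p->owner may have changed while unlocked; the loop re-reads it */
    }
    return NULL;
}

/* Reader path, shaped like the "Agents" action handler. */
int agent_is_bridged(struct agent *p)
{
    int bridged = 0;
    struct channel *owner;

    pthread_mutex_lock(&p->lock);
    owner = agent_lock_owner(p);
    if (owner != NULL) {
        bridged = owner->bridged;
        pthread_mutex_unlock(&owner->lock);
    }
    pthread_mutex_unlock(&p->lock);
    return bridged;
}

/* Writer path, shaped like ast_write() calling agent_write(): it follows
 * the established order, channel lock first and private lock second. */
void channel_write(struct channel *chan, struct agent *p, int bridged)
{
    pthread_mutex_lock(&chan->lock);
    pthread_mutex_lock(&p->lock);
    chan->bridged = bridged;
    pthread_mutex_unlock(&p->lock);
    pthread_mutex_unlock(&chan->lock);
}
```

The point of the sketch is that only the thread taking the locks in the
"wrong" order needs to back off. The writer path can keep blocking in
the established order, channel lock first and private lock second.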


