[asterisk-dev] Deadlock detected in asterisk-1.8.9.0 x86_64

Richard Mudgett rmudgett at digium.com
Tue Jan 31 15:47:47 CST 2012


> I am having problems with a deadlock in Asterisk 1.8.9.0.
> 
> The system is an x86_64 machine that is being used as a callcenter.
> The agents log in via the AgentLogin application, and each
> Agent/XXXX channel is assigned to one or more queues. A custom
> separate process generates calls into the queues for the agents to
> answer. The calls all go out through a SIP trunk, and all of the
> agent extensions are SIP. After an hour or so, asterisk deadlocks.
> Any attempt to run "agent show" or "agent show online" through the
> console hangs. Also, AMI events seem to stop. However, the users
> seem to be still connected, only they do not receive calls anymore
> (the custom process waits forever for the Originate response). The
> deadlock is apparently spontaneous - there is no explicit action
> taken by the administrator that seems to induce the issue. I will
> try to make sense of the attached traces, but I hope someone on the
> list could provide a clue on what to look for.
> 
> Backtraces are attached to ASTERISK-19285
> (https://issues.asterisk.org/jira/browse/ASTERISK-19285). I have marked
> this bug as a regression because the patch that is supposed to fix
> https://issues.asterisk.org/jira/browse/ASTERISK-18092 is probably
> the one that introduced this bug.
> 
> This is what I believe is happening:
> 
> 
> The user is running a script that periodically invokes the AMI action
> "Agents", which is handled by action_agents() in
> channels/chan_agent.c:1499. This function traverses the agent list,
> and for each one first takes a lock on struct agent_pvt *p
> (chan_agent.c:1516), then attempts to take a lock on p->owner (a
> channel of type Agent, I think) at chan_agent.c:1534, in order to
> check whether this is a bridged channel. This second lock is the one
> that was introduced by the patch that "fixes" ASTERISK-18092.
> 
> Meanwhile, in another thread, some frames need to be written to the
> Agent/xxxx channel, at ast_write() in main/channel.c:4767. In
> channel.c:4774, a lock is taken on the channel (which happens to be
> the one at p->owner), and then the tech-specific write method is
> invoked at channel.c:5032. For Agent channels, this method is
> agent_write() at channels/chan_agent.c:691. This method extracts
> tech_pvt from the channel (which happens to be an agent_pvt, the one
> picked up in the other thread at line 1516), then attempts to take a
> lock on it. Therefore, a deadlock.
> 
> I was about to perform what amounts to a revert of the fix for
> ASTERISK-18092, but then I looked in the CHANGELOG and realized
> this other issue. Also, I found an inconsistency in the handling of
> the Agents action as compared to the commands "agent show" and
> "agent show online". If the thread holds a lock on agent_pvt, and
> should then take a lock on agent_pvt->owner, then agents_show() at
> line 1702 and agents_show_online() at line 1771 must have the same
> lock taken in order to be consistent with the "fix" for
> action_agents(). On the other hand, I originally decided to revert
> the change in action_agents() in order to make the function
> consistent with agents_show() and agents_show_online() which do not
> take the lock. So, which lock order is the correct one? Should the
> code at chan_agent.c release the lock on agent_pvt before taking the
> lock on the owner channel? Should the ast_channel_lock() at
> chan_agent.c:1534 be replaced by a call to ast_channel_trylock(), as
> used by main/channel.c? Other ideas?
> 
The established locking order is
1. channel lock
2. channel tech private lock

Locking the other way goes against the established locking order and
needs to be fixed by one of the following methods:
1. deadlock avoidance techniques
2. You could also look at chan_local.c:awesome_locking() and
chan_sip.c:sip_pvt_lock_full() for a method that avoids the deadlock
avoidance loop.
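
As an illustration of method 1, here is a minimal sketch of a deadlock
avoidance loop for code that already holds the tech private lock and then
needs the channel lock. It uses plain pthread mutexes rather than the
Asterisk locking API. The names struct channel, struct pvt and
pvt_lock_owner() are hypothetical stand-ins, not code from chan_agent.c,
and the sketch assumes the owner pointer only changes while the pvt lock
is held.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ast_channel and struct agent_pvt. */
struct channel {
    pthread_mutex_t lock;
};

struct pvt {
    pthread_mutex_t lock;
    struct channel *owner;  /* assumed to change only while pvt lock is held */
};

/*
 * Call with p->lock held. Returns p->owner with both locks held,
 * or NULL (only p->lock held) if there is no owner.
 */
struct channel *pvt_lock_owner(struct pvt *p)
{
    for (;;) {
        struct channel *chan = p->owner;

        if (chan == NULL) {
            return NULL;
        }
        if (pthread_mutex_trylock(&chan->lock) == 0) {
            return chan;    /* got the channel lock without blocking */
        }
        /* Someone holds the channel lock and may want p->lock: back off. */
        pthread_mutex_unlock(&p->lock);
        sched_yield();
        pthread_mutex_lock(&p->lock);
    }
}
```

Because the pvt lock is dropped while backing off, the caller has to treat
anything it read from the pvt before the call as possibly stale. Real code
would also need to make sure the channel cannot be destroyed while the pvt
lock is released, which this sketch does not attempt.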

Deadlock avoidance also needs to be done when attempting to get two
channel locks; see ast_channel_lock_both().
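
The idea behind taking two locks of the same rank can be sketched with
plain pthread mutexes as follows. This is an illustration of the
technique only, not the Asterisk implementation; struct channel and
channel_lock_both() are hypothetical names chosen for the example.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for struct ast_channel. */
struct channel {
    pthread_mutex_t lock;
};

/*
 * Lock both channels without ever blocking on the second lock while
 * holding the first. On return both locks are held by the caller.
 */
void channel_lock_both(struct channel *c1, struct channel *c2)
{
    pthread_mutex_lock(&c1->lock);
    while (pthread_mutex_trylock(&c2->lock) != 0) {
        /* c2 is busy: release c1 so its holder can make progress. */
        pthread_mutex_unlock(&c1->lock);
        sched_yield();
        pthread_mutex_lock(&c1->lock);
    }
}
```

Two threads calling this with the arguments in opposite order cannot
deadlock on each other, because neither ever sleeps on one channel lock
while holding the other.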

The more locks held, the trickier it becomes to acquire all of the locks.

Richard


