[asterisk-dev] [Code Review] [patch] Use of MASTER_CHANNEL causes a race condition ending in a deadlock among channels

Sat Mar 5 23:38:32 CST 2011

On 110303 1936, Russell Bryant wrote:
>
> I took a look at the backtrace on the issue.  This is the exact same situation we were looking at.

> The pbx thread does not have the channel locked going into the waitforsilence application.  It is locking it in ast_read().  It's stuck in res_timing_timerfd.

As a general architecture question, is the channel supposed to be locked 
while in timerfd_timer_ack() by design, or is this rather a 
manifestation of a problem? We may have 10+ channels simultaneously 
calling WaitForSilence, with a 2.5 second limit. Would that mean we 
are asking for long-lingering channel locks?

> If we had a test environment that reliably reproduced it, that would be a great start.

This was happening on a production server, and only under a high-ish 
load condition. It might work for 2 weeks without freezing up, or it 
might freeze twice a day. This is as reliable as it could get.

Now, this machine is for egress calls, placed by our AI guys. This means 
that we generally can afford some (little) downtime; still, we are 
losing the calls already in progress, so it should be used for testing 
very sparingly.

Having said that, I do want to track down the problem. I am dying to 
implement CEL, for one, and it is not in 1.6. Our call routing is 
complex enough that CDR is messy.

> For you or anyone else that hits this same issue, the only workaround right now is to install DAHDI and use the res_timing_dahdi module instead of res_timing_timerfd.

An interesting observation is that 1.6.2, also configured with timerfd, 
has never had a single freeze-up.

Do you think installing the latest kernel may affect the problem? Mine 
is fairly old, and timerfd is a fairly recent development.

  -kkm