[asterisk-dev] tilghman: branch 1.6.0 r244072 - in /branches/1.6.0: ./ channels/ main/

Russell Bryant russell at digium.com
Tue Feb 2 09:15:12 CST 2010


On 02/01/2010 11:41 PM, Tilghman Lesher wrote:
> Actually, no, I have not reintroduced another problem.  The original report
> was never really resolved, as evidenced by the below log.  So before anything
> was done, we had the error message.  We have the error message before this
> patch, and we have the error message after this patch (although finally, the
> error message NOW will appear much less often).  So whether or not the warning
> exists after the latest patch is mostly irrelevant to the discussion.  That it
> occurs much less frequently than before is helpful to the situation.

If the warning still occurs, it is relevant.  It is indicative of a 
problem that we must get to the bottom of.  The bug, this commit, and 
this discussion are all about that problem.  The change to make the 
message come up much less often is certainly welcome, but we still have 
a problem to figure out.

>>>> if (other->pbx || other->_bridge || !ast_strlen_zero(other->appl)) {
>>>>            ast_queue_frame(other, f);
>>>> }

>> Based on this discussion, I do not agree with the patch committed.  The
>> code that was in chan_local already accounted for this situation.  In
>> this situation, both sides of a Local channel will have a non-NULL pbx
>> on the ast_channel.  So, something else is going on and we should figure
>> out what it is.
>
> It may have accounted for the situation, but it did not solve the problem,
> because the frame queue still overflowed.  If we don't solve that problem,
> any code change merely spits into the wind.  We have to avoid the frame
> queue constantly overflowing, or we aren't really solving the problem.

We solved it for some people with those changes to chan_local.  That's 
why it went in in the first place.  I certainly do agree that we must 
keep working on it to figure out what situations have not been accounted 
for and why.

>> Take for example:
>>
>> *CLI>  originate Local/123 at foo extension 456 at bar
>>
>> Here is what you have connected.  Both sides are running a PBX.
>>
>> (Some app)<-->
>>      PBX<-->
>>         [Local/123 at foo,1]<-->
>>            [Local/123 at foo,2]<-->
>>               PBX<-->
>>                  (Some app)
>
> What we are trying to do is prevent the frame queue from overflowing, by
> detecting the particular reasons why a frame queue should not overflow.  So
> what this is really showing is that whether or not a PBX is running on a
> channel is not relevant in all cases as to whether the frame queue will
> overflow.  Whether you intended to or not, you've just disproved that the
> original patch was at all correct for the situation (because a PBX existed on
> the channel AND it was bridged, but the frame queue overflowed anyway).
> That is the problem that the latest patch solves (or at least deals with in a
> way that it does not immediately recur).

I do not think this demonstrates that the changes in chan_local are 
incorrect.  First of all, there is nothing in this diagram that shows a 
bridge (that would satisfy the non-NULL ast_channel->_bridge check). 
The connection between the two sides of the local channel is handled 
internally to chan_local and not via normal channel bridging.

If this is in fact the situation, and the frame queue overflow is still 
occurring for this particular user in this scenario, then you're right 
that we have not found a full solution.  If a PBX is on the channel, 
that means an application is running that has as its most fundamental 
job to service the channel.  So, it is fair for chan_local to assume 
that it can feed frames that direction.  It points to there being a 
problem in an application, or within the PBX code, that can end up 
blocking for a significant period of time not servicing the channel.

So, to summarize my proposed way forward:

1) Of course, leave the improvement to the frame queue overflow handling 
to flush the queue when it occurs to at least help lessen the effects 
when the problem occurs.

2) Determine what it was about the changes in chan_local that caused the 
problem with ChanSpy().  Based on what I know now, I still feel very 
confident that those changes in chan_local were appropriate and correct. 
  If it broke ChanSpy, I would really like to understand exactly why to 
help avoid similar errors in the future.

-- 
Russell Bryant
Digium, Inc. | Engineering Manager, Open Source Software
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
www.digium.com -=- www.asterisk.org -=- blogs.asterisk.org



More information about the asterisk-dev mailing list