No subject


Thu Jul 12 09:23:04 CDT 2007


abandon code of EXITWITHTIMEOUT.  Yesterday, I lost 247 callers due to
EXITWITHTIMEOUT.

I observed the "Exiting on time-out cycle" message on my console when all
of my agents were in state "In-use" or "paused".

I am going to try to re-create the situation on a lab box.  I will need a
few days, as the bug does not manifest when I test this with a small scale
setup (i.e. one agent).  I already tested several single agent scenarios
before putting the patch into production.  Somehow membercount is getting
set to negative one, causing the new majik to break just like the old
majik.

Is there any reason why the code has to use these magic values?  In
try_calling(), the options are parsed and a boolean "noption" is set to 1. 
Then if that boolean is true, go_on is given the magic value -1, which is
supposed to ensure that the comparison against membercount will always fail
back in queue_exec().  But my observations suggest that there are
situations in which the comparison will be true even in the absence of the
'n' option being passed to Queue() in the dialplan.

If you've got a boolean value that is only true when n is passed and only
false when n is not passed, why use it to impart magic?  Why not just use
the boolean value.  It would even make the comment match the code:

  /* exit after 'timeout' cycle if 'n' option enabled */
  if (go_on >= qe.parent->membercount) {

Assume for a moment that 'noption' was available in the scope of
queue_exec().  Then this code could read:

  /* exit after 'timeout' cycle if 'n' option enabled */
  if( noption && go_on >= qe.parent->membercount) {

Which would guarantee that this code branch can't accidentally be entered
into.  Right now a more accurate comment is: "exit after 'timeout' cycle if
go_on has what we believe to be an out of range value"

It seems that the changes to app_queue in 1.4.8 and 1.4.9 were made with
the goal of doing things more efficiently (i.e. not having to walk the
member list to answer questions like "how many members are in the queue"),
but without enough edge case analysis.

I've found two edge cases now in which these changes cause the 'n' option
logic to be executed even if the 'n' option wasn't patched.  One has been
fixed and was rolled into 1.4.10, but it seems that there's at least one
other set of circumstances that can cause the bug to manifest.  Neither
edge case would exist if noption was used as a boolean consistently in both
queue_exec() and try_calling().

The code that parses the 'n' option and sets 'noption' to true is in the
try_calling() function.  The code that compares membercount to go_on is in
queue_exec().  go_on is passed by reference from queue_exec() to
try_calling(), whereas noption only exists in try_calling().  I can see why
it may have seemed like a good idea to overload go_on in the manner that it
was, but given the issues that have come up over it, why not just declare
noption in queue_exec() and pass it by reference to try_calling() as well?

try_calling() is static to app_queue, and the only function in app_queue
that calls it is queue_exec().  I can't think of any good reason not to
change the signature of try_calling() so that the 1.4.8 app_queue changes
are easier to read.  As it stands, I doubt anyone reading the code of
app_queue for the first time without the benefit of this bug report would
understand what's going on, as there are no comments to indicate what
values go_on can have an what the various ranges of those values mean.

I'll update this ticket when I have more lab results on the edge case I
observed yesterday. 

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
08-08-07 06:04  jfitzgibbon    Note Added: 0068591                          
======================================================================




More information about the asterisk-bugs mailing list