[asterisk-dev] HANGUPCAUSE variable wrong

Wed Aug 1 09:12:10 CDT 2012

On 08/01/2012 12:34 AM, Pavel Troller wrote:
>> On 07/30/2012 10:57 PM, Pavel Troller wrote:
>>> Hi!
>>>     My systems started to show strange behaviour recently (1.8 branch). It seems
>>> to be related to on of the latest updates, but I was not able to track it down
>>> yet.
>>>     The problem appears in the following scenario:
>>>     - There is a call attempt to a destination, which rejects the call
>>>       (the Dial() command return with some clear cause, for example 38).
>>>     - So there is another connection attempt to another destination, which is
>>>       successful and the call is established
>>>     - This secondary destination is of the DAHDI/PRI type
>>>     - The call is then released by the CALLED party
>>>     - The dialplan continues (Dial() with 'g' option)
>>>     In this case, the HANGUPCAUSE pseudo variable is not updated with the proper
>>> cause (16) and still uses the one from the first connection (38).
>>>     The following snippet from the debug log illustrates the problem, it shows the
>>> clearing part of the connection:
>>>
>>> [Jul 30 20:27:11] DEBUG[23000] channel.c: Didn't get a frame from channel: DAHDI/i2/602376534-a
>>> [Jul 30 20:27:11] DEBUG[23000] res_rtp_asterisk.c: Changing ssrc from 546686679 to 1038239072 due to a source change
>>> [Jul 30 20:27:11] DEBUG[23000] channel.c: Bridge stops bridging channels SCCP/2151-00000048 and DAHDI/i2/602376534-a
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Evaluating 'CDR(xsrc)' (from 'CDR(xsrc)}|${CDR(xdst)}|${CDR(sentno)}|${CDR(billno)}|${CDR(channel)}|${CDR(dstchannel)}|${CDR(sclass)}|${CDR(dclass)}|${CDR(start)}|${CDR(duration)}|${CDR(billsec)}|${CDR(cause)}|${CDR(disposition)}|${CDR(uniqueid)}|${CDR(src)}
>>> ' len 9)
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Function result is '2151'
>>> ... A lot of CDR evaluations skipped
>>> [Jul 30 20:27:11] DEBUG[23000] channel.c: Hanging up channel 'DAHDI/i2/602376534-a'
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: dahdi_hangup(DAHDI/i2/602376534-a)
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: Set option AUDIO MODE, value: ON(1) on DAHDI/i2/602376534-a
>>> [Jul 30 20:27:11] DEBUG[23000] sig_pri.c: sig_pri_hangup 32
>>> [Jul 30 20:27:11] DEBUG[23000] sig_pri.c: Not yet hungup...  Calling hangup with cause 16, and clearing call
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: Disabled echo cancellation on channel 32
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: Set option TDD MODE, value: OFF(0) on DAHDI/i2/602376534-a
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: Updated conferencing on 32, with 0 conference users
>>> [Jul 30 20:27:11] DEBUG[23000] chan_dahdi.c: Set option AUDIO MODE, value: OFF(0) on DAHDI/i2/602376534-a
>>> [Jul 30 20:27:11] DEBUG[23000] app_dial.c: Exiting with DIALSTATUS=ANSWER.
>>> [Jul 30 20:27:11] DEBUG[23000] app_macro.c: Executed application: Dial
>>> [Jul 30 20:27:11] DEBUG[27171] devicestate.c: No provider found, checking channel drivers for DAHDI - i2/602376534
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Result of 'HANGUPCAUSE' is '38'
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Launching 'Set'
>>> [Jul 30 20:27:11] DEBUG[23000] app_macro.c: Executed application: Set
>>> [Jul 30 20:27:11] DEBUG[27171] devicestate.c: Changing state for DAHDI/i2/602376534 - state 0 (Unknown)
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Result of 'DIALSTATUS' is 'ANSWER'
>>> [Jul 30 20:27:11] DEBUG[27171] devicestate.c: device 'DAHDI/i2/602376534' state '0'
>>> [Jul 30 20:27:11] DEBUG[23000] pbx.c: Result of 'HANGUPCAUSE' is '38'
>>>
>>>     You can see that the clear cause is 16 in our case, but HANGUPCAUSE returns
>>> 38 (the result from the first connection).
>>>     If the second connection attempt fails too (with another cause, say 17),
>>> it is properly copied to the HANGUPCAUSE pseudo variable.
>>>     As the HANGUPCAUSE is just reading of the hangupcause member of the channel
>>> structure, it seems that this member is not updated in the scenario described
>>> above.
>>>     It seems from the debug log, that the clear is not initiated by the DSS1
>>> signalling (DISC/REL messages), but by the lack of incoming frames from the
>>> channel. May be this is the reason ?
>>>     I would like to track this issue by myself to learn more of the Asterisk
>>> internal logic, so please could you just point me to a place, where the clear
>>> cause is to be written to the calling channel struct ? I will make some
>>> debugging, find the cause and possibly write a patch. I will open the issue
>>> then.
>>>     With regards,
>>>       Pavel
>> Thanks for taking the initiative to try to learn some Asterisk code!
>>
>> The problem with the hangupcause of a channel is that it can be set in *a
>> lot* of places throughout the code. What you're interested in is the
>> hangupcause of the inbound channel for a Dial() attempt. Most channel
>> drivers are going to set the hangupcause for the outbound channel, and that
>> value will get copied to the inbound channel within app_dial.c itself. Most
>> of the time, this happens during wait_for_answer(), when we're waiting for
>> the outbound leg of the call to be answered.
>>
>> Given that you're seeing the cause get set properly on the outbound
>> channel, my guess is that it simply is not being copied to the inbound
>> channel properly. You should be able to limit your search for hangupcause
>> setting in app_dial.c.
>>
>> You have already figured out that if the second call is rejected, then the
>> hangupcause is copied properly to the inbound channel, so the key here is
>> finding out what is different for the case where the call is answered.
>> Also, it would be a good idea to see if your options to Dial() are
>> affecting things.
>>
>> Regarding your observation about the lack of incoming frames, this is
>> typically the way that Asterisk indicates that the call has been hung up,
>> so I think it is unlikely that this is the root cause of the problem. I
>> think that what you need to do is look through app_dial and find places
>> where the hangupcause is and is not set. There may be some code path where
>> it is not set and it should be. For calls that are answered, you should
>> look in dial_exec_full() after the call to ast_bridge_call() to find where
>> the hangupcause for the inbound channel gets set.
>>
>> Hopefully that will set you on the proper path!
>> Mark Michelson
>>
> Hi Mark!
>    Many thanks for your help! With it, it was just a 30-minute exercise and
> the problem was found and fixed!
>    You were absolutely right, the problem was in wait_for_answer(). The answer
> condition is checked twice: The first check begins like this:
>
>            if (ast_test_flag64(o, DIAL_STILLGOING) && c->_state == AST_STATE_UP) {
>                 if (!peer) {
>                        ast_verb(3, "%s answered %s\n", c->name, in->name);
> ...
>    and the second:
>
>             case AST_FRAME_CONTROL:
>                  switch (f->subclass.integer) {
>                  case AST_CONTROL_ANSWER:
>                        /* This is our guy if someone answered. */
>                        if (!peer) {
>                              ast_verb(3, "%s answered %s\n", c->name, in->name);
> ...
>    While I perfectly understand the way the second check works - it's detection
> of the ANSWER control frame, the first check isn't so clear to me -
> I understand, that the channel state somewhat changes to UP, but I don't
> currently understand, who does it - I suspect the channel driver, or I may be
> wrong. What I really don't understand is, why there are these two variants and
> not only one, using the ANSWER frame, IMHO it would be cleaner. Why is there
> the possibility just to set the state to UP and not to send the ANSWER frame
> regularly instead ?

This is one of those cases that I'll admit that I don't fully understand 
either. There is one case for this that I do understand, and that is 
when a chan_agent channel is called. Those channels are already 
"answered" but then change states to up when they press the button to 
accept the call. I imagine that hitting this case depends on the channel 
driver in use for the outbound call leg more than anything else.
>    This was also the primary cause for the problem - while in the second case,
> there are the following lines
>           /* If call has been answered, then the eventual hangup is likely to be normal hangup */
>           in->hangupcause = AST_CAUSE_NORMAL_CLEARING;
>           c->hangupcause = AST_CAUSE_NORMAL_CLEARING;
>    they are missing in the first case. And of course, in my call scenario, the
> fist code path is being executed, and the causes were not updated there.
>    So, I just copied the above code block to the end of the first case, and the
> HANGUPCAUSE now works fully as expected.
>
>    What I cannot judgei, is whether this fix is clean and may be accepted,
> or whether I just found some more complex problem, which should be fixed
> somewhere else. What confuses me, is that the latest update of app_dial.c
> is from 25. of May, while I started to observe this problem really recently,
> just about one or two weeks back (I'm updating my Asterisk regularly from the
> 1.8 SVN, obviously once per week). It looks that something has changed in some
> other place, which caused that the answer handling changed from the second
> to the first code path for DAHDI/PRI calls. There really were some changes
> in chan_dahdi and friends (sig_pri etc.), so maybe one of them caused this
> change.
>    So, if you think it's OK, I'm ready to open the issue in Jira and post
> a patch (I already have a license for it).
I haven't given it a whole lot of thought, but the patch you suggest 
makes sense to me. If you are worried that the problem may come from 
some other place, then I recommend trying to pinpoint the revision to 
Asterisk that caused the behavior change. If you have a date where you 
know things worked properly, then I'd start from there and start 
checking revisions from that point to see when the problem starts 
happening. Depending on the time period, you may want to do a 
bisect-like method of checking revisions to save you some time.

Once you think you've understood everything, go ahead and open a JIRA 
issue with your patch attached. We may not get to it immediately, but 
we'll get to it.

Mark Michelson
>    With regards,
>      Pavel
>    
>
> --
> _____________________________________________________________________
> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>
> asterisk-dev mailing list
> To UNSUBSCRIBE or update options visit:
>     http://lists.digium.com/mailman/listinfo/asterisk-dev