[asterisk-bugs] [JIRA] (ASTERISK-22079) Segfault: INTERNAL_OBJ (user_data=0x6374652f) at astobj2.c:120

Fri Feb 28 12:14:05 CST 2014

    [ https://issues.asterisk.org/jira/browse/ASTERISK-22079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=215782#comment-215782 ] 

Steve Davies edited comment on ASTERISK-22079 at 2/28/14 12:13 PM:
-------------------------------------------------------------------

I can now reproduce this issue - Not easily, but it can be done. There are 2 bugs:

1) The #if 0 block in send_provisional_keepalive_full() should not be disabled for the reasons described in the comment in the function. It enables the easy leaking of a pvt ref, and does not fix the crash.

2) With the above fix applied, the crash becomes a ref-counter assertion instead of an attempt to ref an invalid chunk of memory. My comment above describes the reason for this correctly:

{quote}
2) If update_provisional_keepalive() is called while send_provisional_keepalive_full() is waiting on the PVT lock, then pvt->provisional_keepalive_sched_id will be changed to a new sched_id value by update_provisional_keepalive(), but that new sched_id then may be overwritten with -1 by send_provisional_keepalive_full(), killing the pvt's reference to a schedule and "leaking" the reference.
{quote}

Basically if the keepalive fires at the same time as the keepalive is being updated, you can lose the sched_id value, preventing proper cleanup at call-end. Here is a DEBUG_REFS trace of the bug happening with the #if 0 removed already

{code}
Problem: net Refcount not zero for object b4ef1b50
Object b4ef1b50 history:
   0xb4ef1b50 =1   chan_sip.c:8156:sip_alloc (allocate a dialog(pvt) struct)
   0xb4ef1b50 +1   chan_sip.c:8279:sip_alloc (link pvt into dialogs table) [@1]
   0xb4ef1b50 +1   chan_sip.c:4158:__sip_reliable_xmit (__sip_reliable_xmit: setting pkt->owner) [@2]
   0xb4ef1b50 +1   chan_sip.c:4321:sip_scheddestroy (setting ref as passing into ast_sched_add for __sip_autodestruct) [@3]
   0xb4ef1b50 -1   chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@4]
   0xb4ef1b50 +1   chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@3]
   0xb4ef1b50 -1   chan_sip.c:4399:__sip_ack (unref pkt cur->owner dialog from sip ack before freeing pkt) [@4]
   0xb4ef1b50 -1   chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@3]
   0xb4ef1b50 +1   chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@2]
   0xb4ef1b50 -1   chan_sip.c:4340:sip_cancel_destroy (remove ref for autokillid) [@3]
   0xb4ef1b50 +1   chan_sip.c:7647:sip_new (sip_new: set chan->tech_pvt to i) [@2]
   0xb4ef1b50 +1   chan_sip.c:26987:start_session_timer (adding session timer ref) [@3]
   0xb4ef1b50 +1   chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@4]
   0xb4ef1b50 -1   chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@5]
   0xb4ef1b50 -1   chan_sip.c:4531:update_provisional_keepalive (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@4]
   0xb4ef1b50 +1   chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@3]
   0xb4ef1b50 -1   chan_sip.c:4531:update_provisional_keepalive (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@4]
   0xb4ef1b50 +1   chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@3]
   0xb4ef1b50 -1   chan_sip.c:6753:sip_hangup (unref ast->tech_pvt) [@4]
   0xb4ef1b50 +1   chan_sip.c:4321:sip_scheddestroy (setting ref as passing into ast_sched_add for __sip_autodestruct) [@3]
   0xb4ef1b50 -1   chan_sip.c:26948:stop_session_timer (removing session timer ref) [@4]
   0xb4ef1b50 -1   chan_sip.c:6794:sip_hangup (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@3]
   0xb4ef1b50 +1   chan_sip.c:4158:__sip_reliable_xmit (__sip_reliable_xmit: setting pkt->owner) [@2]
   0xb4ef1b50 +1   chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@3]
   0xb4ef1b50 -1   chan_sip.c:4399:__sip_ack (unref pkt cur->owner dialog from sip ack before freeing pkt) [@4]
   0xb4ef1b50 -1   chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@3]
   0xb4ef1b50 +1   chan_sip.c:3283:dialog_unlink_all (Let's bump the count in the unlink so it doesn't accidentally become dead before we are done) [@2]
   0xb4ef1b50 -1   chan_sip.c:3285:dialog_unlink_all (unlinking dialog via ao2_unlink) [@3]
   0xb4ef1b50 -1   chan_sip.c:3349:dialog_unlink_all (Let's unbump the count in the unlink so the poor pvt can disappear if it is time) [@2]
   0xb4ef1b50 -1   chan_sip.c:4275:__sip_autodestruct (The ref to a dialog passed to this sched callback is going out of scope; unref it.) [@1]
   0xb4ef1b50 **call destructor** chan_sip.c:4275:__sip_autodestruct (The ref to a dialog passed to this sched callback is going out of scope; unref it.)
   0xb4ef1b50 -1   chan_sip.c:4511:send_provisional_keepalive_full (dialog ref for provisional keepalive) [@0]
{code}

I think there is a fundamental issue here with the scheduler - Perhaps the scheduler should pass the sched_id into all of its callbacks, along with the data so that the called function can verify that the sched_id matches what it thinks it is working on, and act differently if necessary?

A quick workaround that will probably reduce the problem to *almost* zero would be for send_provisional_keepalive_full to grab a copy of pvt->provisional_keepalive_sched_id before getting any locks, and check that it is unchanged after the sip_pvt_lock_full() delay has occurred.

If it has changed, then do the ao2_ref(pvt,-1) but do not kill the new value of pvt->provisional_keepalive_sched_id, and don't bother sending the keepalive.

Thoughts? My weekend is starting, I will submit patches next week if no-one does so before me.

P.S. here is the dialplan to cause the issue, just run calls at it constantly until it breaks. It can take several tens of minutes, or it can happen really quickly. I run a call per second at it and it takes between 2 and 30 minutes (usually.)

{code}
exten => 900,1,NoOp(Crash Generator)
same  =>     n,Ringing
same  =>     n,Wait(60)
same  =>     n,Progress
same  =>     n,Hangup
{code}

      was (Author: one47):
    I can now reproduce this issue - Not easily, but it can be done. There are 2 bugs: