[asterisk-dev] Possible crash-scenario? Asterisk scheduler.

Matthew Jordan mjordan at digium.com
Thu Jan 30 11:49:24 CST 2014


On Wed, Jan 29, 2014 at 5:44 PM, Steve Davies <davies147 at gmail.com> wrote:
> Hi,
>
> Thanks for looking at this.
>
> On 29 January 2014 16:34, Matthew Jordan <mjordan at digium.com> wrote:
> [snip]
>>
>> You'll note that the refcall is only called if ast_sched_del
>> (eventually) returns a valid ID that it deleted.
>
>
> Agreed. In my proposed "failure" example, ast_sched_del is called first, and
> is called in many places in chan_sip with an unref function provided. As a
> result, ast_sched_del will indeed decrease the refcount. The ASTERISK-22079
> backtrace refers to a keepalive timer that will be cancelled more often than
> it is needed in most environments, so this cancel code should execute
> regularly.
>
>> In general, chan_sip
>> doesn't let the scheduler re-schedule events itself, which means that
>> if the scheduler fires the event, then only the event callback should
>> un-ref the object (thus consuming the reference of the ao2 object when
>> it was put on the scheduler). If the scheduler doesn't fire the event
>> and the delete call wins the lock contention, then the delete call
>> should unref the object, and the event should never fire.
>
>
> If the scheduler wins, it does not know whether to delete the scheduled job
> from the queue until execution ends because the re-scheduling is determined
> by the return value from the event-call. If the event call is blocked
> waiting for a lock, then there could be plenty of time for ast_sched_del to
> get called from elsewhere between the start and end of the event execution.

While there is a race condition between the callback firing in
ast_sched_runq and ast_sched_del, as ast_sched_runq unlocks con->lock
in order to process the event, it shouldn't cause a double de-ref if
for no other reason than ast_sched_runq deliberately removes the item
from the scheduler's hashtab:

        if (!ast_hashtab_remove_this_object(con->schedq_ht, current)) {
            ast_log(LOG_ERROR,"Sched entry %d was in the schedq list
but not in the hashtab???\n", current->id);
        }

That *should* prevent ast_sched_del from ever succeeding in the case
that it obtains the con->lock while the event is fired. I am curious
if, when this is reproduced, this particular ERROR message showed up
(or any other other off nominals in scheduler).

That does mean that there is a potential memory leak here: a cancel
request attempts to cancel an event that is being fired; the cancel
fails because the callback is executing; upon finishing the callback
the scheduler re-schedules the callback. But that's not the symptom
that the user is reporting in this particular case.

>
> The specific example in ASTERISK-22079 might be fixed by checking whether
> the pvt->keepaliveschedid (variable name is from memory) has changed either
> side of grabbing the pvt and chan locks in the event. If it has changed,
> bail. If that change is right, it should also allow for the big "#if 0
> ..TODO" block in the keepalive code to be reactivated. Perhaps I'll try that
> and put some highly verbose logging in there to indicate what is happening
> in real life?
>

I'd be very interested to see how this is occurring in chan_sip. Some
extra debugging and error checking would probably be helpful - as I
agree there's something a little fishy in how the scheduler is
maintaining its items.


-- 
Matthew Jordan
Digium, Inc. | Engineering Manager
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: http://digium.com & http://asterisk.org



More information about the asterisk-dev mailing list