[asterisk-dev] Possible crash-scenario? Asterisk scheduler.

Matthew Jordan mjordan at digium.com
Wed Jan 29 10:34:51 CST 2014


On Wed, Jan 29, 2014 at 6:26 AM, Steve Davies <davies147 at gmail.com> wrote:
> Hi,
>
> I've been looking at ASTERISK-22079, and what might have caused it. The
> comment I added to the bottom of that ticket does not explain the
> segfault/backtrace that is attached to that ticket, so I have kept digging.
>
> Fundamentally, I would like to know what stops a scheduled event from being
> cancelled while it is already running (eg. if the running event is waiting
> on a lock)? The problem being that both the thread that cancels the event,
> and the running event are both likely to call an unref().
>
> Even worse, I believe that it is possible as a result of the above that
> while a thread is waiting on a lock, the data-structure it is waiting for is
> unref'ed to the point of being destroyed, so it ends up with a lock on a
> freed address.
>
> To try and explain what I mean, -/-- represent different threads. The
> following is probably a worst-case scenario:
>
> - thread1 started
> -- scheduler thread started
> - ao2_obj created (ref = 1)
> - ast_sched_add() called with ref increment (ref = 2)
> - locks ao2_obj (ref = 2, locked)
> -- sched fires event - waiting on lock
> - ast_sched_del_unref() called (ref = 1)
> -- sched event gets lock (ref = 1, locked)
> -- unrefs obj, unlocks and returns 0 to destroy sched. (ref = 0, destroyed)
> - any operation on ao2_obj is now illegal.
>
> Am I missing something obvious? Is there a defence mechanism I am not
> seeing?
>

Not quite, although the scheduler is certainly a bit wonky with its
handling of reference counted objects.

ast_sched_del_unref does not universally decrement the ref count.
Looking at the macro:

#define AST_SCHED_DEL_UNREF(sched, id, refcall)            \
    do { \
        int _count = 0; \
        while (id > -1 && ast_sched_del(sched, id) && ++_count < 10) { \
            usleep(1); \
        } \
        if (_count == 10) \
            ast_log(LOG_WARNING, "Unable to cancel schedule ID %d.
This is probably a bug (%s: %s, line %d).\n", id, __FILE__,
__PRETTY_FUNCTION__, __LINE__); \
        if (id > -1) \
            refcall; \
        id = -1; \
    } while (0);

You'll note that the refcall is only called if ast_sched_del
(eventually) returns a valid ID that it deleted. In general, chan_sip
doesn't let the scheduler re-schedule events itself, which means that
if the scheduler fires the event, then only the event callback should
un-ref the object (thus consuming the reference of the ao2 object when
it was put on the scheduler). If the scheduler doesn't fire the event
and the delete call wins the lock contention, then the delete call
should unref the object, and the event should never fire.

That aside, one of the major problems with the scheduler is that it
was written before reference counting (in particular, ao2 objects)
gained prevalence in the system. As a result, the consumers of the
scheduler attempt to use it with reference counting semantics, when it
would be better off if the scheduler itself actually managed the
lifetime of the objects that it was told to schedule. This has also
resulted in some rather crazy macros attempting to handle the
reference count of the objects in chan_sip - and there may (and
probably is) a bug somewhere in there. I'm just not sure this is it.

A long term goal of ours is to re-write the scheduler to manage the
lifetime of scheduled items itself in a cleaner fashion, and this
would probably resolve whatever bug may be causing this issue.

-- 
Matthew Jordan
Digium, Inc. | Engineering Manager
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: http://digium.com & http://asterisk.org



More information about the asterisk-dev mailing list