[asterisk-dev] [Code Review] 4071: scheduler: Fix a bug introduced by adding a delete flag to scheduled tasks
Jonathan Rose
reviewboard at asterisk.org
Fri Oct 10 13:47:01 CDT 2014
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4071/
-----------------------------------------------------------
Review request for Asterisk Developers, Matt Jordan and Mark Michelson.
Bugs: ASTERISK-24321
https://issues.asterisk.org/jira/browse/ASTERISK-24321
Repository: Asterisk
Description
-------
This issue was discovered by a rather complicated series of tests by PQ and it's somewhat intermittent relying on hitting the same race conditions that were being solved by r422070. When this problem hits the __sip_ack method in chan_sip in particular, things go south quickly.
The gist of it is that when we attempt to remove an existing task, we mark it for deletion and it is later removed from the scheduler. The deleted entry doesn't get free'd on account of the scheduler caching task structures so that we don't waste a bunch of effort reallocated all of the task structures every time a task needs to be created/torn down. When we wanted to remove a task that was currently executing, we couldn't do this immediately so we would apply a deleted flag. Unfortunately we didn't bother to clear the deleted flag when pulling it back off of the cache to create a new task. Oops. In any event, shenanigans ensued because the new task would be created already doomed and while they would be reported as successfully scheduled, ast_sched_runq would immediately delete the new task without replacing it the chan_sip __sip_ack function was anticipating that the tasks would stick around until it deleted them.
The fix is mind-numbingly simple for how long it took to me to figure out what the heck was going on... just remember to clear the deleted flag from scheduler entries when pulling them off the cache.
Diffs
-----
/branches/12/main/sched.c 425241
Diff: https://reviewboard.asterisk.org/r/4071/diff/
Testing
-------
We had a series of tests that, pre-patch would yield an assertion looking like the following:
[Oct 10 13:12:57] ERROR[18046][C-0000000e]: chan_sip.c:4428 __sip_ack: FRACK!, Failed assertion s != NULL, id=1570 (0)
Got 14 backtrace records
#0: [0x823d4fa] asterisk(__ast_assert_failed+0x7b) [0x823d4fa]
#1: [0x81fe913] asterisk(_ast_sched_del+0x2b3) [0x81fe913]
#2: [0xfeb248] /usr/lib/asterisk/modules/chan_sip.so [0xfeb248]
#3: [0x106c930] /usr/lib/asterisk/modules/chan_sip.so [0x106c930]
#4: [0x106d1e9] /usr/lib/asterisk/modules/chan_sip.so [0x106d1e9]
#5: [0x106cdb4] /usr/lib/asterisk/modules/chan_sip.so [0x106cdb4]
#6: [0x8171359] asterisk(ast_io_wait+0x14d) [0x8171359]
#7: [0x106f0f8] /usr/lib/asterisk/modules/chan_sip.so [0x106f0f8]
#8: [0x82399d5] asterisk [0x82399d5]d an infinite loop in the do_monitor thread couple with this set of log messages:
indicating that we were anticipating to find a scheduler entry that wasn't in the scheduler
Similar assertions occurred from other modules that involved schedulers but no chan_sip, but those didn't clearly break the world.
Typically a walk through the specified tests (which involved lots of chan_sip calls entering queues and lots of reloading of chan_sip and app_queue) would cause this breakdown to occur within one or two walks across the test series suggested. After performing the full set of tests 5 times each across Asterisk on two separate occasions without seeing any assertions of this type and without having chan_sip break down, I retested without the patch and quickly ran into the problem again. I think it's safe to say that I got it.
Thanks,
Jonathan Rose
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20141010/b0b87f33/attachment-0001.html>
More information about the asterisk-dev
mailing list