[Asterisk-code-review] sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same... (...asterisk[13])
Walter Doekes
asteriskteam at digium.com
Tue Mar 12 04:25:00 CDT 2019
Walter Doekes has posted comments on this change. ( https://gerrit.asterisk.org/c/asterisk/+/10991 )
Change subject: sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same thread
......................................................................
Patch Set 2:
> "this could potentially have unintended consequences"
The change to the scheduler itself (the pthread_self) check is safe enough: before it would deadlock, now it ERRORs and allows you to continue (with a possibly degraded asterisk).
So the only scary change is the now-not-mandatory unref call in the macro. However, doing it like this changeset, makes it more aligned with how its friend, the AST_SCHED_REPLACE_VARIABLE_UNREF call, already works: I would not be surprised if most of the callers of AST_SCHED_DEL_UNREF already assumed behaviour would be similar.
if (!_res && _data) // <-- only if id > -1 would we do the unref
unrefcall; /* should ref _data! */
(I think the caller in res/res_sorcery_memory_cache.c assumes so, when unreffing cache after getting -1. Correct me if I'm wrong.)
> Tons of things use the scheduler and and written with the old behavior in mind.
Are you sure? I'm only touching the AST_SCHED_DEL_UNREF macro, and the users are only:
$ find . '!' -path './.pc/*' '(' -name '*.c' -o -name '*.cpp' -o -name '*.h' ')' | xargs grep AST_SCHED_DEL_UNREF | sed -e 's/:.*//' | uniq -c
2 ./res/res_sorcery_memory_cache.c
2 ./res/res_rtp_asterisk.c
1 ./res/res_pjsip_pubsub.c
34 ./channels/chan_sip.c
1 ./include/asterisk/sched.h
That is:
- no macro's using the AST_SCHED_DEL_UNREF macro
- 5 occurrences outside of chan_sip; they can be manually reviewed
That leaves us with the chan_sip case:
- I have one example where this *fixes* a real deadlock problem (I'm fairly confident this used to be a double-free in Asterisk 11, as I mentioned in the ticket)
- the other occurrences; I cannot vouch for all of them, but generally the trend seems to be: an item is expected to be scheduled (return non -1) and the item is then unreffed (see [2], but see also [3])
And chan_sip is the place where this *fixes* things.
Making the unref optional, generally only causes a memory leak. Whereas calling it too often (like what happens without the patch) causes double-frees.
I know which one I prefer.
> Have you tried running the Asterisk testsuite against this change?
No. I thought the friendly-automation and its verified+1 did that. I cannot find the particular run in Jenkins though, I believe it may have been pruned by now.
Let me counter that: have you tried running reproducing_sched_unref_self_deadlock.patch[1] ?
[1] https://issues.asterisk.org/jira/secure/attachment/58136/ASTERISK-28282_reproducing_sched_unref_self_deadlock.patch
If you load that into a an Asterisk and do some moderate traffic on it, the deadlock occurs. If you patch Asterisk with this changeset, the deadlock is replaced by a benign ERROR message. (Possibly with a ref-/memleak in the rarest of circumstances.)
Lastly, I have some live behaviour that can at least partially attest to some success [4].
----
[2] Good behaviour, it cleans up immediately (this is the common case I see in chan_sip):
pvt->autokillid = ast_sched_add(sched, ms, __sip_autodestruct, pvt);
if (pvt->autokillid < 0) {
/* Uh Oh. Expect bad behavior. */
dialog_unref(pvt, "Failed to schedule autokillid");
and:
if (-1 < pvt->autokillid) {
append_history(pvt, "CancelDestroy", "");
AST_SCHED_DEL_UNREF(sched, pvt->autokillid,
(dialog->reinviteid, behaves like this too, as does dialog->autokillid and dialog->provisional_keepalive_sched_id)
(dialog->request_queue_sched_id does nothing at all: that's scary actually, does it clean up after someone elses ref?)
(This is *not* an exhaustive list.)
----
[3] Unexpected/bad behaviour, it does an extra ref:
p->waitid = ast_sched_add(sched, wait, sip_reinvite_retry, p);
if (p->waitid < 0) {
/* Uh Oh. Expect bad behavior. */
dialog_ref(p, "Failed to schedule waitid"); // <-- this is not supposed to happen though, and see [5]
And no checks for -1 this time:
AST_SCHED_DEL_UNREF(sched, p->initid, dialog_unref(p, "when you delete the initid sched, you should dec the refcount for the stored dialog ptr"));
But note that if waitid is altered (set to -1) along the road, it is also unreffed. Only if ast_sched_add fails, do we appear to get a refleak after this changeset.
And (as I hinted at above) REPLACE_UNREF already uses the "I expect this to run only if it succeeds"-semantics:
AST_SCHED_REPLACE_UNREF(p->initid, sched, p->timer_b, auto_congest, p,
dialog_unref(_data, "dialog ptr dec when SCHED_REPLACE del op succeeded"),
dialog_unref(p, "dialog ptr dec when SCHED_REPLACE add failed"),
dialog_ref(p, "dialog ptr inc when SCHED_REPLACE add succeeded") );
(Wait what!? Using "_data" which is declared inside the AST_SCHED_REPLACE_UNREF macro? Aw.. now I have to bleach my eyes. Someone even did that in res/res_pjsip_pubsub.c.)
In any case: _data (which would be the previous "p") would only get unreffed if waitid > -1 in that replace case. The AST_SCHED_DEL_UNREF semantics should be the same, IMHO.
----
[4] On a random machine with this changeset, I see this:
# asterisk -nrx 'core show uptime'
System uptime: 1 week, 4 days, 14 hours, 13 minutes, 23 seconds
Last reload: 5 days, 12 hours, 32 minutes, 1 second
# asterisk -nrx 'sip show sched'
Highwater = 593
schedcnt = 276
...
# asterisk -nrx 'core show channels count'; \
asterisk -nrx 'sip show objects' | awk '/^refcount/{n+=1;s+=$2}END{print n " " s}'
274 active channels
139 active calls
804000 calls processed
303 1150
^-- 303 objects, 1150 refs; sounds pretty legit
----
[5] Surely that is a bug?
$ grep -A1 'Expect bad behavior' channels/chan_sip.c | grep '_ref\|_unref'
ao2_t_ref(pkt, -1, "Failed to schedule stop packet retransmission action");
dialog_unref(pvt, "Failed to schedule cancel destroy action");
dialog_unref(pvt, "Failed to schedule autokillid");
dialog_unref(p, "Failed to schedule destroy action");
dialog_unref(pvt, "Failed to schedule stop provisional keepalive action");
dialog_unref(pvt, "Failed to schedule stop reinviteid action");
dialog_unref(p, "Failed to schedule reinviteid");
ao2_t_ref(mwi, -1, "Failed to schedule shutdown MWI subscription action");
ao2_t_ref(mwi, -1, "Failed to schedule mwi resub");
ao2_t_ref(mwi, -1, "Failed to schedule start MWI subscription action");
ao2_t_ref(reg, -1, "Failed to schedule reregister timeout");
ao2_t_ref(reg, -1, "Failed to schedule start reregister timeout action");
ao2_t_ref(reg, -1, "Failed to schedule stop register timeout action");
ao2_t_ref(reg, -1, "Failed to schedule register timeout");
ao2_t_ref(reg, -1, "Failed to schedule start register timeout action");
dialog_unref(pvt, "Failed to schedule check pending actions action");
dialog_unref(pvt, "Failed to schedule stop reinvite retry action");
dialog_ref(p, "Failed to schedule waitid");
dialog_unref(pvt, "Failed to schedule stop t38id action");
dialog_unref(pvt, "Failed to schedule t38id");
dialog_unref(pvt, "Failed to schedule start t38id action");
dialog_unref(pvt, "Failed to schedule stop session timer action");
dialog_unref(pvt, "Failed to schedule start session timer action");
ao2_t_ref(reg, -1, "Failed to schedule cleanup_registration action");
^-- there's only 1 doing a ref instead of an unref; looks like a typo to me
----
--
To view, visit https://gerrit.asterisk.org/c/asterisk/+/10991
To unsubscribe, or for help writing mail filters, visit https://gerrit.asterisk.org/settings
Gerrit-Project: asterisk
Gerrit-Branch: 13
Gerrit-Change-Id: Ic26777fa0732725e6ca7010df17af77a012aa856
Gerrit-Change-Number: 10991
Gerrit-PatchSet: 2
Gerrit-Owner: Walter Doekes <walter+asterisk at wjd.nu>
Gerrit-Reviewer: Friendly Automation
Gerrit-Reviewer: Walter Doekes <walter+asterisk at wjd.nu>
Gerrit-Reviewer: mattf <creslin at digium.com>
Gerrit-Comment-Date: Tue, 12 Mar 2019 09:25:00 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-code-review/attachments/20190312/4993129b/attachment-0001.html>
More information about the asterisk-code-review
mailing list