[Asterisk-code-review] sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same... (...asterisk[13])

Walter Doekes asteriskteam at digium.com
Tue Mar 12 04:25:00 CDT 2019


Walter Doekes has posted comments on this change. ( https://gerrit.asterisk.org/c/asterisk/+/10991 )

Change subject: sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same thread
......................................................................


Patch Set 2:

> "this could potentially have unintended consequences"

The change to the scheduler itself (the pthread_self) check is safe enough: before it would deadlock, now it ERRORs and allows you to continue (with a possibly degraded asterisk).

So the only scary change is the now-not-mandatory unref call in the macro. However, doing it like this changeset, makes it more aligned with how its friend, the AST_SCHED_REPLACE_VARIABLE_UNREF call, already works: I would not be surprised if most of the callers of AST_SCHED_DEL_UNREF already assumed behaviour would be similar.

    if (!_res && _data)   // <-- only if id > -1 would we do the unref
        unrefcall; /* should ref _data! */

(I think the caller in res/res_sorcery_memory_cache.c assumes so, when unreffing cache after getting -1. Correct me if I'm wrong.)


> Tons of things use the scheduler and and written with the old behavior in mind.

Are you sure? I'm only touching the AST_SCHED_DEL_UNREF macro, and the users are only:

$ find . '!' -path './.pc/*' '(' -name '*.c' -o -name '*.cpp' -o -name '*.h' ')' | xargs grep AST_SCHED_DEL_UNREF | sed -e 's/:.*//' | uniq -c
      2 ./res/res_sorcery_memory_cache.c
      2 ./res/res_rtp_asterisk.c
      1 ./res/res_pjsip_pubsub.c
     34 ./channels/chan_sip.c
      1 ./include/asterisk/sched.h

That is:

- no macro's using the AST_SCHED_DEL_UNREF macro
- 5 occurrences outside of chan_sip; they can be manually reviewed

That leaves us with the chan_sip case:

- I have one example where this *fixes* a real deadlock problem (I'm fairly confident this used to be a double-free in Asterisk 11, as I mentioned in the ticket)
- the other occurrences; I cannot vouch for all of them, but generally the trend seems to be: an item is expected to be scheduled (return non -1) and the item is then unreffed (see [2], but see also [3])

And chan_sip is the place where this *fixes* things.

Making the unref optional, generally only causes a memory leak. Whereas calling it too often (like what happens without the patch) causes double-frees.

I know which one I prefer.


> Have you tried running the Asterisk testsuite against this change?

No. I thought the friendly-automation and its verified+1 did that. I cannot find the particular run in Jenkins though, I believe it may have been pruned by now.

Let me counter that: have you tried running reproducing_sched_unref_self_deadlock.patch[1] ?

[1] https://issues.asterisk.org/jira/secure/attachment/58136/ASTERISK-28282_reproducing_sched_unref_self_deadlock.patch

If you load that into a an Asterisk and do some moderate traffic on it, the deadlock occurs. If you patch Asterisk with this changeset, the deadlock is replaced by a benign ERROR message. (Possibly with a ref-/memleak in the rarest of circumstances.)


Lastly, I have some live behaviour that can at least partially attest to some success [4].

----

[2] Good behaviour, it cleans up immediately (this is the common case I see in chan_sip):

    pvt->autokillid = ast_sched_add(sched, ms, __sip_autodestruct, pvt);
    if (pvt->autokillid < 0) {
        /* Uh Oh.  Expect bad behavior. */
        dialog_unref(pvt, "Failed to schedule autokillid");

and:

        if (-1 < pvt->autokillid) {
                append_history(pvt, "CancelDestroy", "");
                AST_SCHED_DEL_UNREF(sched, pvt->autokillid,

(dialog->reinviteid,  behaves like this too, as does dialog->autokillid and dialog->provisional_keepalive_sched_id)
(dialog->request_queue_sched_id does nothing at all: that's scary actually, does it clean up after someone elses ref?)

(This is *not* an exhaustive list.)

----

[3] Unexpected/bad behaviour, it does an extra ref:

    p->waitid = ast_sched_add(sched, wait, sip_reinvite_retry, p);
    if (p->waitid < 0) {
        /* Uh Oh.  Expect bad behavior. */
        dialog_ref(p, "Failed to schedule waitid"); // <-- this is not supposed to happen though, and see [5]

And no checks for -1 this time:

    AST_SCHED_DEL_UNREF(sched, p->initid, dialog_unref(p, "when you delete the initid sched, you should dec the refcount for the stored dialog ptr"));

But note that if waitid is altered (set to -1) along the road, it is also unreffed. Only if ast_sched_add fails, do we appear to get a refleak after this changeset.

And (as I hinted at above) REPLACE_UNREF already uses the "I expect this to run only if it succeeds"-semantics:

    AST_SCHED_REPLACE_UNREF(p->initid, sched, p->timer_b, auto_congest, p,
        dialog_unref(_data, "dialog ptr dec when SCHED_REPLACE del op succeeded"),
        dialog_unref(p, "dialog ptr dec when SCHED_REPLACE add failed"),
        dialog_ref(p, "dialog ptr inc when SCHED_REPLACE add succeeded") );

(Wait what!? Using "_data" which is declared inside the AST_SCHED_REPLACE_UNREF macro? Aw.. now I have to bleach my eyes. Someone even did that in res/res_pjsip_pubsub.c.)

In any case: _data (which would be the previous "p") would only get unreffed if waitid > -1 in that replace case. The AST_SCHED_DEL_UNREF semantics should be the same, IMHO.

----

[4] On a random machine with this changeset, I see this:

  # asterisk -nrx 'core show uptime'
  System uptime: 1 week, 4 days, 14 hours, 13 minutes, 23 seconds
  Last reload: 5 days, 12 hours, 32 minutes, 1 second

  # asterisk -nrx 'sip show sched'
   Highwater = 593
   schedcnt = 276
  ...

  # asterisk -nrx 'core show channels count'; \
    asterisk -nrx 'sip show objects' | awk '/^refcount/{n+=1;s+=$2}END{print n " " s}'
  274 active channels
  139 active calls
  804000 calls processed
  303 1150

  ^-- 303 objects, 1150 refs; sounds pretty legit

----

[5] Surely that is a bug?

$ grep -A1 'Expect bad behavior'  channels/chan_sip.c | grep '_ref\|_unref'
		ao2_t_ref(pkt, -1, "Failed to schedule stop packet retransmission action");
		dialog_unref(pvt, "Failed to schedule cancel destroy action");
		dialog_unref(pvt, "Failed to schedule autokillid");
		dialog_unref(p, "Failed to schedule destroy action");
		dialog_unref(pvt, "Failed to schedule stop provisional keepalive action");
		dialog_unref(pvt, "Failed to schedule stop reinviteid action");
						dialog_unref(p, "Failed to schedule reinviteid");
		ao2_t_ref(mwi, -1, "Failed to schedule shutdown MWI subscription action");
		ao2_t_ref(mwi, -1, "Failed to schedule mwi resub");
		ao2_t_ref(mwi, -1, "Failed to schedule start MWI subscription action");
		ao2_t_ref(reg, -1, "Failed to schedule reregister timeout");
		ao2_t_ref(reg, -1, "Failed to schedule start reregister timeout action");
		ao2_t_ref(reg, -1, "Failed to schedule stop register timeout action");
		ao2_t_ref(reg, -1, "Failed to schedule register timeout");
		ao2_t_ref(reg, -1, "Failed to schedule start register timeout action");
		dialog_unref(pvt, "Failed to schedule check pending actions action");
		dialog_unref(pvt, "Failed to schedule stop reinvite retry action");
					dialog_ref(p, "Failed to schedule waitid");
		dialog_unref(pvt, "Failed to schedule stop t38id action");
		dialog_unref(pvt, "Failed to schedule t38id");
		dialog_unref(pvt, "Failed to schedule start t38id action");
		dialog_unref(pvt, "Failed to schedule stop session timer action");
		dialog_unref(pvt, "Failed to schedule start session timer action");
		ao2_t_ref(reg, -1, "Failed to schedule cleanup_registration action");

^-- there's only 1 doing a ref instead of an unref; looks like a typo to me

----


-- 
To view, visit https://gerrit.asterisk.org/c/asterisk/+/10991
To unsubscribe, or for help writing mail filters, visit https://gerrit.asterisk.org/settings

Gerrit-Project: asterisk
Gerrit-Branch: 13
Gerrit-Change-Id: Ic26777fa0732725e6ca7010df17af77a012aa856
Gerrit-Change-Number: 10991
Gerrit-PatchSet: 2
Gerrit-Owner: Walter Doekes <walter+asterisk at wjd.nu>
Gerrit-Reviewer: Friendly Automation
Gerrit-Reviewer: Walter Doekes <walter+asterisk at wjd.nu>
Gerrit-Reviewer: mattf <creslin at digium.com>
Gerrit-Comment-Date: Tue, 12 Mar 2019 09:25:00 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-code-review/attachments/20190312/4993129b/attachment-0001.html>


More information about the asterisk-code-review mailing list