[Asterisk-code-review] chan sip: Don't unload module while TCP/TLS threads are stil... (asterisk[master])
Corey Farrell
asteriskteam at digium.com
Thu Dec 1 13:57:41 CST 2016
Corey Farrell has posted comments on this change. ( https://gerrit.asterisk.org/4536 )
Change subject: chan_sip: Don't unload module while TCP/TLS threads are still running
......................................................................
Patch Set 2:
> It looks like you're right about the pthread_mutex_lock. So we
> cannot assume that a signal eventually terminates a thread. This
> means that we need to somehow ensure that deadlocks never happen,
> otherwise we would be leaking threads.
This is true, but any potential deadlocks are separate issues.
> I performed a test with your suggestion of returning -1 if not all
> threads have been terminated. For that, I added the SA_RESTART flag
> to the SIGURG handler (because that makes it easier to get a
> blocked thread), then I unloaded the module while a thread was
> blocked on connect(). connect() came unblocked after its timeout
> was reached and I got an astobj2 bad magic number assertion in the
> ao2_t_unlink(threadt, me) in _sip_tcp_helper_thread, evidently
> because threadt had been cleaned up during module unload. This
> shows that a thread must never live past the point where chan_sip
> waits five seconds during unload for the threads to terminate.
unload_module performs an ao2_cleanup(threadt), this needs to be skipped if threads remain. This is what I meant by needing to do additional testing / tweaks when -1 is returned. It's possible other parts of the chan_sip cleanup will need to remain allocated as well, anything used by _sip_tcp_helper_thread needs to remain valid.
I'm thinking it will be best to allow container contents to be cleaned, but leave all containers allocated if we're returning -1. You could add 'return -1;' immediately after 'ast_debug(2, "TCP/TLS thread container did not become empty :(\n");'. This instead of adding the 'res' variable.
> So we need a way to reliably terminate all TCP threads, no matter
> whether they are (a) running, (b) stuck on a blocking system call,
> (c) temporarily stuck, or (d) irrecoverably deadlocked. a is no
> problem because we have that alert_pipe mechanism. b is possible
> since https://gerrit.asterisk.org/#/c/4494/. c causes the astobj2
> bad magic number issue I described above -- unless we wait for the
> thread to get un-stuck so we can terminate it, which my patch does.
> d would result in a leaked thread in the current master and in an
> infinite loop with my patch, I don't know what can be done about
> that, but does that situation ever arise during normal operation
> (i.e. as long as we don't hit another bug)?
(d) cannot be fixed by unload_module. Once (d) happens it's over, thread(s) will definitely leak. You are approaching this the wrong way -- "as long as we don't hit another bug". I can guarantee we WILL hit other bugs.
Leaking a thread during shutdown can be very low impact / trivial in production environments, freezing during graceful shutdown/restart would be a major bug. We need to focus on minimizing the damage of any leaked threads.
--
To view, visit https://gerrit.asterisk.org/4536
To unsubscribe, visit https://gerrit.asterisk.org/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I23a8abf9e8f73b57ff7ada61984b8506aea6cbde
Gerrit-PatchSet: 2
Gerrit-Project: asterisk
Gerrit-Branch: master
Gerrit-Owner: Michael Kuron <m.kuron at gmx.de>
Gerrit-Reviewer: Anonymous Coward #1000019
Gerrit-Reviewer: Corey Farrell <git at cfware.com>
Gerrit-Reviewer: Michael Kuron <m.kuron at gmx.de>
Gerrit-HasComments: No
More information about the asterisk-code-review
mailing list