[Asterisk-code-review] chan sip: Don't unload module while TCP/TLS threads are stil... (asterisk[master])

Thu Dec 1 13:57:41 CST 2016

Corey Farrell has posted comments on this change. ( https://gerrit.asterisk.org/4536 )

Change subject: chan_sip: Don't unload module while TCP/TLS threads are still running
......................................................................

Patch Set 2:

> It looks like you're right about the pthread_mutex_lock. So we
 > cannot assume that a signal eventually terminates a thread. This
 > means that we need to somehow ensure that deadlocks never happen,
 > otherwise we would be leaking threads.

This is true, but any potential deadlocks are separate issues.

 > I performed a test with your suggestion of returning -1 if not all
 > threads have been terminated. For that, I added the SA_RESTART flag
 > to the SIGURG handler (because that makes it easier to get a
 > blocked thread), then I unloaded the module while a thread was
 > blocked on connect(). connect() came unblocked after its timeout
 > was reached and I got an astobj2 bad magic number assertion in the
 > ao2_t_unlink(threadt, me) in _sip_tcp_helper_thread, evidently
 > because threadt had been cleaned up during module unload. This
 > shows that a thread must never live past the point where chan_sip
 > waits five seconds during unload for the threads to terminate.

unload_module performs an ao2_cleanup(threadt), this needs to be skipped if threads remain.  This is what I meant by needing to do additional testing / tweaks when -1 is returned.  It's possible other parts of the chan_sip cleanup will need to remain allocated as well, anything used by _sip_tcp_helper_thread needs to remain valid.

I'm thinking it will be best to allow container contents to be cleaned, but leave all containers allocated if we're returning -1.  You could add 'return -1;' immediately after 'ast_debug(2, "TCP/TLS thread container did not become empty :(\n");'.  This instead of adding the 'res' variable.

 > So we need a way to reliably terminate all TCP threads, no matter
 > whether they are (a) running, (b) stuck on a blocking system call,
 > (c) temporarily stuck, or (d) irrecoverably deadlocked. a is no
 > problem because we have that alert_pipe mechanism. b is possible
 > since https://gerrit.asterisk.org/#/c/4494/. c causes the astobj2
 > bad magic number issue I described above -- unless we wait for the
 > thread to get un-stuck so we can terminate it, which my patch does.
 > d would result in a leaked thread in the current master and in an
 > infinite loop with my patch, I don't know what can be done about
 > that, but does that situation ever arise during normal operation
 > (i.e. as long as we don't hit another bug)?

(d) cannot be fixed by unload_module.  Once (d) happens it's over, thread(s) will definitely leak.  You are approaching this the wrong way -- "as long as we don't hit another bug".  I can guarantee we WILL hit other bugs.

Leaking a thread during shutdown can be very low impact / trivial in production environments, freezing during graceful shutdown/restart would be a major bug.  We need to focus on minimizing the damage of any leaked threads.

-- 
To view, visit https://gerrit.asterisk.org/4536
To unsubscribe, visit https://gerrit.asterisk.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I23a8abf9e8f73b57ff7ada61984b8506aea6cbde
Gerrit-PatchSet: 2
Gerrit-Project: asterisk
Gerrit-Branch: master
Gerrit-Owner: Michael Kuron <m.kuron at gmx.de>
Gerrit-Reviewer: Anonymous Coward #1000019
Gerrit-Reviewer: Corey Farrell <git at cfware.com>
Gerrit-Reviewer: Michael Kuron <m.kuron at gmx.de>
Gerrit-HasComments: No