[asterisk-bugs] [JIRA] (ASTERISK-25213) Possibility of deadlock in chan_sip INVITE early Replace code
Walter Doekes (JIRA)
noreply at issues.asterisk.org
Wed Jul 1 02:07:32 CDT 2015
[ https://issues.asterisk.org/jira/browse/ASTERISK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Walter Doekes updated ASTERISK-25213:
-------------------------------------
Attachment: issueA25213-unlock_refer_around_sip_new-PROOF.patch
Thanks for checking and reproducing, Rusty.
This attached proof of concept patch {{issueA25213-unlock_refer_around_sip_new-PROOF.patch}} ran (under valgrind) the entire night without deadlocks \[1\]. This morning I removed the patch again, and it deadlocked in 30 seconds \[2\].
\[1\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
Call-rate(length) Port Total-time Total-calls Remote-host
10.0(0 ms)/1.000s 5073 53245.13 s 164551 127.0.0.1:5060(UDP)
0 new calls during 1.004 s period 1 ms scheduler resolution
45 calls (limit 45) Peak was 45 calls, after 4 s
0 Running, 109 Paused, 22 Woken up
329130 dead call msg (discarded) 0 out-of-call msg (discarded)
2 open sockets
Messages Retrans Timeout Unexpected-Msg
Pause [ 1000ms] 164551 0
INVITE ----------> 164551 419958 104
100 <---------- 0 0 0 0
401 <---------- 164427 0 0 0
407 <---------- 0 0 0 0
180 <---------- 0 0 0 0
183 <---------- 0 0 0 0
200 <---------- 0 0 0 0
ACK ----------> 164427 0
INVITE ----------> 164427 331580 4
...
{noformat}
\[2\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
Call-rate(length) Port Total-time Total-calls Remote-host
10.0(0 ms)/1.000s 5073 43.17 s 74 127.0.0.1:5060(UDP)
0 new calls during 1.004 s period 1 ms scheduler resolution
45 calls (limit 45) Peak was 45 calls, after 4 s
0 Running, 76 Paused, 13 Woken up
18 dead call msg (discarded) 0 out-of-call msg (discarded)
2 open sockets
Messages Retrans Timeout Unexpected-Msg
Pause [ 1000ms] 74 0
INVITE ----------> 74 353 4
100 <---------- 0 0 0 0
401 <---------- 42 0 0 0
407 <---------- 0 0 0 0
180 <---------- 0 0 0 0
183 <---------- 0 0 0 0
200 <---------- 0 0 0 0
ACK ----------> 42 0
INVITE ----------> 42 96 0
...
{noformat}
On to fix the XXX part. I don't think the resulting patch will win any prizes, but in the good ol' chan_sip it may be sufficient.
> Possibility of deadlock in chan_sip INVITE early Replace code
> -------------------------------------------------------------
>
> Key: ASTERISK-25213
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-25213
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Channels/chan_sip/General
> Affects Versions: 11.18.0, 11.19.0
> Reporter: Walter Doekes
> Attachments: callpickup-alice.xml, callpickup-bob-waits-and-tells-charlie.xml, callpickup-charlie.xml, callpickup-extensions.conf, callpickup.sh, callpickup-sip.conf, deadlock-detect.txt, issueA25213-unlock_refer_around_sip_new-PROOF.patch
>
>
> It seems there is a lock order problem with the following code in chan_sip:
> {code}
> static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
> ...
> /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
> if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
> ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
> transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
> error = 1;
> } else {
> refer_locked = 1;
> }
> ...
> if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) { /* No matching extension found */
> ...
> } else {
> ...
> /* First invitation - create the channel. Allocation
> * failures are handled below. */
> c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
> {code}
> The above code:
> - locks a channel
> - then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)
> This order is wrong, because most of the other code does the inverse:
> - locks the channel list
> - locks a channel
> This is a problem when:
> - one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
> - another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel
> Now the two threads start waiting on each other.
> Reproducing, see the attached files, starting with {{callpickup}}:
> {noformat}
> $ ls /etc/asterisk/ -1
> asterisk.conf
> callpickup-alice.xml
> callpickup-bob-waits-and-tells-charlie.xml
> callpickup-charlie.xml
> callpickup.sh
> extensions.conf
> logger.conf
> modules.conf
> sip.conf
> {noformat}
> Run like this:
> {noformat}
> shell1# ./callpickup.sh
> shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
> -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
> -p 5073 -key tel devil 127.0.0.1
> shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
> {noformat}
> With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
> Sometimes it doesn't crash, in that case, it shows something like:
> {noformat}
> # grep locked /tmp/deadlock-detect.txt
> [Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
> [Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
> [Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
> {noformat}
> I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.
> Cheers,
> Walter Doekes
> OSSO B.V.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list