[asterisk-bugs] [JIRA] (ASTERISK-25213) Possibility of deadlock in chan_sip INVITE early Replace code

Wed Jul 1 02:07:32 CDT 2015

     [ https://issues.asterisk.org/jira/browse/ASTERISK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Walter Doekes updated ASTERISK-25213:
-------------------------------------

    Attachment: issueA25213-unlock_refer_around_sip_new-PROOF.patch

Thanks for checking and reproducing, Rusty.

This attached proof of concept patch {{issueA25213-unlock_refer_around_sip_new-PROOF.patch}} ran (under valgrind) the entire night without deadlocks \[1\]. This morning I removed the patch again, and it deadlocked in 30 seconds \[2\].

\[1\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
  Call-rate(length)   Port   Total-time  Total-calls  Remote-host
  10.0(0 ms)/1.000s   5073   53245.13 s       164551  127.0.0.1:5060(UDP)

  0 new calls during 1.004 s period      1 ms scheduler resolution
  45 calls (limit 45)                    Peak was 45 calls, after 4 s
  0 Running, 109 Paused, 22 Woken up
  329130 dead call msg (discarded)       0 out-of-call msg (discarded)        
  2 open sockets                        

                                 Messages  Retrans   Timeout   Unexpected-Msg
       Pause [   1000ms]         164551                        0        
      INVITE ---------->         164551    419958    104                
         100 <----------         0         0         0         0        
         401 <----------         164427    0         0         0        
         407 <----------         0         0         0         0        
         180 <----------         0         0         0         0        
         183 <----------         0         0         0         0        
         200 <----------         0         0         0         0        
         ACK ---------->         164427    0                            
      INVITE ---------->         164427    331580    4                  
...
{noformat}

\[2\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
  Call-rate(length)   Port   Total-time  Total-calls  Remote-host
  10.0(0 ms)/1.000s   5073      43.17 s           74  127.0.0.1:5060(UDP)

  0 new calls during 1.004 s period      1 ms scheduler resolution
  45 calls (limit 45)                    Peak was 45 calls, after 4 s
  0 Running, 76 Paused, 13 Woken up
  18 dead call msg (discarded)           0 out-of-call msg (discarded)        
  2 open sockets                        

                                 Messages  Retrans   Timeout   Unexpected-Msg
       Pause [   1000ms]         74                            0        
      INVITE ---------->         74        353       4                  
         100 <----------         0         0         0         0        
         401 <----------         42        0         0         0        
         407 <----------         0         0         0         0        
         180 <----------         0         0         0         0        
         183 <----------         0         0         0         0        
         200 <----------         0         0         0         0        
         ACK ---------->         42        0                            
      INVITE ---------->         42        96        0                  
...
{noformat}

On to fix the XXX part. I don't think the resulting patch will win any prizes, but in the good ol' chan_sip it may be sufficient.

> Possibility of deadlock in chan_sip INVITE early Replace code
> -------------------------------------------------------------
>
>                 Key: ASTERISK-25213
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25213
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 11.18.0, 11.19.0
>            Reporter: Walter Doekes
>         Attachments: callpickup-alice.xml, callpickup-bob-waits-and-tells-charlie.xml, callpickup-charlie.xml, callpickup-extensions.conf, callpickup.sh, callpickup-sip.conf, deadlock-detect.txt, issueA25213-unlock_refer_around_sip_new-PROOF.patch
>
>
> It seems there is a lock order problem with the following code in chan_sip:
> {code}
> static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
> ...
>                 /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
>                 if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
>                         ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
>                         transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
>                         error = 1;
>                 } else {
>                         refer_locked = 1;
>                 }
> ...
>                 if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) {     /* No matching extension found */
> ...
>                 } else {
> ...
>                         /* First invitation - create the channel.  Allocation
>                          * failures are handled below. */
>                         c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
> {code}
> The above code:
> - locks a channel
> - then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)
> This order is wrong, because most of the other code does the inverse:
> - locks the channel list
> - locks a channel
> This is a problem when:
> - one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
> - another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel
> Now the two threads start waiting on each other.
> Reproducing, see the attached files, starting with {{callpickup}}:
> {noformat}
> $ ls /etc/asterisk/ -1
> asterisk.conf
> callpickup-alice.xml
> callpickup-bob-waits-and-tells-charlie.xml
> callpickup-charlie.xml
> callpickup.sh
> extensions.conf
> logger.conf
> modules.conf
> sip.conf
> {noformat}
> Run like this:
> {noformat}
> shell1# ./callpickup.sh
> shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
>   -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
>   -p 5073 -key tel devil 127.0.0.1
> shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
> {noformat}
> With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
> Sometimes it doesn't crash, in that case, it shows something like:
> {noformat}
> # grep locked /tmp/deadlock-detect.txt 
> [Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
> [Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
> [Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
> {noformat}
> I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.
> Cheers,
> Walter Doekes
> OSSO B.V.

--
This message was sent by Atlassian JIRA
(v6.2#6252)