[asterisk-bugs] [JIRA] (ASTERISK-25213) Possibility of deadlock in chan_sip INVITE early Replace code

Tue Jun 30 14:13:33 CDT 2015

     [ https://issues.asterisk.org/jira/browse/ASTERISK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rusty Newton updated ASTERISK-25213:
------------------------------------

    Description: 
It seems there is a lock order problem with the following code in chan_sip:

{code}
static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
...
                /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
                if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
                        ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
                        transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
                        error = 1;
                } else {
                        refer_locked = 1;
                }
...
                if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) {     /* No matching extension found */
...
                } else {
...
                        /* First invitation - create the channel.  Allocation
                         * failures are handled below. */

                        c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
{code}

The above code:
- locks a channel
- then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)

This order is wrong, because most of the other code does the inverse:
- locks the channel list
- locks a channel

This is a problem when:
- one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
- another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel

Now the two threads start waiting on each other.

Reproducing, see the attached files, starting with {{callpickup}}:
{noformat}
$ ls /etc/asterisk/ -1
asterisk.conf
callpickup-alice.xml
callpickup-bob-waits-and-tells-charlie.xml
callpickup-charlie.xml
callpickup.sh
extensions.conf
logger.conf
modules.conf
sip.conf
{noformat}

Run like this:
{noformat}
shell1# ./callpickup.sh
shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
  -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
  -p 5073 -key tel devil 127.0.0.1
shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
{noformat}

With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
Sometimes it doesn't crash, in that case, it shows something like:

{noformat}
# grep locked /tmp/deadlock-detect.txt 
[Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
[Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
[Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
{noformat}

I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.

Cheers,
Walter Doekes
OSSO B.V.

  was:
It seems there is a lock order problem with the following code in chan_sip:

{code}
static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
...
                /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
                if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
                        ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
                        transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
                        error = 1;
                } else {
                        refer_locked = 1;
                }
...
                if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) {     /* No matching extension found */
...
                } else {
...
                        /* First invitation - create the channel.  Allocation
                         * failures are handled below. */

                        c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
{code}

The above code:
- locks a channel
- then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)

This order is wrong, because most of the other code does the inverse:
- locks the channel list
- locks a channel

This is a problem when:
- one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
- another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel

Now the two threads start waiting on each other.

Reproducing, see the attached files, starting with {{callpickup}}:
{noformat}
$ ls /etc/asterisk/ -1
asterisk.conf
callpickup-alice.xml
callpickup-bob-waits-and-tells-charlie.xml
callpickup-charlie.xml
callpickup.sh
extensions.conf
logger.conf
modules.conf
sip.conf
{noformat}

Run like this:
{noformat}
shell1# ./callpickup.sh
shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
  -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
  -p 5073 -key tel devil 127.0.0.1
shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
{noformat}

With DEBUG_THREADS, DONT_OPTIMIZE and DEADLOCK_DETECT, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
Sometimes it doesn't crash, in that case, it shows something like:

{noformat}
# grep locked /tmp/deadlock-detect.txt 
[Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
[Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
[Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
{noformat}

I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.

Cheers,
Walter Doekes
OSSO B.V.

> Possibility of deadlock in chan_sip INVITE early Replace code
> -------------------------------------------------------------
>
>                 Key: ASTERISK-25213
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25213
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>    Affects Versions: 11.19.0
>            Reporter: Walter Doekes
>         Attachments: callpickup-alice.xml, callpickup-bob-waits-and-tells-charlie.xml, callpickup-charlie.xml, callpickup-extensions.conf, callpickup.sh, callpickup-sip.conf, deadlock-detect.txt
>
>
> It seems there is a lock order problem with the following code in chan_sip:
> {code}
> static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
> ...
>                 /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
>                 if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
>                         ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
>                         transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
>                         error = 1;
>                 } else {
>                         refer_locked = 1;
>                 }
> ...
>                 if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) {     /* No matching extension found */
> ...
>                 } else {
> ...
>                         /* First invitation - create the channel.  Allocation
>                          * failures are handled below. */
>                         c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
> {code}
> The above code:
> - locks a channel
> - then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)
> This order is wrong, because most of the other code does the inverse:
> - locks the channel list
> - locks a channel
> This is a problem when:
> - one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
> - another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel
> Now the two threads start waiting on each other.
> Reproducing, see the attached files, starting with {{callpickup}}:
> {noformat}
> $ ls /etc/asterisk/ -1
> asterisk.conf
> callpickup-alice.xml
> callpickup-bob-waits-and-tells-charlie.xml
> callpickup-charlie.xml
> callpickup.sh
> extensions.conf
> logger.conf
> modules.conf
> sip.conf
> {noformat}
> Run like this:
> {noformat}
> shell1# ./callpickup.sh
> shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
>   -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
>   -p 5073 -key tel devil 127.0.0.1
> shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
> {noformat}
> With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
> Sometimes it doesn't crash, in that case, it shows something like:
> {noformat}
> # grep locked /tmp/deadlock-detect.txt 
> [Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
> [Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
> [Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
> [Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
> [Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
> {noformat}
> I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.
> Cheers,
> Walter Doekes
> OSSO B.V.

--
This message was sent by Atlassian JIRA
(v6.2#6252)