[asterisk-bugs] [JIRA] Commented: (ASTERISK-20212) Deadlock / TCP SIP Stack

Mon Aug 20 11:28:07 CDT 2012

    [ https://issues.asterisk.org/jira/browse/ASTERISK-20212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=195983#comment-195983 ] 

Mark Michelson commented on ASTERISK-20212:
-------------------------------------------

Welp, let me give some analysis of what's going on here, based on the core show locks and backtrace that you've given me.

There are three threads vying for the same lock.

Thread 48 is the "monitor thread" of chan_sip. It handles incoming UDP SIP messages as well as running scheduled SIP tasks. In this case, it is stuck trying to grab a lock because it is trying to indicate to a TCP client thread that it needs to send an OPTIONS request out to a specific peer.

Thread 3 is the TCP server thread. It handles new incoming TCP connections. In this case, it is handling an incoming REGISTER request, presumably from the same peer that Thread 48 corresponds with. On an incoming REGISTER, the thread that receives the REGISTER also attempts to indicate that an OPTIONS request should be sent. Much like Thread 48, this thread is stuck trying to grab the lock that it needs to grab before it can indicate that the TCP client thread needs to send an OPTIONS request.

Thread 2 is the culprit and is unfortunately also a mystery. Thread 2 is the TCP client thread that threads 3 and 48 are trying to signal to send an OPTIONS request. This thread handles incoming and outgoing SIP messages on this OPTIONS connection. The thread operates by calling ast_poll() (a wrapper for poll()) to find out if there is data to be read or written on the TCP socket. ast_poll() has indicated that there is data to read, so a loop is entered to read the data line-by-line from the request. The fgets() call to read the message data is, for some reason, blocking and not returning. This should not be happening given that ast_poll() claimed there was data to be read.

So if a TCP client thread gets stuck in a situation like this, it eventually will also block incoming UDP and TCP traffic from all sources. The question to answer is why the TCP client thread is blocking to begin with.

One thing that jumps to mind and that seems somewhat fishy to me is that our call to ast_poll() switches on both POLLIN and POLLPRI. This means that the poll will be interrupted both by inbound data to read as well as "urgent" TCP out-of-band data. I'm curious if the poll is being interrupted by an out-of-band TCP message, which in turn does not result in fgets() being able to read any data. I'm going to supply a patch for you that tests this as a possibility. I'm not guaranteeing anything right now; this is mostly a guess.

Now, if my patch does not work for you and you experience the deadlock again, then I'll need you to do a couple of things for me.

First, I'll want the same data you got for me before (core show locks output and a backtrace of all threads). In the backtrace, try to find the thread that is blocking in the fgets() call. In the backtrace you uploaded before, it was Thread 2. In gdb, navigate to that thread by first typing "thread <threadnum>". From there, navigate to the stack frame corresponding to _sip_tcp_helper_thread. In the backtrace you provided here, that is frame 9. You can navigate to that frame by typing "frame 9". Now that you're at that frame, I want you to type "print req.data" . This will let me know if the SIP message came through partially or not.

Second, I'll want you to upload a full debug log from a run where the problem occurred. This may enlighten me if the remote endpoint for some reason closed the TCP connection or something of that nature.

> Deadlock / TCP SIP Stack
> ------------------------
>
>                 Key: ASTERISK-20212
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-20212
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/TCP-TLS
>    Affects Versions: 10.7.0
>         Environment: CentOS 5.7 / Asterisk 10.7.0
>            Reporter: Phil Ciccone
>         Attachments: backtrace-threads.txt, core-show-locks.txt
>
>
> We have been experiencing Asterisk deadlocks for the past week, from at the time, what seemed like nowhere. We upgraded to the latest 10.7.0 stable and enabled the debug flag and disabled optimizations in hopes to catch the deadlock. It seems from looking at the locks we have some sort of problem with our TCP connected SIP devices. Things are now starting to make sense as connecting TCP SIP devices is something newer that we started to do. The deadlocks we experience happen anywhere from every-other-day to several times a day, all involving a restart. Would you be so kind as to look at the attached backtrace and locks output?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira