[asterisk-dev] RTP streams suddenly stop

Wed Feb 24 10:48:13 CST 2010

Tony Mountifield wrote:
> (Deliberate top-post; please see below for the original description of
> the problem, posted on 4 Feb)
> 
> Well, it turns out that restoring the calls to CHECK_BLOCKING etc., that
> I had previously omitted, did not fix the problem, as it has recently
> occurred again.
> 
> I would really, Really, REALLY appreciate some helpful comments from
> those people here who are expert in the relevant parts of the code, most
> likely channel.c and/or chan_sip.c.
> 
> Let me summarise the latest occurrence, and some interesting facts.
> 
> At 12:20:46 yesterday, all twelve outbound RTP streams suddenly stopped.
> Ten of them were in Meetme conferences, and two of them were just
> listening to MoH. Although the ten Meetme participants gave up after a
> while and hung up, the two calls in MoH were left running. They were
> direct connections to SIP phones, not to an ITSP.
> 
> At almost exactly 5 minutes later, the two streams to the SIP phones
> resumed.
> 
> So the first question is: what in the code might have a timeout of
> exactly 5 minutes, that might hold up all RTP streams until it times
> out? Presumably some kind of lock.
> 
> Examining the packets before and after the pause was interesting. The
> last packet before the pause was sent at 12:20:46.895666, and had a
> SEQ of 19079, and a TS of 117477992. The next packet seen by my monitor
> on that stream was sent at 12:25:46.883298, and had a SEQ = 19080, and a
> TS of 119877888. It's interesting that the TS was almost 5 minutes
> later, but that the SEQ was consecutive from the previous one. I'm
> hoping that might give both me and the many others who reply to this (!)
> a clue as to where to look.
> 
> As I mentioned before, during the 5-minute hiatus, new RTP streams would
> only last a single packet before stalling.
> 
> I will be scouring the code this afternoon, but if anyone has any good
> ideas, I would be very grateful indeed. I'm sure the 5 minutes is a
> significant piece of data.
> 
> Cheers
> Tony
> 
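
As a quick sanity check on the figures quoted above, here is a minimal Python
sketch that compares the RTP timestamp gap with the wall-clock gap. It assumes
an 8000 Hz RTP clock (typical for narrowband codecs such as G.711, but not
stated in the report), and the helper name gap() is purely illustrative. It
also ignores sequence-number and timestamp wraparound.

```python
from datetime import datetime

RTP_CLOCK_HZ = 8000  # ASSUMPTION: 8 kHz RTP clock; the report does not name the codec

# Values copied from the report: last packet before the pause, first packet after.
last_before = {"time": "12:20:46.895666", "seq": 19079, "ts": 117477992}
first_after = {"time": "12:25:46.883298", "seq": 19080, "ts": 119877888}


def gap(before, after, clock_hz=RTP_CLOCK_HZ):
    """Compare SEQ, TS and wall-clock deltas between two captured packets."""
    fmt = "%H:%M:%S.%f"
    wall = (datetime.strptime(after["time"], fmt)
            - datetime.strptime(before["time"], fmt)).total_seconds()
    ts_delta = after["ts"] - before["ts"]
    return {
        "seq_delta": after["seq"] - before["seq"],
        "ts_delta": ts_delta,
        "ts_seconds": ts_delta / clock_hz,  # TS gap expressed in seconds
        "wall_seconds": wall,               # capture-time gap in seconds
    }


result = gap(last_before, first_after)

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key}: {value}")
```

Under that assumption the TS gap and the capture-time gap agree to within a
millisecond, while SEQ advances by only one. That is what you would expect if
the sender was blocked for the whole interval, rather than packets being sent
and lost somewhere on the way to the monitor.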

The most common timed operations in Asterisk are calls to poll(2). I'm not
really sure why one of those would block every thread system-wide, though. If
you were using Asterisk 1.4 or higher, I would recommend executing "core show
locks" at the CLI while the problem is happening. That way you could see if
there is some thread holding onto a lock and going into a blocking operation for
five minutes. As it is, your best bet is to look at a backtrace of all threads
when the problem occurs. Locate threads that should be sending RTP but are not. 
Are they currently attempting to acquire a lock? If so, then find a thread which 
is currently in a poll or select system call and see if it is holding the lock 
that everyone else is attempting to acquire.
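
To make the suspected pattern concrete, here is a minimal sketch in Python
rather than Asterisk's C. It is not taken from the Asterisk source: the names
holder(), sender() and measure_stall() are hypothetical, and the timeout is
shortened from five minutes to a fraction of a second. One thread takes a lock
and then sits in select() until the timeout expires, while a second thread that
needs the same lock cannot proceed.

```python
import select
import socket
import threading
import time

shared_lock = threading.Lock()  # stands in for whatever lock the RTP senders need


def holder(timeout, holding):
    """Take the lock, then block in select() on a socket that never becomes readable."""
    rd, wr = socket.socketpair()
    try:
        with shared_lock:
            holding.set()                         # signal that the lock is now held
            select.select([rd], [], [], timeout)  # returns only when the timeout expires
    finally:
        rd.close()
        wr.close()


def sender(sent_at):
    """A would-be RTP sender: it cannot 'send' until it acquires the lock."""
    with shared_lock:
        sent_at.append(time.monotonic())


def measure_stall(timeout=0.3):
    """Return how long the sender was held up by the blocked lock holder."""
    holding = threading.Event()
    sent_at = []
    h = threading.Thread(target=holder, args=(timeout, holding))
    h.start()
    holding.wait()                # make sure the holder owns the lock first
    started = time.monotonic()
    s = threading.Thread(target=sender, args=(sent_at,))
    s.start()
    s.join()
    h.join()
    return sent_at[0] - started


if __name__ == "__main__":
    print(f"sender stalled for {measure_stall():.2f}s")
```

In a backtrace of the real process, the equivalent of sender() would show up
waiting in a lock-acquire call, and the equivalent of holder() would show up
inside poll or select.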

Also, I know that you are probably frustrated that you haven't gotten the volume 
of responses you would like on the list, but I think there are two reasons for 
that. One is that the questions you ask are sometimes so complex that people
simply don't know the answer to them (e.g. the CHECK_BLOCKING question you
had at one point). The other is more specific to this case, which is that you
are using a version of Asterisk which is, aside from security updates, 
unmaintained. There have been thousands of bugs fixed in 1.4 and higher versions 
that have not been backported to 1.2, so it's incredibly difficult to pinpoint 
whether this problem is one that still exists in current code and needs to be 
addressed or if this is something that was fixed years ago in current branches.

Mark Michelson