[asterisk-dev] RTP streams suddenly stop

Tony Mountifield tony at softins.clara.co.uk
Wed Feb 24 11:37:46 CST 2010


Mark, thank you for your reply - much appreciated! See below...

In article <4B85584D.5030408 at digium.com>,
Mark Michelson <mmichelson at digium.com> wrote:
> Tony Mountifield wrote:
> > (Deliberate top-post; please see below for the original description of
> > the problem, posted on 4 Feb)
> > 
> > Well, it turns out that restoring the calls to CHECK_BLOCKING etc., that
> > I had previously omitted, did not fix the problem, as it has recently
> > occurred again.
> > 
> > I would really, Really, REALLY appreciate some helpful comments from
> > those people here who are expert in the relevant parts of the code, most
> > likely channel.c and/or chan_sip.c.
> > 
> > Let me summarise the latest occurrence, and some interesting facts.
> > 
> > At 12:20:46 yesterday, all twelve outbound RTP streams suddenly stopped.
> > Ten of them were in Meetme conferences, and two of them were just
> > listening to MoH. Although the ten Meetme participants gave up after a
> > while and hung up, the two calls in MoH were left running. They were
> > direct connections to SIP phones, not to an ITSP.
> > 
> > At almost exactly 5 minutes later, the two streams to the SIP phones
> > resumed.
> > 
> > So the first question is: what in the code might have a timeout of
> > exactly 5 minutes, that might hold up all RTP streams until it times
> > out? Presumably some kind of lock.
> > 
> > Examining the packets before and after the pause was interesting. The
> > last packet before the pause was sent at 12:20:46.895666, and had a
> > SEQ of 19079, and a TS of 117477992. The next packet seen by my monitor
> > on that stream was sent at 12:25:46.883298, and had a SEQ = 19080, and a
> > TS of 119877888. It's interesting that the TS was almost 5 minutes
> > later, but that the SEQ was consecutive from the previous one. I'm
> > hoping that might give both me and the many others who reply to this (!)
> > a clue as to where to look.
> > 
> > As I mentioned before, during the 5-minute hiatus, new RTP streams would
> > only last a single packet before stalling.
> > 
> > I will be scouring the code this afternoon, but if anyone has any good
> > ideas, I would be very grateful indeed. I'm sure the 5 minutes is a
> > significant piece of data.
> > 
> > Cheers
> > Tony
> > 
> 
> The most common timed operations in Asterisk are calls to poll(2). I'm not 
> really sure why this would cause a system-wide block to all threads though. If 
> you were using Asterisk 1.4 or higher, I would recommend executing "core show 
> locks" at the CLI while the problem is happening. That way you could see if 
> there is some thread holding onto a lock and going into a blocking operation for 
> five minutes. As it is, your best bet is to look at a backtrace of all threads 
> when the problem occurs. Locate threads that should be sending RTP but are not. 
> Are they currently attempting to acquire a lock? If so, then find a thread which 
> is currently in a poll or select system call and see if it is holding the lock 
> that everyone else is attempting to acquire.

Yes, this is the conclusion I have been coming to. I found the gstack(1) script
for taking a backtrace of all threads in a running process. It is somewhat
intrusive to the running process, but when Asterisk is in the affected state, it
can't make things worse than they are. I'll try to enhance my monitor program
to detect this stalled condition and invoke gstack at that time.
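
A minimal sketch of that detection step, in Python, might look like the
following. It is only an illustration under stated assumptions: the
per-stream "last packet seen" table, the 5-second threshold, and the function
names are all hypothetical, and the only external command used is gstack,
invoked with the process ID as its single argument.

```python
import subprocess
import time

STALL_SECONDS = 5.0  # ASSUMPTION: illustrative threshold, not from the original post


def is_stalled(last_packet_times, now, threshold=STALL_SECONDS):
    """Return True if at least one stream is known and every stream has been
    silent for longer than `threshold` seconds."""
    return bool(last_packet_times) and all(
        now - seen > threshold for seen in last_packet_times.values()
    )


def capture_backtrace(pid, run=subprocess.run):
    """Run gstack against `pid` and return its standard output as text."""
    result = run(["gstack", str(pid)], capture_output=True, text=True, timeout=60)
    return result.stdout


def check_once(last_packet_times, pid, now=None, run=subprocess.run):
    """Return a backtrace if all streams look stalled, otherwise None."""
    now = time.time() if now is None else now
    if is_stalled(last_packet_times, now):
        return capture_backtrace(pid, run=run)
    return None
```

Requiring every monitored stream to be silent, rather than any single one,
matches the symptom described here, where all outbound streams stop together.
The `run` parameter exists only so the logic can be exercised without
attaching to a live process.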

I've been looking through the code trying to find somewhere that could stall
the RTP streams but still allow other operations: SIP call setup and cleardown
continue to work (although callers hear silence), and AMI still works too.
I did wonder about the SIP pvt lock in ast_rtp_write(), but then discovered
that it is a per-channel lock, not a SIP-wide lock. I also found nothing
down in ast_rtp_raw_write() that would block, since the socket is non-blocking.

So it makes me wonder whether the issue is above ast_write(), not below.

> Also, I know that you are probably frustrated that you haven't gotten the volume 
> of responses you would like on the list, but I think there are two reasons for 
> that. One is that the questions you ask are sometimes so complex that people 
> simply just don't know the answer to them (i.e. the CHECK_BLOCKING question you 
> had at one point). The other is more specific to this case, which is that you 
> are using a version of Asterisk which is, aside from security updates, 
> unmaintained. There have been thousands of bugs fixed in 1.4 and higher versions 
> that have not been backported to 1.2, so it's incredibly difficult to pinpoint 
> whether this problem is one that still exists in current code and needs to be 
> addressed or if this is something that was fixed years ago in current branches.

Yes, I take your points. When I find an issue like this and post a complex
query, though, I'm looking for pointers and ideas more than solutions, as I'm
sure there must be people, at least at Digium, who understand things like
CHECK_BLOCKING better than I do. I also use svnview and comparisons of local
trees to see what relevant things might have changed in later versions or in
trunk.

My main question at the moment is "what mechanism could stall all RTP streams?",
and a clue is the almost-exact five minutes for which it happens.
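
As a sanity check on the packet figures quoted above, the following Python
sketch compares the wall-clock gap with the gap implied by the RTP
timestamps. It assumes an 8 kHz RTP clock and 20 ms packets (usual for
G.711, but the codec is not stated in this thread); the variable names are
illustrative only.

```python
from datetime import datetime

# Values taken from the capture described earlier in this message.
last_before = {"time": "12:20:46.895666", "seq": 19079, "ts": 117477992}
first_after = {"time": "12:25:46.883298", "seq": 19080, "ts": 119877888}

CLOCK_HZ = 8000           # ASSUMPTION: 8 kHz RTP clock (codec not stated)
SAMPLES_PER_PACKET = 160  # ASSUMPTION: 20 ms packets at 8 kHz


def parse(stamp):
    """Parse an HH:MM:SS.ffffff capture time (same day assumed)."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


# Gap between the two packets as seen by the monitor.
wall_seconds = (parse(first_after["time"]) - parse(last_before["time"])).total_seconds()

# Gap according to the RTP header fields.
seq_delta = first_after["seq"] - last_before["seq"]
ts_delta = first_after["ts"] - last_before["ts"]
ts_seconds = ts_delta / CLOCK_HZ

# Packets that would normally have been sent in that interval.
expected_packets = ts_delta / SAMPLES_PER_PACKET

print(f"wall-clock gap : {wall_seconds:.6f} s")
print(f"timestamp gap  : {ts_seconds:.6f} s ({ts_delta} ticks)")
print(f"sequence gap   : {seq_delta}")
print(f"packets expected at 20 ms spacing: {expected_packets:.0f}")
```

Under those assumptions the timestamp advanced by very nearly the same
amount as the wall clock while SEQ advanced by only one. That suggests no
packets for this stream reached the RTP sending code during the gap, rather
than packets being sent and lost on the network, which would fit the idea
that the stall is above ast_write().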

Thanks again
Tony
-- 
Tony Mountifield
Work: tony at softins.co.uk - http://www.softins.co.uk
Play: tony at mountifield.org - http://tony.mountifield.org


