[asterisk-dev] RTP streams suddenly stop
Tony Mountifield
tony at softins.clara.co.uk
Wed Feb 24 07:36:01 CST 2010
(Deliberate top-post; please see below for the original description of
the problem, posted on 4 Feb)
Well, it turns out that restoring the calls to CHECK_BLOCKING etc. that
I had previously omitted did not fix the problem: it has recently
occurred again.
I would really, Really, REALLY appreciate some helpful comments from
those people here who are expert in the relevant parts of the code, most
likely channel.c and/or chan_sip.c.
Let me summarise the latest occurrence, and some interesting facts.
At 12:20:46 yesterday, all twelve outbound RTP streams suddenly stopped.
Ten of them were in Meetme conferences, and two of them were just
listening to MoH. Although the ten Meetme participants gave up after a
while and hung up, the two calls in MoH were left running. They were
direct connections to SIP phones, not to an ITSP.
Almost exactly 5 minutes later, the two streams to the SIP phones
resumed.
So the first question is: what in the code might have a timeout of
exactly 5 minutes, that might hold up all RTP streams until it times
out? Presumably some kind of lock.
Examining the packets before and after the pause was interesting. The
last packet before the pause was sent at 12:20:46.895666, and had a
SEQ of 19079 and a TS of 117477992. The next packet seen by my monitor
on that stream was sent at 12:25:46.883298, and had a SEQ of 19080 and a
TS of 119877888. It's interesting that the TS had advanced by almost
5 minutes' worth, but that the SEQ followed on directly from the
previous one. I'm hoping that might give both me and the many others who
reply to this (!) a clue as to where to look.
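
As a rough check of those figures, the arithmetic can be sketched as
below. The 8 kHz RTP clock rate is an assumption (the codec is not
stated here), and the variable names are purely illustrative.

```python
from datetime import datetime

# Values quoted above: last packet before the pause, first packet after it.
before = {"time": "12:20:46.895666", "seq": 19079, "ts": 117477992}
after = {"time": "12:25:46.883298", "seq": 19080, "ts": 119877888}

CLOCK_HZ = 8000  # ASSUMPTION: 8 kHz RTP clock; not stated in the post


def wall_seconds(start, end):
    """Elapsed seconds between two same-day HH:MM:SS.ffffff strings."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


wall_gap = wall_seconds(before["time"], after["time"])
# RTP timestamps are 32-bit and sequence numbers 16-bit, so subtract modulo.
ts_gap = ((after["ts"] - before["ts"]) % 2**32) / CLOCK_HZ
seq_gap = (after["seq"] - before["seq"]) % 2**16

print(f"wall-clock gap: {wall_gap:.6f} s")
print(f"timestamp gap:  {ts_gap:.6f} s")
print(f"sequence gap:   {seq_gap}")
```

Under that assumption the timestamp gap and the wall-clock gap agree to
within a millisecond, while the sequence number advances by only one.
That would be consistent with no packets being generated for the stream
during the pause, rather than packets being sent and then lost.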
As I mentioned before, during the 5-minute hiatus, new RTP streams would
only last a single packet before stalling.
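
For anyone wanting to look for the same pattern in their own captures,
here is a minimal sketch of that kind of check. It is not the monitor I
am actually running: the function names, the 1-second threshold, the
SSRC and the payload type are all hypothetical, and it assumes the UDP
payloads for a single stream have already been extracted along with
their capture times. Only the fixed 12-byte RTP header layout is taken
as given.

```python
import struct


def parse_rtp_header(payload):
    """Decode the fixed 12-byte RTP header from a UDP payload."""
    if len(payload) < 12:
        raise ValueError("payload too short for an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", payload[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    return {"pt": b1 & 0x7F, "seq": seq, "ts": ts, "ssrc": ssrc}


def find_stalls(packets, max_gap=1.0):
    """Return (last_seen, resumed, seq_step) for each pause in one stream.

    packets: iterable of (capture_time_seconds, udp_payload) in capture order.
    max_gap: hypothetical threshold in seconds above which a pause is reported.
    """
    stalls = []
    prev = None
    for t, payload in packets:
        hdr = parse_rtp_header(payload)
        if prev is not None:
            prev_t, prev_hdr = prev
            if t - prev_t > max_gap:
                # 16-bit sequence numbers wrap, so subtract modulo 65536.
                seq_step = (hdr["seq"] - prev_hdr["seq"]) % 65536
                stalls.append((prev_t, t, seq_step))
        prev = (t, hdr)
    return stalls


def make_packet(seq, ts, ssrc=0x1234, pt=0):
    """Build a synthetic RTP packet (hypothetical SSRC and payload type)."""
    return struct.pack("!BBHII", 0x80, pt, seq, ts, ssrc) + b"\x00" * 160


# Synthetic stream using the SEQ/TS values quoted above, times in seconds.
stream = [
    (0.0, make_packet(19079, 117477992)),
    (299.987632, make_packet(19080, 119877888)),
]
stalls = find_stalls(stream)
print(stalls)
```

A seq_step of 1 across a long pause means the sender produced nothing in
between; a larger step would instead point to packets that were sent but
not seen by the monitor.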
I will be scouring the code this afternoon, but if anyone has any good
ideas, I would be very grateful indeed. I'm sure the 5 minutes is a
significant piece of data.
Cheers
Tony
In article <hkeimd$kie$1 at softins.clara.co.uk>,
Tony Mountifield <tony at softins.clara.co.uk> wrote:
> I'm posting this to asterisk-dev because I am certain that I will need
> to dive into the code to identify and fix this problem. However, I'm not
> yet sure where to look, so would be grateful for some ideas!
>
> The system in question talks SIP to an ITSP and is installed on their
> LAN in a colocated rack. It also has a couple of SIP phones dial into it
> over the Internet.
>
> Every so often (once every week or two), the users complain of being cut
> off, and when dialling back in cannot hear any audio. This persists for
> several minutes before magically fixing itself.
>
> I have been running a continuous SIP trace using tcpdump -w for offline
> analysis, but this didn't show anything useful. I decided it was not
> practical to take tcpdump traces of the RTP streams, due to the volume
> of data involved.
>
> So I wrote a monitor for the RTP streams using libpcap, which would keep
> track of when the RTP streams started and stopped, and also look for
> anomalies in the timestamps and sequence numbers.
>
> The problem occurred again this morning, and what the monitor showed me
> was this:
>
> a) At about 300ms after 10:31:35, all the active RTP streams from
> Asterisk to the ITSP and the two SIP phones stopped simultaneously. The
> streams into Asterisk continued, and the ITSP started sending RTCP
> enquiries.
>
> b) For the next five minutes, as people tried calling, every stream out
> of Asterisk lasted exactly one packet before stopping.
>
> c) After about five minutes, the problem magically fixed itself and
> audio streams started to flow normally.
>
> There is nothing in /var/log/asterisk/full at the times in question; it
> has no entries between 10:31:18 and 10:31:44. I can't see anything
> relevant in the rest of the system either.
>
> So, why would all RTP streams stop at once?
>
> The underlying OS is CentOS 4.7.
>
> Zaptel is 1.2.27 with ztdummy compiled with USE_RTC.
>
> The version of Asterisk is 1.2.32 with some custom modifications. The
> most relevant modification might be that I have added the internal
> timing feature from https://issues.asterisk.org/view.php?id=5374
> However, I have included this in all systems for the last four years
> without any trouble till now.
>
> Some advice on this would be REALLY welcome, as I must fix it urgently.
>
> Cheers
> Tony
> --
> Tony Mountifield
> Work: tony at softins.co.uk - http://www.softins.co.uk
> Play: tony at mountifield.org - http://tony.mountifield.org
--
Tony Mountifield
Work: tony at softins.co.uk - http://www.softins.co.uk
Play: tony at mountifield.org - http://tony.mountifield.org