[Asterisk-video] IAX timestamps, presentation vs transmission time Re: [asterisk-dev] Video packetization proposal

Peter Grayson jpgrayson at gmail.com
Mon Jun 4 13:11:58 MST 2007

On 6/4/07, Steve Kann <stevek at stevek.com> wrote:
> Peter Grayson wrote:
> >
> > You are right that the obvious conversion would be trivial. However,
> > there is a semantic difference between RTP and IAX. The RTP timestamp
> > is the presentation time for the media. The IAX timestamp is the
> > transmission time. Presentation time and transmission time may or may
> > not be related. I really don't know. For asterisk, the transmission
> > time is important for dejittering the packet stream. There seems to be
> > an implicit assumption that presentation time is "upon receipt" which
> > is different than "when the packet says".
> I think that we can define this more clearly and come up with a workable
> definition.  In both cases, these times are relative, are they not?  And
> if they are relative, does it matter what their difference is, as long
> as it is constant?

They are relative, but I'm not sure their difference is really
constant. For example, a sender might derive the presentation time of
a video frame from the sampling time reported by the video capture
system. That would be very accurate. The transmission time, on the
other hand, may accumulate a lot of variable delay while the sender
encodes the frame. The encoding stage in particular can easily take a
non-constant amount of time, and there may be other stages in the
pipeline between capture and transmission that further separate the
nominal presentation time from the transmission time.

This problem may be further exacerbated by entities like
app_conference that reset the timestamps on outgoing packets.

> In iaxclient, as well as in most media presentation systems for
> real-time media, the goal is not to present frames "upon receipt", but
> as soon as possible, honoring as best as possible the inter-frame
> spacings.  This is the job of the jitterbuffer.

This kinda gets right to the point: how does the client know the
"real" presentation time of any particular video frame? In the case of
audio, the jitterbuffer knows the nominal inter-frame spacing (20ms)
and enforces the exact audio frame rate. Thus, for audio, the
presentation time is simply whenever the jitterbuffer releases the
frame.

With video, the situation is different. We don't worry about the video
output buffer under-running the way we do the audio output buffer. So
the jitterbuffer doesn't know the nominal video frame rate and, as a
consequence, releases video frames according to their transmission
times. This leaves an opening for audio and video to be presented out
of sync.
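
To make the asymmetry concrete, here is a toy sketch (plain Python,
not Asterisk code; the timestamp values are invented for
illustration) of audio being regenerated on a fixed 20ms grid while
video keeps its jittered transmission timestamps:

```python
AUDIO_FRAME_MS = 20

# Transmission timestamps as stamped by a sender whose encoder added
# variable delay (invented values; nominal 15fps would be 0, 66, 133, 200).
video_tx = [0, 71, 133, 205]

# Audio release times: the jitterbuffer enforces exact 20ms spacing.
audio_release = [n * AUDIO_FRAME_MS for n in range(4)]

# Video release times: just the transmission timestamps, jitter and all.
video_release = video_tx

# The per-frame offset from the nominal 15fps grid is what can put
# audio and video out of sync.
skew = [tx - n * 1000 // 15 for n, tx in enumerate(video_tx)]
print(audio_release)  # [0, 20, 40, 60]
print(skew)           # [0, 5, 0, 5]
```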

So even if we decided to use a 90kHz presentation-time-oriented
timestamp, there would still be a disparity between how audio and
video frames are stamped. I suppose this is a strong argument for
sticking with transmission-time-oriented timestamps and doing whatever
we can to make the iax <--> rtp translations as accurate as possible.
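
As a sketch of what that translation involves (the helper names here
are hypothetical, not functions from the Asterisk source), the 90kHz
<--> 1kHz mapping and its precision loss look like this:

```python
RTP_VIDEO_CLOCK = 90000  # RTP video clock rate, Hz
IAX_CLOCK = 1000         # IAX timestamps are milliseconds, i.e. 1kHz

def rtp_to_iax(rtp_ts):
    # Map a 90kHz RTP timestamp to a 1kHz IAX timestamp.
    # Integer division discards up to 89/90000 s (< 1ms) of precision.
    return rtp_ts * IAX_CLOCK // RTP_VIDEO_CLOCK

def iax_to_rtp(iax_ts):
    # Map a 1kHz IAX timestamp back to the 90kHz RTP clock.
    # This direction is exact (90 ticks per ms), but precision lost
    # going the other way cannot be recovered.
    return iax_ts * RTP_VIDEO_CLOCK // IAX_CLOCK

# One NTSC frame period is 3003 ticks at 90kHz (90000 / 29.97).
print(rtp_to_iax(3003))               # 33 -- true period is 33.3667ms
print(iax_to_rtp(rtp_to_iax(3003)))   # 2970, not 3003: ticks lost
```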

> In RTP, there may be a mechanism (I don't remember how this works) for
> transmitting frames with timestamps which don't (relatively) follow
> real-time -- this is used more in RTSP than in SIP, where you can ask an
> RTSP server to send you media, but send it at a multiple of real-time
> (2x as fast, 1/2 as fast), and Quicktime uses this in some cases
> (probably both for buffering, and for fast/forward, rewind, and other
> controls).
> >
> > Also, the RTP timestamp uses a 90kHz clock. The IAX timestamp is
> > measured in milliseconds which is effectively a 1kHz clock. There is a
> > rather large difference in precision between these two thus
> > information would be lost in RTP to IAX mappings and IAX obviously
> > does not have sufficient information to match RTP's precision in the
> > IAX to RTP mapping case. Does this matter? Seems like it is worth
> > consideration.
> That's a good question.  I think that it wouldn't make much of a
> difference, if a frame was actually presented +- 1/2 msec from another
> frame (in the video case), and in the audio case practically all codecs
> use integral numbers of milliseconds to sample.
> RTP already has to deal with this for the NTSC case (29.97fps with the
> 90khz clock).  IAX will need to deal with this for 15fps or 30fps, which
> _would_ divide neatly into a 90khz clock, but don't divide neatly into 1khz.
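
A quick numerical check of that divisibility point (plain Python,
illustrative only): 15fps divides the 90khz clock exactly, while on
the 1khz clock you either accumulate drift or accept per-frame
rounding.

```python
FPS = 15
RTP_CLOCK = 90000   # RTP video clock, Hz
IAX_CLOCK = 1000    # IAX millisecond clock, Hz

# At 90kHz the 15fps frame period is an integer number of ticks.
assert RTP_CLOCK % FPS == 0            # 6000 ticks per frame, exact

# At 1kHz the period is 66.67ms. Naively adding a truncated 66ms per
# frame loses 10ms for every second of video:
truncated_second = (IAX_CLOCK // FPS) * FPS
print(IAX_CLOCK - truncated_second)    # 10 (ms of drift per second)

# Stamping each frame from its index instead keeps the error to
# sub-millisecond rounding, with no accumulation:
stamps = [round(n * IAX_CLOCK / FPS) for n in range(FPS + 1)]
print(stamps[-1])                      # 1000: back on the true clock
```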

