[Asterisk-Dev] IAX I-D comments
Steve Kann
stevek at stevek.com
Wed Apr 27 15:03:05 MST 2005
Firstly, _THANKS SO MUCH_ for starting this! I think it's great, and I
think that keeping this up-to-date (and requiring that we update this
specification _first_ before changing or adding things in the code) will
be a great help to implementors, and can help improve the process greatly.
- My take on the IAX2 vs IAX business: I think the protocol should be
called IAX2 or IAX version 2.0 or something. Just because IAX(1) has
been deprecated for some time doesn't mean it's not still out there;
there are Linux distributions still shipping libiax, for example.
- IAX can use a well-known port, but it is not a requirement; several
introductory paragraphs (which, I imagine, are non-normative) seem to
imply that it is.
- "The bandwidth efficiency for other stream types is sacrificed for the
sake of individual voice calls." I'm not sure I would agree with this;
it uses equally low overhead for video streams, images, URLs, etc.
- "Meta frames are used for call trunking or video stream
transmission." (XXX check.)
- Security: unencrypted IAX is subject to a variety of DoS attacks, at
the very least. (It should be trivial to send INVAL and kill sessions,
for example).
- "Full frames are sent reliably, so all full frames require an
immediate acknowledgment upon receipt." Actually, it ought to work
similarly to TCP, where an acknowledgement may be sent up to 1 RTT after
receipt (delayed ACK), to enable more ACKs to be piggybacked
(implicitly). The delayed ack functionality should be described as a
SHOULD in the specification, since it is OK (but not optimal) for a peer
to send immediate, explicit ACK frames whenever it receives a full frame
(libiax2 does this presently).
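Something like this, as a rough sketch (all the names here are mine,
not the draft's):

    #include <stdbool.h>
    #include <stdint.h>

    /* Rough sketch of delayed ACKs.  When a full frame arrives, we
     * owe the peer an acknowledgement; we can piggyback it on the
     * next outbound full frame, or send an explicit ACK once ~1 RTT
     * has passed without one. */
    struct call_state {
        bool     ack_pending;
        uint32_t ack_deadline_ms;
    };

    static void note_full_frame_received(struct call_state *c,
                                         uint32_t now_ms, uint32_t rtt_ms)
    {
        if (!c->ack_pending) {
            c->ack_pending = true;
            c->ack_deadline_ms = now_ms + rtt_ms; /* at most 1 RTT later */
        }
    }

    /* Called periodically; true when an explicit ACK frame must be
     * sent because no outbound full frame carried the implicit
     * acknowledgement in time. */
    static bool explicit_ack_due(const struct call_state *c, uint32_t now_ms)
    {
        return c->ack_pending && now_ms >= c->ack_deadline_ms;
    }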
- "This 15-bit value specifies the call number the transmitting
client uses to identify this call. The source call number for an
active call MUST not be in use by another call on the same
client." -- should probably say "from the same peer" instead of "on the
same client" -- generally, we should avoid using "client" and "server",
I think.
- "IAX does not specify a retransmit timeout; this is
left to the implementor." -- we should probably specify with a
SHOULD how these timers should work.
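For example, the SHOULD could look something like this (the constants
here are made up, just to illustrate the shape of it):

    #include <stdint.h>

    /* One plausible retransmit schedule for unacknowledged full
     * frames: start near the measured RTT and back off exponentially
     * up to a cap.  None of these constants come from the draft. */
    static uint32_t retransmit_interval_ms(uint32_t rtt_ms, int attempt)
    {
        uint32_t interval = rtt_ms ? 2 * rtt_ms : 500; /* no RTT sample yet */

        while (attempt-- > 0 && interval < 10000)
            interval *= 2;                             /* cap at 10 seconds */
        return interval < 10000 ? interval : 10000;
    }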
- Timestamp
The Timestamp field contains a 32-bit timestamp maintained by an
IAX peer for a given call. The timestamp is an incrementally
increasing representation of the number of milliseconds since the
first transmission of the call.
Based on my experience, we really ought to write a bunch more about
timestamps, and how they work, as they are one of the trickiest areas to
get right in an implementation. I can definitely help with this.
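As a starting point, the basic computation is just this (sketch,
assuming a POSIX clock):

    #include <stdint.h>
    #include <sys/time.h>

    /* Sketch: the full-frame timestamp is the number of milliseconds
     * since the first transmission of the call.  call_start is
     * captured when the first frame of the call is sent. */
    static uint32_t call_timestamp_ms(const struct timeval *call_start)
    {
        struct timeval now;

        gettimeofday(&now, NULL);
        return (uint32_t)((now.tv_sec - call_start->tv_sec) * 1000 +
                          (now.tv_usec - call_start->tv_usec) / 1000);
    }

(The tricky parts are the things the draft doesn't cover yet: keeping
these monotonic when the wall clock jumps, reconciling them with the
codec's frame clock, the mini-frame wraparound below, and so on.)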
- Mini Frames are so named because their header is a minimal 4 octets.
Mini frames carry no control or signaling data; their sole purpose is
to carry a media stream on an already-established IAX call. They are
sent unreliably. This decision was made because VOIP calls typically
can miss several frames without significant degradation in call
quality while the incurred overhead in ensuring reliability increases
bandwidth requirements and decreases throughput. Further, because
voice calls are typically sent in real time, lost frames are too old
to be reintegrated into the audio stream by the time they can be
retransmitted.
Actually, mini frames can carry only audio stream data, not the
"media stream"; it would help to clarify this a bit. I think I might
skip the whole discussion of why they are sent unreliably -- but that's
just my opinion (well, this whole reply is mostly just my opinion).
- Timestamp
Mini frames carry a 16-bit timestamp, which is the lower 16 bits
of the transmitting peer's full 32-bit timestamp for the call.
The timestamp allows synchronization of incoming frames so that
they may be processed in chronological order instead of the
(possibly different) order in which they are received. The 16-bit
timestamp wraps after 65.536 seconds, at which point a full frame
SHOULD be sent to notify the remote peer that its timestamp has
been reset. A call must continue to send mini frames starting
with timestamp 0 even if acknowledgment of the resynchronization
is not received.
There's some subtlety here that comes into play when DTX (discontinuous
transmission) happens. Example:
o You're going along, sending mini frames from the beginning of the
call, for 30 seconds, and then you stop sending audio.
o 5 minutes pass, while you're not sending audio (perhaps you're just
listening silently to a conference call, waiting on hold, etc.).
o You then begin sending audio again; possibly sending a FULL voice
frame, and then miniframes.
In this case even if you send the full frame first, the receiver might
not receive it before it receives the next miniframe(s). In that case,
if the only means of updating the top 16 bits of the receiver's idea of
your timestamps is FULL voice frames, it's going to totally blow up
reconstructing the timestamps on your miniframes.
For this reason, what I've done in iaxclient/libiax2, and what I have a
patch in mantis to do, is to update the top 16 bits on all full frames,
and ensure that PING frames are sent every 10-30 seconds, in order to
ensure that when coming out of a silent state like this, the timestamps
on miniframes can be appropriately reconstructed.
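To make the problem concrete, the receiver-side reconstruction looks
roughly like this (sketch; names are mine):

    #include <stdint.h>

    /* Sketch: rebuild a full 32-bit timestamp from a mini frame's
     * 16-bit timestamp, using the most recent full 32-bit timestamp
     * we've seen for this call.  Note that this can only account for
     * a single 16-bit wrap; if the high bits haven't been refreshed
     * by a full frame for several minutes (the DTX case above), the
     * result is garbage. */
    static uint32_t rebuild_ts(uint32_t last_full_ts, uint16_t mini_ts)
    {
        uint32_t ts = (last_full_ts & 0xffff0000u) | mini_ts;

        if (ts < last_full_ts && (last_full_ts - ts) > 0x8000u)
            ts += 0x10000u; /* the 16-bit timestamp wrapped since then */
        return ts;
    }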
- Mini frames are implicitly defined to be of type 'voice frame'
(frametype 2; see Section 6). The subclass is implicitly defined by
the most recent full voice frame of a call (i.e. the subclass for a
voice frame specifies the codec used with the stream). The first
voice frame of a call should be sent using the codec agreed upon in
the initial codec negotiation. On-the-fly codec negotiation is
permitted by sending a full voice frame specifying the new codec to
use in the subclass field.
I think some note is in order describing the condition that can occur
when a codec changes but mini frames (with the new codec) arrive before
the full frame which specifies the new codec. In practice, I think that
the decoders will either (a) ignore the invalid data, or (b) produce a
short burst of gibberish when they receive it.
- Command Data
This 8-bit field specifies flags for options which apply to a
trunked call. The least significant bit of the field is the
'trunk timestamps' flag. A value of 0 indicates that the calls in
the trunk do not include their individual timestamps. A value of
1 indicates that the calls do each include their own timestamp.
All other bits are reserved for future use.
Because the only presently defined states for this field are 0x00 and
0x01, we could define the field to be either a bitmap (as you have), or
an 8 bit integer. I'm not sure it matters until other things use this
field, though.
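FWIW, under the bitmap reading, a receiver would just test the low bit
(trivial sketch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: the 'trunk timestamps' flag is the least significant
     * bit of the meta trunk frame's command data octet. */
    static bool trunk_has_timestamps(uint8_t command_data)
    {
        return (command_data & 0x01) != 0;
    }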
- Timestamp
Meta trunk frames carry a 32-bit timestamp, which represents the
actual time of transmission of the trunk frame. This is distinct
from the timestamps of the calls included in the trunk.
I think "actual time of transmission" should be replaced with "number of
milliseconds since the beginning of the trunk session" or something like
that. "actual time" seems to imply some relation to the time of day.
- IAX allows multiple media exchanges between the same 2 peers to be
multiplexed into a single trunk call. This decreases bandwidth
usage, as there are fewer total packets being transmitted. [...]
I'd add that this decreases the amount of overhead due to UDP, IP, and
underlying protocols (because otherwise, if you just look at the IAX
layer and above, trunking actually uses more bits).
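A quick back-of-the-envelope comparison makes the point (the meta trunk
header and per-call entry sizes below are from memory, so treat them as
assumptions to double-check against the draft):

    #include <stdio.h>

    int main(void)
    {
        const int calls      = 10;
        const int ip_udp     = 20 + 8; /* IPv4 + UDP headers */
        const int mini_hdr   = 4;      /* mini frame header */
        const int meta_hdr   = 8;      /* assumed meta trunk header */
        const int call_entry = 6;      /* assumed per-call entry w/ timestamps */

        int untrunked = calls * (ip_udp + mini_hdr);            /* 320 octets */
        int trunked   = ip_udp + meta_hdr + calls * call_entry; /*  96 octets */

        printf("header octets per 20ms interval: untrunked=%d trunked=%d\n",
               untrunked, trunked);
        return 0;
    }

So trunking spends more octets per call at the IAX layer (6 vs. 4
here), but sharing one IP/UDP header more than makes up for it.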
I think this description of trunking might better go right before the
description of the wire protocol for trunk frames?
Also, it should be clarified that there is _no_ negotiation in the
protocol for whether to use trunking, or the particular trunk mode, and
this must be done out-of-band (although, we could add something for
peers to advertise their trunking support in some kind of
IAX-capabilities IE at the beginning of the call). Presently, some IAX
implementations support trunking (chan_iax2), and some do not (libiax2),
while only CVS-HEAD supports trunk timestamps.
- 6.10 Comfort Noise Frame
The frame carries comfort noise.
The subclass is the level of comfort noise in -dBov.
Hmm, in this case, dead silence should be maxint, right? I've been
sending zero, which is a lot of noise :)
We should also specify that, after sending audio data, implementations
SHOULD (or maybe MUST) send a Comfort Noise Frame to indicate the end of
a sequence of voice frames (although Asterisk presently does not
comply with this).
| 0x12 | 18 | VNAK | Video/Voice retransmit request |
Is this correct? I thought VNAK was sent when a full frame is received
before a preceding full frame (i.e. when you're expecting sequence
number 2, and you receive something > 2). Yup, that's what seems to
happen as I see it in the code.
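i.e., roughly this (sketch, with made-up names; the half-window
comparison on the 8-bit counters is my assumption):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: decide whether an arriving full frame should trigger a
     * VNAK.  expected is the inbound sequence number we want next;
     * oseqno is the sequence number on the arriving frame.  Both are
     * 8-bit counters, so "skipped ahead" means within half the
     * modulo-256 window. */
    static bool should_send_vnak(uint8_t expected, uint8_t oseqno)
    {
        uint8_t delta = (uint8_t)(oseqno - expected);

        return delta != 0 && delta < 128; /* frame from the "future" */
    }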
- (ACK)
the sequence number counters, and return the same timestamp it
received. This allows the originating peer to determine to which
message the ACK is responding. Receipt of an ACK requires no
action.
Presently, while (both) implementations of IAX put the same timestamp in
ACK packets that they receive, it serves no real purpose to do so; the
sequence numbers in the ACK packet actually do the acknowledgement.
It would require less special-case code if ACK packets actually sent the
acker's timestamp instead of the sender's timestamp. The PONG frames
which return the sender's timestamp are useful, and are used to calculate
the round-trip-time of the network.
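e.g. (sketch):

    #include <stdint.h>

    /* Sketch: because a PONG echoes the PING's original timestamp,
     * the PING's sender can compute RTT entirely against its own
     * clock, with no cross-peer clock comparison needed. */
    static uint32_t rtt_from_pong(uint32_t echoed_ping_ts, uint32_t now_ts)
    {
        return now_ts - echoed_ping_ts;
    }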
This relates back to the discussion of ACKs above: all full frames must
be acknowledged _either_ by an explicit ACK _or_ by an implicit ack in
another full frame, and the acknowledgement (implicit or explicit)
should be sent within some timeframe (1 RTT, or something we should
determine).
8.11 LAGRQ
A LAGRQ is a lag request. It is sent to determine the lag between
2 IAX endpoints, including the amount of time used to process a
frame through a jitterbuffer (if any). It requires a clock-based
timestamp, and must be answered with a LAGRP, which must echo the
LAGRQ's timestamp. The lag between the 2 peers can be computed on
the peer sending the LAGRQ by comparing the timestamp of the LAGRQ
and the time the LAGRP was received.
I'd say we should really just deprecate LAGRQ. The present
implementation (or, last I looked) tried to send the LAGRQ through the
jitterbuffer on one end, and then the LAGRP through the jitterbuffer on
the other end. This often really broke things, because the LAGRP has
the wrong end's timestamp, and therefore, if the clocks between the two
sides have skewed, it just gave you nonsense results. Using the RR IEs
in PONGs is probably a better way to get the same information.
I think if we marked it as deprecated, and said that compliant
implementations should not send LAGRQ, and should acknowledge and then
ignore them if they are received, nothing would break (because it's just
used for display in "iax2 show channels", and even there, that command
would show zero lag if it never received a LAGRP).
- Protocol-Defined Information Elements:
It would be super convenient if this table also included the datatype
for each IE (i.e. uint8_t, uint16_t, string, etc..). OTOH, just looking
at iax2.h is easy enough :)
| 0x2f | 47 | RR LOSS | Received loss, as in rfc1889 |
It's important to mention that this IE actually contains two integers:
the first byte is a short-term loss percentage, and the low 24 bits are
a cumulative loss count (never mind, I see you get to that later -- cool :)
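In code, unpacking it looks like this (sketch):

    #include <stdint.h>

    /* Sketch: the RR LOSS IE packs two values into 32 bits -- the
     * high octet is a short-term loss percentage, and the low 24
     * bits are a cumulative count of lost frames. */
    static void parse_rr_loss(uint32_t value, uint8_t *loss_pct,
                              uint32_t *loss_count)
    {
        *loss_pct   = (uint8_t)(value >> 24);
        *loss_count = value & 0x00ffffffu;
    }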
- RR DELAY
The purpose of the RR DELAY information element is to
indicate the maximum playout delay for a call, per
rfc1889[3]. The data field is 2 octets long and specifies
the number of milliseconds a frame may be delayed before it
must be discarded.
Actually, RR_DELAY indicates the maximum playout delay that a frame
received by the peer is likely to experience before playout. (I'm not
sure this is in rfc1889; some of the other RRs aren't either. I
added them either because I thought they'd be useful, or because one of
the few people who commented on this stuff on asterisk-dev did.)
RR_DELAY is useful because, when you've received it, you can take
RR_DELAY, add it to the RTT, and get a good upper bound on the delay
between when audio is sent out to the network and when it's rendered at
the other end.
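i.e. (sketch):

    #include <stdint.h>

    /* Sketch: a conservative upper bound on the delay between sending
     * audio and it being rendered at the far end, using the measured
     * RTT and the peer's advertised RR DELAY (both in milliseconds). */
    static uint32_t playout_delay_bound_ms(uint32_t rtt_ms,
                                           uint16_t rr_delay_ms)
    {
        return rtt_ms + (uint32_t)rr_delay_ms;
    }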
| 0x00000002 | GSM Full Rate | 33 byte chunks of 160 samples or |
| | | 65 byte chunks of 320 samples |
Is this actually valid (sending MS-GSM 65 byte stuff)?
| 0x00000010 | G.726 | |
+------------+------------------+----------------------------------+
| 0x00000020 | IMA ADPCM | 1 byte per 2 samples |
G.726 is also 1 byte per 2 samples.
| 0x00000100 | G.729 | 20 bytes chunks of 172 samples |
G.729 is 20 bytes per 160 samples, no?
I'll have more comments later, I suspect, and also perhaps some additions.
-SteveK