[Asterisk-Users] FXO/FXS cpu spikes, data loss and ztclock.

qrss qrss at keitz.org
Tue Jun 21 13:06:58 MST 2005


>It might be possible to change the values slightly to judge their impact.
>I've not done the math, so not sure if changing the values has any real
>merit.

Yes, I think it does.  I'm definitely going to try some tweaks there.
For anybody interested, the reference document that I am using can
be found here:

http://www.silabs.com/public/documents/tpub_doc/dsheet/Wireline/Silicon_DAA/en/si3035.pdf

Pages 22-24 demonstrate how the sampling rate is determined for the DAA.

>I'm not a proficient programmer at all, but some experienced programmers
>use various profiling tools to help understand which routines are consuming
>cycles. It would seem like that could be used to help isolate the
>repetitive cpu spikes.

Yes, that sounds right on. Perhaps I'll try a little googling around to
see if anything looks useful.

>It's my understanding (which could be incorrect) that the clock on the TDM card
>is used for two purposes. First to drive the onboard chipset and second
>to generate an interrupt on a recurring basis. And, that same interrupt is
>used to "time" or "sync" other functions within asterisk. At least that
>has been the argument behind "do you have a zaptel timing device".  And
>it seems the TDM card is the only card that leaves something on the table.

I agree.  It looks to me like the Si3035 uses the crystal to provide
a clock source that determines the rate at which it samples the analog
signal and converts it to digital data in a serial format.  In theory,
you want the sample rate to be precisely 8000 samples per second, just
like a T1.  The serial samples appear to be passed on to the TJ320
chip, which buffers 8 samples per channel and then bus-masters that
whole chunk of 8 buffered samples across the PCI bus using DMA,
tucking it away into RAM somewhere.  The TJ320 then raises the
interrupt to request that the driver service that 8-sample chunk of
data.  Since it works with 8-sample chunks, only 1000 interrupts are
generated per second.  Now, if the zaptel driver has been configured
to use this card as its primary clock source, it will rotate and clock
data around at the same rate.  The clocking of the zaptel driver is
actually derived from the rate at which it receives data from the
TJ320 (or whatever other hardware source).  This is much the same idea
as how a CSU derives its clock from the telco network on a T1.
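The data path above implies some fixed ratios.  As a quick sanity
check, here is a small sketch of the arithmetic, assuming the ideal
8000 samples/s codec clock (the constants come from the description
above, not from the zaptel driver source):

```python
# Rates implied by the TDM data path described above, assuming an
# ideal 8000 samples/s codec clock.  A sketch, not driver code.

SAMPLE_RATE = 8000        # samples per second per channel, as on a T1
CHUNK = 8                 # samples buffered per channel before each DMA burst

sample_period_us = 1e6 / SAMPLE_RATE      # time between samples
interrupts_per_sec = SAMPLE_RATE / CHUNK  # one IRQ per 8-sample chunk
interrupt_period_ms = 1000 / interrupts_per_sec

print(sample_period_us)     # 125.0 (us)
print(interrupts_per_sec)   # 1000.0
print(interrupt_period_ms)  # 1.0 (ms)
```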

>So, is the missed data resulting from:
> 1. pcm data arriving too fast/slow on the card for the pci controller to
>    cause an interrupt and transfer the data across the bus reliably?

I don't think this is the case, based on the results of ztclock.
Assuming that ztclock is accurate, it (and zttest for that matter)
seems to report in most cases that samples are actually being received
faster than the expected 8000/second (or 8192 per 1.024 seconds).  In
my measurement...

483328 samples in 60.410900 sec. (483288 sample intervals) 99.991722%

...it looks like the card handed 483328 samples across the PCI bus in
the theoretical time it should have taken to receive only 483288
samples.  That's 40 more samples than theory predicts.  Since the card
actually transfers 8-sample chunks per interrupt, you can derive that
5 extra chunks arrived over the roughly 60.41 seconds of measuring
time.  That works out to one 8-sample chunk every ~12.08 seconds.
Coincidentally enough, this is almost precisely the timing of my CPU
spikes.
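For anyone who wants to check my math, here is the same arithmetic
spelled out (the variable names are mine; the figures are from the
ztclock run above, not from ztclock's source):

```python
# Reworking the ztclock numbers quoted above.

samples   = 483328      # samples the card actually delivered
intervals = 483288      # 8 kHz sample intervals that elapsed, per ztclock
elapsed_s = 60.410900   # wall-clock measuring time
CHUNK     = 8           # samples per DMA chunk / interrupt

extra_samples = samples - intervals              # 40 surplus samples
extra_chunks = extra_samples // CHUNK            # 5 surplus 8-sample chunks
secs_per_extra_chunk = elapsed_s / extra_chunks  # one every ~12.08 s
accuracy_pct = 100.0 * intervals / samples       # ~99.9917%
```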

> 2. too much time spent handling the interrupt within asterisk drivers
>    causing an interrupt to be missed (or delayed service)?

This seems more in line with what I'm suspecting: possibly too much
time being spent in the interrupt routines when this extra data chunk
arrives and possibly overwrites data in a buffer that has not yet been
serviced.  I'm not convinced it's the interrupt routine's fault,
however.  Clock sync could be the real culprit.  Take a couple of CSUs
connected across a T1 circuit that does not provide telco clocking...

  |CSU A|----------|CSU B|

... If both are set for internal clocking, they may work for a period
of time, but invariably the individual clocks will drift out of sync
and frame slips are likely to occur.  One CSU would periodically see
its buffers overrun, the other underrun.  The problem is not in the
way either CSU handles its buffers (I consider the zaptel driver to be
much in the same category as a CSU, just performing additional tasks);
the problem is simply unsynchronized clocking.  If either side is set
to clock from the network (effectively timing from the other end's
clock), the symptoms of the problem (slips) disappear on both sides.
The same scenario could apply here.
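The two-CSU picture can be sketched as a toy simulation: a producer
writing into a small elastic buffer at one clock rate while a consumer
drains it at a slightly different rate.  The rates and buffer depth
below are invented for illustration; they are not real CSU or zaptel
parameters:

```python
# Toy model of two free-running clocks sharing an elastic buffer.
# Rates and depth are made-up illustration values.

def count_slips(producer_hz, consumer_hz, seconds, depth=16):
    """Count buffer overruns/underruns when both ends free-run."""
    fill = depth // 2        # start with the elastic buffer half full
    consumed = 0
    slips = 0
    for n in range(1, int(producer_hz * seconds) + 1):
        fill += 1                        # producer writes one sample
        t = n / producer_hz              # wall-clock time so far
        target = int(consumer_hz * t)    # samples consumer should have read
        fill -= target - consumed
        consumed = target
        if fill > depth:                 # overrun: data overwritten
            slips += 1
            fill = depth
        elif fill < 0:                   # underrun: consumer starved
            slips += 1
            fill = 0
    return slips

# Matched clocks never slip; a 1 Hz mismatch (125 ppm) slips steadily
# once the buffer's headroom is used up.
print(count_slips(8000, 8000, 60))   # 0
print(count_slips(8001, 8000, 60))   # dozens of slips in one minute
```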

> 3. timing design conflicts between clocking the 3050 (pcm conversion)
>    versus interrupt requirements?

It would seem that the DAA clocking is directly related to the
interrupt timing with an 8:1 ratio.  That appears to be the design
objective, but I don't think we can rule out the possibility that
something is wrong with the hardware.

> 4. potential problems in the pci controller design?

Seems that the hardware is working correctly, but I'd still be
reluctant to rule something like this out.

>I would have to believe the clock is driving the pcm encoding function
>within the 3050 chip, and the design objective is to cause the chip to
>encode exactly 8,000 samples per second. Therefore, changing that
>clocking mechanism is likely to generate 7,990 or 8,010 samples (or
>some other non-standard rate) that is likely to negatively impact other
>asterisk functions (due to the reliance on the interrupts as a timing
>source). But, the flip side of that would suggest the existing design
>is running at some rate other than 8,000 samples/sec now.

This seems likely.  Even an expensive CSU working from internal timing
cannot come close to the accuracy of the clock sources used by the
telco.  That's why you get timing slips when your CSU is not set to
clock from the network.  It seems impossible that we will ever see a
perfect clock source from a small crystal oscillator tank circuit, but
the closer it is, the fewer the slips and the better our chances of
running a data application over a given period of time.  Even a
stratum 1 clock will slip about once every 4 months.  Here, though, I
think we are seeing slips every 2 or 3 seconds on average.
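For what it's worth, the slip rate measured earlier implies a rough
clock offset.  Assuming one surplus 8-sample chunk every ~12 seconds
(the figure from my ztclock run), the oscillator would be running fast
by roughly 83 parts per million.  A back-of-the-envelope sketch, not a
measured figure:

```python
# Clock offset implied by the measured surplus-chunk rate.
# Numbers come from the ztclock run quoted earlier in this message.

RATE = 8000                   # nominal sample rate, Hz
CHUNK = 8                     # samples per chunk/interrupt
secs_per_extra_chunk = 12.08  # one surplus chunk roughly every 12 s

extra_samples_per_sec = CHUNK / secs_per_extra_chunk
offset_ppm = 1e6 * extra_samples_per_sec / RATE

print(round(offset_ppm, 1))   # ~82.8 ppm fast
```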

One other important point: as long as everything (all channels) is
clocking off of the SAME imperfect clock, everything should be fine.
I am not convinced, however, that Asterisk is really doing this.  It's
only when data arrives and departs on two different clock sources that
slips and data loss occur.  That's why I'm wondering about the channel
bank idea.  Say 12 FXO and 12 FXS ports all timing from the channel
bank, with zaptel using that channel bank as its timing source in a
stand-alone system.  I'm thinking that kind of setup should be able to
bridge data calls through the Asterisk system without timing slips.
The key is that all ports on the channel bank actually share a common
hardware clock line.  Whether it's fast or slow, all channels stay in
sync.

>For the TDM card, there is no such thing as syncing its clock to anything
>since its handling incoming analog audio that contains no such info.

Agreed.  However, the card generates and samples at a clock rate
determined by its oscillator.  Asterisk then uses that clock to move
data across bridged channels, conferences, even VoIP.  In my opinion,
to avoid any data loss, all of these channels would have to clock from
the same source.  It seems that they do not, which I suspect is the
root cause of the MeetMe conference delay when bridging calls from
VoIP channels.

>I believe that is correct and was very likely one of the driving forces
>in the design of the TDM card (e.g., one interrupt handling four pstn
>lines as opposed to multiple x100p cards each with their own interrupt
>servicing requirements).

That seems logical to me.  However, if zaptel is timing from one
TDM400P, can a second TDM400P derive its timing either directly from
the first or from zaptel?  If not, then I cannot imagine bridging data
across two TDM400P cards without guaranteed data loss.

>I don't believe anyone has confirmed the cpu spikes are actually
>responsible for missed frames. At least I won't assume that for now.

>The T1 card is different since a properly configured card will sync its
>onboard clock with an external source that is considered highly accurate.
>When the clock is in sync, there is no such thing as missed pcm frames
>on a T1 card. But, I'm sure you've read the various postings from folks
>that did not properly define the card sync and those postings generally
>relate to audio clicks (and other disturbances) that are essentially the
>same apparent issues as a free-wheeling TDM clock.

Exactly.  In my assessment, the zaptel driver, when properly
configured, uses the network (T1 carrier) timing to clock its data
around the system from channel to channel.  However, if we add a
TDM400P with, say, 4 FXS ports to the mix, does the TDM400P then
derive its clock either directly from the T1 card or from the zaptel
driver?  If not, then data loss from frame slips is almost a
certainty.

>It's my opinion (which also could be incorrect) that running vmstat and
>ztclock is simply pointing out a symptom, and are probably not the right
>tools to identify the root cause. Note the same symptom exists on a
>600 MHz mobo as compared with a 3.0 GHz mobo, therefore the root cause
>appears to be more related to something happening after xxxx frames.

I think that the "something happening after xxxx frames" is a slip,
that is, a buffer overrun/underrun.

Here was my mode of thinking with that.  If ztclock is accurately
measuring the clock, then it is possible to determine the actual slip
rate when meshing data against a theoretically true clock source (T1,
or perhaps even VoIP, for example).  If these clocks are out of sync,
then data loss would naturally result from frame slips.  In fact, I
might even expect to see just the sort of CPU spike we are seeing
during such an event, as the code attempts to deal with lost or
overwritten data buffers.  I would expect, for example, that the code
was written assuming perfect timing sync.

The more folks on different hardware who try ztclock and vmstat 1 and
accurately predict their rate of CPU spikes, the more sure we can be
that we are truly chasing a frame slip scenario.  If accurate, the
test should show 100% against a properly configured T1, with no frame
slips calculated.  It should also be able to determine the number of
frame slips that would occur with a misconfigured T1 card connected to
the telco.  Likewise, an improperly configured T1 should generate CPU
spikes under certain conditions, if we are chasing the right problem.
I'd like to get to a zttest/ztclock that, when run, can tell us things
like "your timing source is definitely off" or "you are definitely
running a slow clock and possibly missing interrupts."  In my opinion,
zttest was not cutting it for that purpose.  It seemed to me that even
a slightly slow clock would show 100%, because the test assumed that
it took a whole 125 uSec clock tick to bring in a sample of data,
which according to the TJ spec sheet is not true; in fact, it appears
that 32 bits are clocked in about 1/3 of that cycle.  I also didn't
feel that the time resolution was adequate to detect real slips, which
might not occur for several seconds or minutes, so I stretched the
test out to account for that.  It was only after those changes that I
was able to see any direct relationship between the CPU spikes and the
clock speed.
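The stretched measurement idea can be sketched in a few lines: count
samples over a long window and compare the total against an
independent clock, instead of judging each 125 uSec tick on its own.
The card and clock below are simulated stand-ins of my own invention,
not the real zaptel device and system clock that zttest/ztclock use:

```python
# Long-window accuracy measurement in the style of a stretched zttest.
# read_chunk/now are stand-ins; the real tools read a zaptel device.

def measure(read_chunk, now, chunk=8, rate=8000, window_s=60.0):
    """Count samples over window_s and compare against the clock 'now'."""
    t0 = now()
    samples = 0
    while now() - t0 < window_s:
        read_chunk()             # block until one chunk has arrived
        samples += chunk
    elapsed = now() - t0
    expected = elapsed * rate
    accuracy = 100.0 * min(samples, expected) / max(samples, expected)
    return samples, elapsed, accuracy

class FakeCard:
    """Simulated card whose chunk interval is ~83 ppm short (fast clock)."""
    def __init__(self, interval_s):
        self.t = 0.0
        self.interval = interval_s
    def read_chunk(self):
        self.t += self.interval  # virtual time advances one chunk period
    def now(self):
        return self.t

card = FakeCard(0.001 * (1 - 82.8e-6))
samples, elapsed, accuracy = measure(card.read_chunk, card.now)
print(round(accuracy, 4))        # ~99.9917, matching the ztclock figure
```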

Overall, however, I am aware that all these tests were performed and
calculated only on my hardware with my kernels, etc., and for that
reason could be completely invalid in terms of the true problem.  I'm
also aware that I could be seeing less of "the truth" and more of what
"I'm looking for."  I'd really like to have other people look at the
relationship to determine whether we are really on the right track
here.  All of my theory rests on the accuracy of ztclock.

>Steve Underwood has made the comment that spandsp "did" work at one time
>on the TDM card, and spandsp is probably "the" most critical software
>that is dependent on absolutely no missed frames. If that is correct,
>that implies the problem is most likely associated with the zaptel
>drivers as those same original TDM cards don't work now (and obviously
>nothing has changed on those installed cards).

It could be that we have somehow introduced other timing sources to
zaptel that were not present before.  VoIP channels?  MeetMe
conferencing perhaps?  Just guesses, really.  However, for frame slips
to occur, data must arrive from two different clock sources, and the
conditions for that certainly seem right.

>Also, the TDM card has gone through several hardware revisions
>indicating the original design had multiple design short-comings.
>Considering that card was one of the first designs to be marketed
>by digium, and considering that T1 interface cards are much easier
>to design (since there is no analog component involved), it's highly
>probable the drivers do not exactly match the card's design objectives
>(left-hand right-hand scenario).

That could well be the case.  I'm curious what they have been changing
across so many revisions.  Could it be that they are playing with the
clock?  Improving component tolerances?  Changing some type of
firmware?  Is there a change log?  The hardware manufacturers that
supply the DAA publish change revisions to the reference design right
in their data sheet.  Does Digium provide anything similar?

At the end of the day, I'd like to be able to run multiple FXO/FXS
interfaces with data integrity.  I've got a hunch that this would work
with a channel bank interfaced to Asterisk via a T1 card; the timing
theory seems to add up.  I have serious reservations about expecting
the TDM400P to handle the task, even though it is advertised as a
channel bank replacement.  On the Digium website they say, "The
TDM400P takes the place of an expensive channel bank and brings the
system price point to a low level."  That's exactly what I'm looking
for in some cases.  Problem is, if it can't pass modem/fax calls, I'll
need something else at whatever price.  A hardware PBX certainly can
do it, but I'd really like to see it work with Asterisk.




