[asterisk-dev] Suggestion on Packet Loss Concealment Algorithm
Steve Underwood
steveu at coppice.org
Fri May 19 19:10:49 MST 2006
zuo bf wrote:
> Hi Steve,
>
> inline
>
> On 5/19/06, *Steve Underwood* <steveu at coppice.org
> <mailto:steveu at coppice.org>> wrote:
>
> ColinZuo at viatech.com.cn <mailto:ColinZuo at viatech.com.cn> wrote:
>
> > Hi,
> >
> > The theory is not based on the music; it's based on the one given by
> > ITU G.711 Appendix I (BTW: the music is converted to 8 kHz/mono/16-bit
> > by CoolEdit).
> >
> What works well for music is very different from what works well for
> voice.
>
>
> Yeah, but I don't think the difference is that big, unless you give me
> a voice file that proves me wrong.
> And again, I prolong it based on the theory given in G.711 Appendix I,
> which is said to be derived from experimentation at Bell.
Just because it's derived from Bell doesn't make it the word of God. For
example, the pitch search only goes down to 66Hz. The F0 of my voice can
go well below 50Hz, and the pitch is completely messed up. As I said,
for music a far slower decay characteristic works a lot better. Also,
windowing before the AMDF will give better temporal localisation of the
pitch estimate. This is pretty much a waste of time for voice, but it
helps stabilise the pitch for music, reducing the watery quality of the
synthetic sound on higher pitches. All good modern codecs do some form
of fractional pitch search to reduce wateriness in female (i.e. high
pitched) voices. This PLC algorithm does everything in whole samples. I
suspect the fractional pitch approach would noticeably help quality, but
at substantial computational expense.
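For readers following along, the AMDF pitch search under discussion can be
sketched roughly as follows. This is an illustrative Python sketch, not the
actual Asterisk/spandsp C code: the 66 Hz floor matches the figure mentioned
above, while the 400 Hz ceiling and the window length are my assumptions.

```python
import numpy as np

def amdf_pitch(history, fs=8000, f_min=66, f_max=400):
    """Estimate the pitch period (in samples) of recent speech using the
    Average Magnitude Difference Function (AMDF): the lag that minimises
    the mean |x[n] - x[n - lag]| over a trailing window."""
    x = np.asarray(history, dtype=np.float64)
    min_lag = fs // f_max          # shortest period searched
    max_lag = fs // f_min          # longest period searched (66 Hz floor)
    window = max_lag               # compare over one max-period-long window
    best_lag, best_amdf = min_lag, np.inf
    for lag in range(min_lag, max_lag + 1):
        d = np.mean(np.abs(x[-window:] - x[-window - lag:-lag]))
        if d < best_amdf:
            best_amdf, best_lag = d, lag
    return best_lag
```

Windowing the history before this search, as suggested above, would weight
the most recent samples more heavily and so localise the estimate in time.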
> G.711 Appendix 1 and my code fade to silence over 50ms. For music
>
> much greater sustain to fill in the gaps works much better. With
> speech,
> that badly affects intelligibility.
>
>
> I didn't change this. BTW, G.711 Appendix I fades to silence over 60 ms,
> because it doesn't attenuate during the first erasure, but your code
> does. Since you can't know whether the waveform was about to rise or
> fall, I think you'd better keep the same level for the first erasure.
Ah, I forgot about this. It's something that isn't very sane in Appendix
1, and I never went back to experiment with it. Several areas of Appendix 1
are very much oriented to 10ms packets. In the real world hardly anyone
uses 10ms packets. I suspect the decay rate should be different for 20ms
or 30ms packets, and that requires investigation.
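The attenuation schedule being debated here (hold the first erased frame at
full level, then fade linearly to silence, reaching zero at 60 ms, as zuo
describes the Appendix I behaviour) might be sketched as a simple gain
function. A hypothetical helper, with the hold and end times as parameters:

```python
def erasure_gain(t_ms, hold_ms=10.0, end_ms=60.0):
    """Gain applied to the synthesized signal t_ms into an erasure:
    unity for the first hold_ms, then a linear fade to silence at end_ms.
    Steve's variant, as described above, would use hold_ms=0, end_ms=50."""
    if t_ms <= hold_ms:
        return 1.0
    if t_ms >= end_ms:
        return 0.0
    return 1.0 - (t_ms - hold_ms) / (end_ms - hold_ms)
```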
> ////////////////////////////////////////////////////////////////////////////////////////////////
> G.711 Appendix I
> I.2.4 Synthetic signal generation for first 10 ms
> For the first 10 ms of the erasure, the best results are obtained by
> generating the synthesized signal
> from the last pitch period with no attenuation.
> /////////////////////////////////////////////////////////////////////////////////////////////////////
>
> I used the Appendix 1 approach
> without experimenting. I suspect something other than linear
> attenuation
> would behave better.
>
>
> From my experiments, I think that as long as the algorithm aims at
> generic linear concealment, you probably can't find one much better
> than this, unless you analyse some voice parameters from the previous
> samples.
Actually, there are rather better concealment algorithms, but they
require greater amounts of computation. Try a Google search. Several
people have reported results using LPC analysis and synthesis which seem
better, especially for longer erasures.
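As a rough illustration of the LPC-based direction mentioned here: such a
concealer fits a short-term linear predictor to the last good speech and
then extrapolates with it, rather than repeating raw waveform. A minimal
Levinson-Durbin sketch for fitting the predictor (illustrative only, not
taken from any particular PLC implementation):

```python
import numpy as np

def lpc_coeffs(x, order=10):
    """Solve the autocorrelation normal equations by Levinson-Durbin,
    returning the prediction filter A(z) = [1, a1, ..., a_order].
    The residual is e[n] = sum_j a[j] * x[n-j]."""
    x = np.asarray(x, dtype=np.float64)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] += k * a[:i + 1][::-1]
        err *= (1.0 - k * k)
    return a
```

A concealer would then run this filter as a synthesis filter, driven by a
pitch-periodic or noise excitation depending on voicing.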
>
> > And the current PLC algorithm is similar to the G.711 Appendix I one,
> > except:
> > 1. The pitch detection algorithm: G.711 Appendix I uses cross
> > correlation, but Asterisk uses AMDF, which is simpler and also
> > performs well.
> >
> Correct.
>
> > 2. The OLA window: G.711 updates the OLA window length when burst
> > loss occurs, but Asterisk doesn't.
> >
> Wrong. They both use the same OLA strategy - 1/4 pitch period overlap.
>
>
> G.711 will prolong the OLA window by 4 ms until it reaches 10 ms, but
> the Asterisk one doesn't?
>
> ////////////////////////////////////////////////////////////////////////////////////////////////
> G.711 Appendix I
> I.2.7 First good frame after an erasure
> At the first good frame after an erasure, a smooth transition is
> needed between the synthesized
> erasure speech and the real signal. To do this, the synthesized speech
> from the pitch buffer is
> continued beyond the end of the erasure, and then mixed with the real
> signal using an OLA. The
> length of the OLA depends on both the pitch period and the length of
> the erasure. For short, 10 ms
> erasures, a 1/4 wavelength window is used. For longer erasures the
> window is increased by 4 ms per
> 10 ms of erasure, up to a maximum of the frame size, 10 ms.
> ////////////////////////////////////////////////////////////////////////////////////////////////
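The window-growth rule in the I.2.7 text quoted above reduces to a short
formula. A hypothetical helper (the function name and ms-based arguments
are my choice, not from either implementation):

```python
def ola_length_ms(erasure_ms, pitch_period_ms):
    """OLA window length at the first good frame after an erasure, per
    the quoted I.2.7 rule: 1/4 wavelength for a 10 ms erasure, growing
    by 4 ms per additional 10 ms of erasure, capped at the 10 ms frame."""
    base = pitch_period_ms / 4.0
    extra_frames = max(0, (erasure_ms - 10) // 10)
    return min(10.0, base + 4.0 * extra_frames)
```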
>
> > 3. The nearby field of the first erasure: G.711 delays the output by
> > 3.75 ms to compensate for the probable loss, but Asterisk just uses
> > the symmetrical part before the loss to do the OLA. The one G.711
> > Appendix I utilized should be better, but it's not very important,
> > as human ears are quite tolerant of this.
> >
> That 3.75ms delay is so the Appendix 1 algorithm can do a 1/4 pitch
> period of OLA when erasure commences. However, it incurs lots of
> buffer
> copying when there are no lost packets. What my code does is time
> reverse the last 1/4 pitch period and OLA with that. It sounds nasty,
> but listening tests with speech showed it was very close to the
> sound of
> the G.711 appendix 1 algorithm, and improves efficiency a lot in the
> common case - no packets being lost.
>
>
> Yeah, the results are similar, but the difference is just the 3.75 ms
> delay. I didn't see more buffer copying than necessary; both algorithms
> save the same history (although G.711 keeps a longer one and delays the
> output by 3.75 ms).
> BTW: packet loss is very common, at least in China, and burst loss can
> last a very long time. For example, as the bandwidth between the two
> major carriers is very low, two users, one on each, will experience
> packet loss very often if they use the public internet rather than some
> softswitch network.
There is a lot more copying in the Appendix 1 algorithm. It not only
saves a copy of the audio. It has to rearrange the output buffer to be
delayed by 30 samples. When there are no erasures the difference in
compute requirements is substantial. Enough to make me rework the
algorithm to optimise the common case. If you don't think no erasures is
the common case you have real problems. :-)
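The time-reversal trick described above (cross-fading the time-reversed
last quarter pitch period into the synthetic signal, instead of delaying
the output as Appendix I does) might look roughly like this. The function
name and signature are mine; this is a sketch of the idea, not the spandsp
code:

```python
import numpy as np

def start_erasure_transition(history, pitch, synth):
    """Begin an erasure without any output delay: time-reverse the last
    1/4 pitch period of received speech and OLA it with the start of the
    synthetic signal, so the junction has no output-buffer shuffling."""
    ola = pitch // 4
    tail = history[-ola:][::-1]                      # reversed last 1/4 period
    ramp = np.linspace(0.0, 1.0, ola, endpoint=False)  # linear cross-fade
    out = np.array(synth, dtype=np.float64)
    out[:ola] = (1.0 - ramp) * tail + ramp * out[:ola]
    return out
```

The appeal, as explained above, is that nothing needs to happen until a
packet is actually lost, so the no-loss path stays cheap.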
In Southern China my experience has been of very very low packet loss,
and the full bandwidth of ADSL connections being available most of the
time. International comms can be more congested, but there is a lot of
local overcapacity. I don't know much about Northern China.
> > I prolong the pitch buffer to a maximum of 3 pitch periods, but
> > Asterisk only uses one, which saves memory but behaves badly under
> > burst loss.
> >
> For prolonged erasures G.711 Appendix 1 and my code act in exactly the
> same way. They linearly attenuate to zero over the first 50ms. In
> that
> period they repeat the last 1.25 pitch periods of real speech, with a
> quarter pitch period of overlap. When real speech restarts they
> both do
> a 1/4 pitch period of OLA, based on the last known pitch. The
> algorithms
> are identical beyond the initial 1/4 pitch period of OLA. Why would
> anyone want to save memory here? It only uses a small amount. The
> algorithmic changes were to reduce the buffer manipulation in the
> common
> case.
>
> > 4. Whether to prolong the pitch period during burst loss: G.711 Appendix
>
> Not the same.
>
> ////////////////////////////////////////////////////////////////////////////////////////////////
> G.711 Appendix I
> I.2.5 Synthetic signal generation after 10 ms
> If the next frame is also erased, the erasure will be at least 20 ms
> long and further action is required.
> While repeating a single pitch period works well for short erasures
> (e.g. 10 ms), on long erasures it
> introduces unnatural harmonic artifacts (beeps). This is especially
> noticeable if the erasure lands in
> an unvoiced region of speech, or in a region of rapid transition such
> as a stop. It was discovered by
> experimentation that these artifacts are significantly reduced by
> increasing the number of pitch
> periods used to synthesize the signal as the erasure progresses.
> Playing more pitch periods increases
> the variation in the signal. Although the pitch periods are not played
> in the order they occurred in the
> original signal, the resulting output still sounds natural. At 10 ms
> into the erasure the number of pitch
> periods used to synthesize the speech is increased to two, and at 20
> ms a third pitch period is added.
> For erasures longer than 20 ms no additional modifications to the
> pitch buffer are made.
> ////////////////////////////////////////////////////////////////////////////////////////////////
Actually, people complain that Appendix 1 PLC implementations also beep.
You'll find that improvements in that area are one of the main claims
for the LPC based PLC algorithms. I'd have to go back and check on this.
It's a while since I wrote the code. If I diverged from the Appendix 1
algorithm I must have done so for a good reason, like it simplified
something without noticeable impact on quality.
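The pitch-buffer growth rule from the I.2.5 text quoted above reduces to a
simple schedule. A hypothetical helper to make the timing explicit:

```python
def pitch_periods_used(erasure_ms):
    """Number of pitch periods in the synthesis buffer as an erasure
    progresses, per the quoted I.2.5 rule: one period for the first
    10 ms, two from 10 ms, three from 20 ms onward (no further growth)."""
    if erasure_ms < 10:
        return 1
    if erasure_ms < 20:
        return 2
    return 3
```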
> I think the documentation for my PLC code is missing from the Asterisk
>
> No, it's available in plc.h under asterisk/include. :)
>
> source code, but you can find it at
> http://www.soft-switch.org/spandsp-doc/plc_page.html
>
> Regards,
> Steve
>
As I said before you really have to try voice, and not music. It makes a
huge difference. If you try a continuous tone the PLC algorithm behaves
terribly, but that's another case nobody cares about. :-)
Regards,
Steve