[asterisk-dev] Help Solve the Mysterious Slowdown in Sip channel driver in Trunk!

Steve Murphy murf at digium.com
Fri Jun 27 18:44:58 CDT 2008


I am seeking help in solving a Mystery!!

It goes something like this:

In 1.4, the sip channel driver works fine.
In trunk, it's slower than a dog. An old, old
dog with a missing leg (or two).

ctooley has done tests, and the response
time to an INVITE seems to be the culprit.

I have done several experiments, all of which
are in the http://svn.digium.com/svn/asterisk/team/murf/mtxprof
branch:

(Oh, and I've unchecked the res_timing_pthread
in make menuselect!)

1. I upgraded the MTX_PROFILE capability to report each
lock's accumulated and average times separately. Previously,
a single counter was used for all locks performed within a
module. So, you had a set of per-module totals.

Results from this: the channel lock at the beginning
of __ast_answer() was taking a large amount of time to
acquire; the channel lock in the nandfds routine
(ast_waitfor_nandfds()) was also taking a while to get,
though not as long as __ast_answer()'s.
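
To give the flavor of the change, here's a minimal sketch of
the per-lock bookkeeping; the names here (lock_prof, etc.) are
hypothetical, not the actual mtxprof branch code:

    /* Hypothetical per-lock profiling record; the real mtxprof
       branch code differs in its details. */
    struct lock_prof {
        const char *name;            /* which lock, acquired where */
        unsigned int count;          /* times acquired */
        unsigned long long total_us; /* accumulated wait, in microseconds */
    };

    static void lock_prof_add(struct lock_prof *p, unsigned long long wait_us)
    {
        p->count++;
        p->total_us += wait_us;
    }

    /* reporting: average wait = total_us / count, kept per lock
       instead of one lump sum per module */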

2. I found a bug in the "core show locks" code, in that
the lock for the channel acquired in handle_request_invite
was not present in the lock_info data. I haven't found out
why yet, but in the meantime I jimmied the ast_channel_lock,
unlock, and trylock routines to log all channel lock
activity to the console.
While this generates pretty verbose output, it clearly shows
and explains the long lock acquisitions in __ast_answer and
the nandfds routine... in the case of __ast_answer, here's
what's going on:
   1. handle_request_invite gets channel lock after
      creating the channel.
   2. ast_pbx_start starts up a thread to run the pbx
      stuff on the channel.
   3. the target code, the routine pbx_thread, starts
      running before ast_pbx_start() returns; __ast_answer is
      called, but blocks at the ast_channel_lock() call
      until the parent thread unlocks the channel.
   4. At some point, the ast_pbx_start() routine returns,
      and handle_request_do() unlocks the channel.
   5. __ast_answer() gets the lock and continues.

   In the case of the nandfds routine, it's because the
   ast_queue_frame() routine sometimes beats it to the lock
   and makes it wait until it's finished.
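
   For the curious, the jimmied logging amounts to something
   like this sketch (the shape here is hypothetical; the branch
   actually modifies ast_channel_lock/unlock/trylock themselves):

   #include "asterisk/channel.h"
   #include "asterisk/time.h"
   #include "asterisk/logger.h"

   /* Time how long it takes to get a channel lock, and log it. */
   static void log_channel_lock(struct ast_channel *chan, const char *where)
   {
       struct timeval start = ast_tvnow(), end;
       long wait_us;

       ast_channel_lock(chan);
       end = ast_tvnow();
       wait_us = (end.tv_sec - start.tv_sec) * 1000000L
               + (end.tv_usec - start.tv_usec);
       ast_log(LOG_NOTICE, "%s: locked %s after %ld usec\n",
               where, chan->name, wait_us);
   }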

3. I added code to tell me exactly how long
   ast_pbx_start() takes to return, and also, more
   specifically, how long the underlying pthread call,
   pthread_create(), takes. It was the only big thing between
   the pbx_start operation and releasing the channel lock.
   And I hit paydirt: ast_pbx_start() takes a long time
   because pthread_create() takes a long time. What's a
   'long time'? Well, in 1.4, I timed pthread_create()
   at 25-60 microseconds, usually in the 30's and 40's.
   In the trunk version, I timed the pthread_create()
   call at anywhere from 8000 to 47000 microseconds.
   That's a pretty noticeable difference!
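
   The timing code itself is nothing fancy; here's a
   self-contained sketch of the idea using plain gettimeofday()
   (in the branch it lives in ast_pthread_create_stack() in
   utils.c):

   #include <pthread.h>
   #include <stdio.h>
   #include <sys/time.h>

   /* Wrap pthread_create() and report how long the call itself took. */
   static int timed_pthread_create(pthread_t *t, const pthread_attr_t *attr,
                                   void *(*start)(void *), void *arg)
   {
       struct timeval before, after;
       long usec;
       int res;

       gettimeofday(&before, NULL);
       res = pthread_create(t, attr, start, arg);
       gettimeofday(&after, NULL);

       usec = (after.tv_sec - before.tv_sec) * 1000000L
            + (after.tv_usec - before.tv_usec);
       fprintf(stderr, "pthread_create took %ld microseconds\n", usec);
       return res;
   }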

4. Wondering why pthread_create() would/could care which
   version of asterisk was executing it, I thought it might
   be helpful to see if the parent thread (the one running
   the handle_request_do() routine) might be overly active.
   So, I added call counters to the 9 routines that chan_sip
   runs via the sched utility. I detected no overly active
   routines that could explain run-away loading.
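
   The counters are as dumb as they sound; here's a hypothetical
   sketch (retrans_pkt is one of the real sched callbacks in
   chan_sip, but the counter name is mine):

   /* Hypothetical counting scheme: one static counter per sched
      callback, dumped later to spot a runaway routine. */
   static unsigned int retrans_pkt_calls; /* ...plus 8 more like it */

   static int retrans_pkt(const void *data)
   {
       retrans_pkt_calls++; /* not atomic, but fine for rough counting */
       /* ... existing retransmission logic ... */
       return 0;
   }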

5.  I read this on the web:
    http://www.ecn.wfu.edu/glibc-linux-archive/0003/0034.html
    in which it states:

     "That suggests that the scheduler is slow in getting 
      this thread up and running; probably because you have 
      swamped it with lots of other threads that are
      chewing up significant portions of their time quanta."

      While this could be a factor in general, I don't detect
      it in this situation. My machine is running with a totally
      idle asterisk. I then attack it with sipp running at the
      default rate of 10 calls/sec, .1 sec duration calls.
      Even with CDR logging, and verbose output to the console,
      top never sees asterisk at greater than .3% cpu, and
      the idle time is 97% to 100%.  
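
      (For reference, the sipp run is basically the stock UAC
      scenario; quoting the invocation from memory, something
      like:

      sipp -sn uac -r 10 -d 100 <asterisk-host>

      where -r is the call rate in calls/sec and -d the pause
      duration in milliseconds. Check your sipp's options; I
      may be misremembering the exact flags.)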

6. I marked everything as embedded, and turned on -pg
   profiling for Asterisk. You have to add a line by hand
   to the Makefile to get this to happen:
ASTLDFLAGS += -pg
   as well as adding -pg to the DEBUG macro:
DEBUG=-g3 -pg
   This revealed no excessive time spent anywhere. Might
   be better to do this on a stock trunk branch and run sipp
   for a long time at moderate speeds, to see if anything
   spikes usage.
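
   (If you try this: after a clean shutdown, the profile data
   lands in gmon.out in the current directory, and a plain

gprof asterisk gmon.out > profile.txt

   gives the flat profile. The binary name/path may differ on
   your install.)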



OK, and here's another clue: pthread_create() time isn't always
high in the current trunk version. While modules are busily loading,
I see the standard pthread_create time from the mid-20's
(microseconds) to the 60's, averaging in the 40's. Just after
Asterisk finishes loading modules and fires up the CLI, I see this:


[Jun 27 16:36:39] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 48 microseconds, avg=51 usec (33 times)
.........[Jun 27 16:36:39] WARNING[25212]: app_minivm.c:2399
load_config: Failed to load configuration file. Module activated with
default settings.
..[Jun 27 16:36:39] WARNING[25212]: app_meetme.c:5548 sla_load_config:
Invalid entry in sla.conf defined with no type!
[Jun 27 16:36:39] WARNING[25212]: app_meetme.c:5548 sla_load_config:
Invalid entry in sla.conf defined with no type!
............................... ]
Asterisk Ready.
[Jun 27 16:36:39] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 47 microseconds, avg=51 usec (34 times)
*CLI> 
[Jun 27 16:36:39] NOTICE[25212]: chan_sip.c:16417
handle_response_peerpoke: Peer 'polycom430' is now Reachable. (13ms /
4000ms)
[Jun 27 16:36:40] NOTICE[25212]: chan_sip.c:16417
handle_response_peerpoke: Peer 'murf-eyebeam' is now Reachable. (108ms /
4000ms)
[Jun 27 16:36:40] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 1627 microseconds, avg=96 usec (35 times)
[Jun 27 16:36:41] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 1601 microseconds, avg=138 usec (36 times)
[Jun 27 16:36:42] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 1560 microseconds, avg=177 usec (37 times)
[Jun 27 16:36:43] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 1561 microseconds, avg=213 usec (38 times)

Notice that just before the first CLI prompt, the pthread_create() call 
took just 47 microsec; but the next 4 calls jump to around 1600 usec!!!

(and it goes up from there!).

BTW, the last 4 pthread_create calls reported above were from the
dahdi channel driver, as I have 4 FXS and 4 FXO lines in that test
system.


When I fire off a second or so of sipp calls at this switch, I see
INVITE pthread_create calls like this:

[Jun 27 16:43:06] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 6966 microseconds, avg=386 usec (39 times)
[Jun 27 16:43:06] NOTICE[25212]: chan_sip.c:17975 handle_request_invite:
PBX is started -- took 98418 microseconds, avg=98418 usec (1 times)

[Jun 27 16:43:06] NOTICE[25212]: utils.c:1091 ast_pthread_create_stack:
pthread_create took 723 microseconds, avg=394 usec (40 times)
[Jun 27 16:43:06] NOTICE[25212]: channel.c:1172 channel_find_locked:
Failure, could not lock '0x879ded0' after 199 retries!
[Jun 27 16:43:06] NOTICE[25212]: channel.c:1172 channel_find_locked:
Failure, could not lock '0x879ded0' after 199 retries!
[Jun 27 16:43:06] NOTICE[25212]: channel.c:1172 channel_find_locked:
Failure, could not lock '0x879ded0' after 199 retries!
[Jun 27 16:43:06] NOTICE[25212]: channel.c:1172 channel_find_locked:
Failure, could not lock '0x879ded0' after 199 retries!
[Jun 27 16:43:06] NOTICE[25212]: chan_sip.c:17975 handle_request_invite:
PBX is started -- took 88206 microseconds, avg=93312 usec (2 times)

Normally, pbx_start time adds about 20 microsec of overhead to the
pthread_create call, but in the above, a huge amount of time was
added. Maybe it's all the call logging overhead; I don't know. It
just started showing up in the last few runs.

OK, I think I've shared all the little factoids about this
problem, and what I've done. Does anyone have any ideas?
I'm going to go back to solving easier puzzles and let this
simmer in my subconscious for a while. If anyone has any ideas
of what the trunk version is doing that would trigger this
slowdown, please, share it!


murf


-- 
Steve Murphy
Software Developer
Digium