[asterisk-dev] [Code Review] Append two container for dialogs to delete and for rtp timeout checks to replace all dialogs container in ao2 callback for dialogs_needdestroy

sst at sil.at sst at sil.at
Fri Sep 17 15:23:05 CDT 2010



> On 2010-09-17 14:22:49, David Vossel wrote:
> > For some reason I was under the impression this patch was against Trunk instead of 1.6.2.
> > 
> > This is not entirely my call to make, but I do not believe this has a high probability of being accepted for merger into 1.8, and virtually no probability of being merged into 1.6.2.
> > 
> > If the patch is targeted for merger only into trunk, solving the RTP container issue is easy as we have already identified where linking should occur.  Were you expecting this to go into 1.6.2?  If so, I'd begin by discussing this with Russell.
> 
> schmidts wrote:
>     sorry for the missunderstanding. as far as i have seen in trunk these part of chan_sip.c is nearly the same like in 1.6.2.
>     
>     i will make a patch and test it against trunk. and also i will talk to Russell. 
>     
>     As far as i have understood him, and atleast why i have started with this, this slow iterate could cause deadlocks in chan_sip.c
>     
>     thank you David
> 
> David Vossel wrote:
>     When there are issues with deadlocks occurring with this monitor loop, they usually occur outside of this thread because of invalid locking order.  This optimization may lessen the probability of a deadlock occurring since it reduces the number of dialogs traversed at a single time, but if invalid locking order exists this patch does nothing to help that... At least I can't think of any locking issues resolved by this patch, I could be wrong :)

i am not sure if its a deadlock or atleast just a lock which tooks too long, but i have seen the following problem with 1.6.2, live on my production system.
if the ao2 callback to iterate through all dialogs needs to much time, the ast_sched_wait timeout will sometimes be near to 0 ms, then ast_io_wait polls only 1 incoming message from the fd (if sip runs only on udp, no tcp involved) and then in the ast_sched_runq would be many messages send. 
after several seconds the ratio from reading incoming messages and sending messages will sometimes grows up to 1:100. This cause several events of unreachable poke peers and so on.
you can simple try this registering for example 10k peers on a normal 1.6.2 system. at a point the first user gets unrechable (in my tests, around 7k) then every further second more and more peers goes offline cause there is no time left to read all incoming messages, but send out much.

there are two parts which slow down the do_monitor loop, first is allready fixed by russell after a peers registered, the contact data is written into the astdb. the second part is this patch.
with both parts the ratio between reading and sending is overall 1:1 cause its fast enough.
i am with you, a real deadlock or lock happens on another place/thread, and this patch fixes nothing against this locks. but its a big different in waiting 40 usec for a pvt to come free or 2000 usec.


- schmidts


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/917/#review2741
-----------------------------------------------------------


On 2010-09-17 08:56:31, schmidts wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/917/
> -----------------------------------------------------------
> 
> (Updated 2010-09-17 08:56:31)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> Instead of iterate through all sip dialogs in the do_monitor function of chan_sip.c i created two separate containers which links to dialogs. 
> The container dialogs_needdestroy links to dialogs which are marked as needdestroy, so they should be unrefered from the dialogs container.
> The container dialogs_rtpcheck links to dialogs which have active rtp, vrtp or trtp and rtptimeout, rtpholdtimeout or rtpkeepalive activated.
> 
> normaly the container with needdestroy dialogs is empty so the callback takes only around 30 usec. even if there are more than 10 dialogs to be destroyed it takes around 200 usec. 
> the callback for the rtpcheck tooks around 200 usec with 30 concurrent calls and 700 usec if the timeout is reached to do a softhangup.
> 
> further i have found in check_rtp_timeout a possible locking situation. i am not sure if this would be a working solution cause it could happen that a call would be hung up several ms or also 1 sec later.
> 
> i think after speeding the ao2 callback up, its not necessary to wait 1000 ms, with 100ms dialogs would be faster released, the timeout fires even shorter and scheduled events which has been scheduled in a very short future time would be fired faster. see change 40 for this.
> 
> i am also not sure if its necessary to iterate through the containers on module_unload to unlink the entries. the dialogs itself would be destroyed after they are unrefered from the dialogs container. see change 42 for this.
>  
> 
> 
> This addresses bug 17912.
>     https://issues.asterisk.org/view.php?id=17912
> 
> 
> Diffs
> -----
> 
>   branches/1.6.2/channels/chan_sip.c 287304 
> 
> Diff: https://reviewboard.asterisk.org/r/917/diff
> 
> 
> Testing
> -------
> 
> i have tested this with sipp and 500 concurrent calls all with rtptimeout 10 s and the active time was 10,9s per call.
> also the 20k peers register test run through without any problem. (with the astdbpatch)
> 
>                                  Messages  Retrans   Timeout   Unexpected-Msg
>       INVITE ---------->         500       0
>          100 <----------         500       0                   0
>          200 <----------         500       0                   0
>          ACK ---------->         500       0
>        Pause [    40.0s]         500                           500
>          BYE ---------->         0         0         0
>          200 <----------         0         0                   0
> ------------------------------ Test Terminated --------------------------------
> 
> 
> ----------------------------- Statistics Screen ------- [1-9]: Change Screen --
>   Start Time             | 2010-09-11 23:12:32
>   Last Reset Time        | 2010-09-11 23:12:47
>   Current Time           | 2010-09-11 23:12:48
> -------------------------+---------------------------+--------------------------
>   Counter Name           | Periodic value            | Cumulative value
> -------------------------+---------------------------+--------------------------
>   Elapsed Time           | 00:00:00:964              | 00:00:15:994
>   Call Rate              |    0.000 cps              |   31.262 cps
> -------------------------+---------------------------+--------------------------
>   Incoming call created  |        0                  |        0
>   OutGoing call created  |        0                  |      500
>   Total Call created     |                           |      500
>   Current Call           |        0                  |
> -------------------------+---------------------------+--------------------------
>   Successful call        |        0                  |        0
>   Failed call            |       35                  |      500
> -------------------------+---------------------------+--------------------------
>   Call Length            | 00:00:10:970              | 00:00:10:637
> ------------------------------ Test Terminated --------------------------------
> 
> 
> Thanks,
> 
> schmidts
> 
>




More information about the asterisk-dev mailing list