[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Mon Nov 18 09:15:16 CST 2013

>> >> If you really are averse to the effect that saving lastms in the
>> >> database causes, why don't you just remove that column from your
>> >> realtime tables?  We've had this ability to remove columns that you
>> >> prefer not to save in dynamic realtime since 1.6.2, and it sounds
>> >> like this is, in effect, precisely what you'd prefer.
>> >>
>> >> -Tilghman
>> >>
>> >
>> > Can you help me understand what removing the column would do? I did not
>> see anything in the code that would stop the database update attempt if the
>> column did not exists. Would it not try, or would it just fail gracefully?
>>
>> The meta-code would simply drop that column from the UPDATE, if it doesn't
>> exist in the target table.
>
> Are we talking about something implemented in realtime_odbc, or also in realtime_mysql?
> And if it is the only column in the update, as in the case of a peer state change update?
> The generated update query on peer state change is 'update sipusers set lastms = [value] where name = [peername]'
It's implemented in both the ODBC and the MySQL realtime drivers.  In
both cases, however, the first key/value pair should always exist and
if it doesn't, both will fail:  the MySQL driver will emit an ERROR,
and the ODBC driver will attempt to execute what will be invalid SQL.
This is clearly a bug in the ODBC driver.  This situation probably
could be fixed in the realtime drivers to simply short-circuit and
cancel out the operation when no columns can be updated.

> BTW, thanks for taking the time to discuss this.
I still think the better option would be to control the rate at which
probes for peers are sent out, so their responses can be received and
processed on time.  Either of these present solutions are merely
attempting to code to the symptom, rather than fixing this underlying
problem.  However, moving the probes from the individual peer threads
to a single background thread is not an easy or simple change, which
is probably why you haven't attempted it.

I agree with Tilghman. The correct way of handling this would be to either offload the database queries to a separate thread, or multi-thread the entire system such that a blocking call to the database does not impact other request handling. This is exceptionally non-trivial to accomplish in chan_sip, as it has no concept of asynchronous callbacks and resuming operations.

This is a large part of why we went with a new architecture in the PJSIP stack, which doesn't suffer from this particular limitation (a thread pool is used for request/response handling, so responses that take significant time to process do not impact the handling of other responses). I understand that waiting for Asterisk 12/13 may not be an option and may require you to attempt to solve this in chan_sip; however, I have a feeling that such an effort would be an exceptionally difficult project.

Matt

While I agree from a long term perspective, I maintain the position that the value of having the peer state in the RT database is limited to a minority of users. With that behind said, a simple switch to turn off updates to the RT database updates on qualify state change, without affecting default behavior, is a good solution for 1.8. An imperfect solution is better than no solution and nobody is hurt if the default behavior does not change.

There are a couple factors that keep me from tackling the project of threading the updates, one of which is that I don't know the code well enough and I have limited programming experience, and the other is that I don't rely on the lastms data in any way, so it would be a lot of work for zero benefit. A selfish stance I know, but I still do not understand how realtime lastms data is used in other user's solutions. I only heard from one person that uses it to make routing decisions [Leandro Dardini]. The "right" solution is going to require some additional participation from someone that has a need for the data and a large scale deployment that has been impacted by the issue.

I have attached a patch (against 1.8.15.0) that I am using that addresses a few sip scalability and timer issues, this being one of them. The changes are:

1. Adds a configuration parameter 'rtlastms=yes|no' that can be used to turn off realtime database updates on qualify state change. It also gracefully omits the field lastms from register and expire realtime database updates.
2. Adds the ability to define the frequency of qualify pokes when a peer is offline at the general and peer level with the use of a parameter qualifynotok=[time in ms].
3. Fixes a bug where t1min is enforced on peers with qualify=no. I looked at the history of t1min and it was never intended to impact peers with qualify=no, that is what 'timert1=' is for. I also changed the sanity check on timer t1 to < 50ms.
4. Fixes a sanity check on timerB.

I would submit the patch to the issue tracker, but the feedback here has suggested it would not be merged.

Damon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20131118/f3fc13da/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: chan_sip.c.patch
Type: application/octet-stream
Size: 15356 bytes
Desc: chan_sip.c.patch
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20131118/f3fc13da/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sip.h.patch
Type: application/octet-stream
Size: 1993 bytes
Desc: sip.h.patch
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20131118/f3fc13da/attachment-0003.obj>