[asterisk-bugs] [JIRA] (ASTERISK-22563) Realtime database connections dropping

Fri Sep 20 07:23:03 CDT 2013

     [ https://issues.asterisk.org/jira/browse/ASTERISK-22563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Smithurst updated ASTERISK-22563:
-------------------------------------

    Attachment: SIGURG-strace.txt
                res_odbc.conf
                odbc.ini
    
> Realtime database connections dropping
> --------------------------------------
>
>                 Key: ASTERISK-22563
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-22563
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>    Affects Versions: 11.5.1
>            Reporter: Ben Smithurst
>            Severity: Minor
>         Attachments: odbc.ini, res_odbc.conf, SIGURG-strace.txt
>
>
> We are seeing a periodic problem with our database connections dropping, with a log message along these lines:
> {code}
> [2013-09-20 10:58:24] WARNING[35309] res_odbc.c: SQL Execute returned an error -1: 08S01: [MySQL][ODBC 5.1 Driver][mysqld-5.1.49-1~bpo50+1-log]Lost connection to MySQL server during query (97)
> {code}
> After finally catching it with strace (attached), it appears to be due to some SIGURG signals being received, interrupting the {{read}} system call.
> We are at a bit of a loss to know why this is happening, and are wondering if anyone has seen anything similar and/or can just shed any light on this.
> Our investigation so far has concluded that Asterisk sets up a no-op handler for SIGURG, with SA_RESTART, so the {{read}} should be restarted, and indeed it is, but only 3 times -- after that, the socket is closed with {{shutdown}} and then {{close}} (I believe this is inside libmysqlclient).  Even if the system call *wasn't* restarted, the libmysqlclient source suggests that it would retry the {{read}} anyway on an EINTR return value.
> A tcpdump also confirms that Asterisk is sending a query, and then closing the socket before data is even received back from the server - so it is not caused by a bad query.
> There appears to be no easy way to reproduce this; it is *very* intermittent.  We are testing a new server with various automated tests running 24/7, and on average it happens 2 or 3 times per day.  The problem seems to occur often, but not always, on an {{UPDATE sippeers}} query - I am not sure if that is relevant.  We are only using SIP, no other channel types (except Local).
> We are not averse to digging into the code to help solve this problem, however if anyone has any input on how the different threads interact in terms of what SIGURG is being used for, it might help us know where to look.  For example, would it be appropriate to block SIGURG using {{sigprocmask}} or similar while doing realtime database queries, or could that cause other problems?
> Finally, another oddity is that it takes Asterisk 5 seconds to reconnect (compare the timestamps on e.g. lines 19 and 48 of the strace).  This appears to be consistent; could it be due to the timeout on the futex call?
> Thanks for any input
> (internal ref AST-133)
