[asterisk-bugs] [JIRA] (ASTERISK-27615) Dialplan deadlock when connection to external SQL server is lost

Wed Jan 24 18:36:49 CST 2018

    [ https://issues.asterisk.org/jira/browse/ASTERISK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=241792#comment-241792 ] 

Richard Mudgett edited comment on ASTERISK-27615 at 1/24/18 6:35 PM:
---------------------------------------------------------------------

I've spun up a fresh instance with
unixODBC 2.3.5-1.el7
mysql-connector-odbc 5.2.5-6.el7
I've tried pooling=no and no pooling option at all in odbcinst.ini
I'm still having problems.

If I start networking on the SQL server while the dialplan is locked up, calls will start getting through again and I will see something like this.
{noformat}res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1{noformat}
If I restart networking and SQL at the same time (since SQL stops functioning when I kill the network), I instead see this as dialplan starts executing again.
{noformat}func_odbc.c:486 execute: SQL Exec Direct failed (-1)![SELECT COUNT(*) FROM blacklist_global WHERE cid_number='03a1']{noformat}

When bringing networking back up, lines which get/set with the function CDR() will sometimes do this.
{noformat}
res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
[Jan 24 23:31:30] WARNING[39]: res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
{noformat}

An interesting pattern, for the first call that dialplan processes after the SQL server goes offline everything will be fine.
{noformat}All of the gets with Set(var=${CDR(something)}) and the sets with SET(CDR(something)=${var}) will work.{noformat}
First call is ok all the way through.
{noformat}
    -- Executing [*97 at from-internal:1] Goto("PJSIP/103a1-00000024", "vm103,1") in new stack
    -- Goto (from-internal,vm103,1)
    -- Executing [vm103 at from-internal:1] Gosub("PJSIP/103a1-00000024", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
    -- Executing [s at subroutine-cdr:1] Set("PJSIP/103a1-00000024", "saverecpath=") in new stack
    -- Executing [s at subroutine-cdr:2] Set("PJSIP/103a1-00000024", "savebillable=") in new stack
    -- Executing [s at subroutine-cdr:3] ForkCDR("PJSIP/103a1-00000024", "e") in new stack
    -- Executing [s at subroutine-cdr:4] Set("PJSIP/103a1-00000024", "CDR(cdr_event_type)=11") in new stack
    -- Executing [s at subroutine-cdr:5] Set("PJSIP/103a1-00000024", "CDR(cdr_eventID)=58de7c34bfec4") in new stack
    -- Executing [s at subroutine-cdr:6] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_num)=103a1") in new stack
    -- Executing [s at subroutine-cdr:7] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_name)=x103 &") in new stack
    -- Executing [s at subroutine-cdr:8] Set("PJSIP/103a1-00000024", "CDR(cdr_record_file_location)=") in new stack
    -- Executing [s at subroutine-cdr:9] Set("PJSIP/103a1-00000024", "CDR(cdr_billable)=") in new stack
    -- Executing [s at subroutine-cdr:10] Return("PJSIP/103a1-00000024", "") in new stack
    -- Executing [vm103 at from-internal:2] StopMixMonitor("PJSIP/103a1-00000024", "") in new stack
    -- Executing [vm103 at from-internal:3] VoiceMailMain("PJSIP/103a1-00000024", "103 at voicemail,s") in new stack
  == Spawn extension (from-internal, vm103, 3) exited non-zero on 'PJSIP/103a1-00000024'
    -- Executing [h at from-internal:1] Set("PJSIP/103a1-00000024", "CDR(cdr_hangupcause)=16") in new stack
    -- Executing [h at from-internal:2] GotoIf("PJSIP/103a1-00000024", "0?txfax") in new stack
    -- Executing [h at from-internal:3] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
    -- Executing [h at from-internal:4] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
    -- Executing [h at from-internal:5] GotoIf("PJSIP/103a1-00000024", "0?queue") in new stack
    -- Executing [h at from-internal:6] Goto("PJSIP/103a1-00000024", "done") in new stack
    -- Goto (from-internal,h,28)
    -- Executing [h at from-internal:28] NoOp("PJSIP/103a1-00000024", "") in new stack
{noformat}
All calls after the first will start to hang at the first get var=CDR()
{noformat}
    -- Executing [*97 at from-internal:1] Goto("PJSIP/103a1-00000025", "vm103,1") in new stack
    -- Goto (from-internal,vm103,1)
    -- Executing [vm103 at from-internal:1] Gosub("PJSIP/103a1-00000025", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
---presumably it is hanging in line 1 of this subroutine which is Set(saverecpath=${CDR(cdr_record_file_location)})
{noformat}

I can start 3 calls which execute the voicemail dialplan as long as I don't hang any of them up.
When I hang them up, the first call to be hung up finishes the hangup handler but the rest stop after the first line
{noformat}
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: channel.c:2590 ast_softhangup_nolock: Soft-Hanging (0x80) up channel 'PJSIP/103a1-0000002d'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx_variables.c:381 ast_str_retrieve_variable: Result of 'HANGUPCAUSE' is '16'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx.c:2914 pbx_extension_helper: Launching 'Set'
    -- Executing [h at from-internal:1] Set("PJSIP/103a1-0000002d", "CDR(cdr_hangupcause)=16") in new stack
--the next lines are not not in the verbose debug log for the other calls being hung up, I think they are still stuck in the first line or maybe it can't finalise the CDR--
[Jan 25 00:13:11] DEBUG[39]: cdr.c:1296 cdr_object_finalize: Finalized CDR for PJSIP/103a1-0000002d - start 1516839184.775040 answer 1516839184.804170 end 1516839191.362747 dispo ANSWERED
same => n,GotoIf($["${CDR(cdr_event_type)}" = "2"]?txfax)
{noformat}
All of this seems to suggest that the first instance of a set CDR()= blocks all gets and sets thereafter.
{noformat}exten => h,1,Set(CDR(cdr_hangupcause)=${HANGUPCAUSE}){noformat}

I recompiled asterisk with BETTER_BACKTRACES and DEBUG_THREADS, but I still can not get ast_coredumper to write anything other than headers to the 4 files it creates.
There I can see in the verbose debug log that is any different from things that work, all I see is missing lines because dialplan stops working.

I can try to get some logs for you and figure out how to dump the core, but I think the quick fix for us right now might be to implement an HTTP API instead of ODBC for our CDR and dialplan func_odbc queries.

was (Author: fortytwo):
I've spun up a fresh instance with
unixODBC 2.3.5-1.el7
mysql-connector-odbc 5.2.5-6.el7
I've tried pooling=no and no pooling option at all in odbcinst.ini
I'm still having problems.

If I start networking on the SQL server while the dialplan is locked up, calls will start getting through again and I will see something like this.
{code}res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1{code}
If I restart networking and SQL at the same time (since SQL stops functioning when I kill the network), I instead see this as dialplan starts executing again.
{code}func_odbc.c:486 execute: SQL Exec Direct failed (-1)![SELECT COUNT(*) FROM blacklist_global WHERE cid_number='03a1']{code}

When bringing networking back up, lines which get/set with the function CDR() will sometimes do this.
{code}
res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
[Jan 24 23:31:30] WARNING[39]: res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
{code}

An interesting pattern, for the first call that dialplan processes after the SQL server goes offline everything will be fine.
{code}All of the gets with Set(var=${CDR(something)}) and the sets with SET(CDR(something)=${var}) will work.{code}
First call is ok all the way through.
{code}
    -- Executing [*97 at from-internal:1] Goto("PJSIP/103a1-00000024", "vm103,1") in new stack
    -- Goto (from-internal,vm103,1)
    -- Executing [vm103 at from-internal:1] Gosub("PJSIP/103a1-00000024", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
    -- Executing [s at subroutine-cdr:1] Set("PJSIP/103a1-00000024", "saverecpath=") in new stack
    -- Executing [s at subroutine-cdr:2] Set("PJSIP/103a1-00000024", "savebillable=") in new stack
    -- Executing [s at subroutine-cdr:3] ForkCDR("PJSIP/103a1-00000024", "e") in new stack
    -- Executing [s at subroutine-cdr:4] Set("PJSIP/103a1-00000024", "CDR(cdr_event_type)=11") in new stack
    -- Executing [s at subroutine-cdr:5] Set("PJSIP/103a1-00000024", "CDR(cdr_eventID)=58de7c34bfec4") in new stack
    -- Executing [s at subroutine-cdr:6] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_num)=103a1") in new stack
    -- Executing [s at subroutine-cdr:7] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_name)=x103 &") in new stack
    -- Executing [s at subroutine-cdr:8] Set("PJSIP/103a1-00000024", "CDR(cdr_record_file_location)=") in new stack
    -- Executing [s at subroutine-cdr:9] Set("PJSIP/103a1-00000024", "CDR(cdr_billable)=") in new stack
    -- Executing [s at subroutine-cdr:10] Return("PJSIP/103a1-00000024", "") in new stack
    -- Executing [vm103 at from-internal:2] StopMixMonitor("PJSIP/103a1-00000024", "") in new stack
    -- Executing [vm103 at from-internal:3] VoiceMailMain("PJSIP/103a1-00000024", "103 at voicemail,s") in new stack
  == Spawn extension (from-internal, vm103, 3) exited non-zero on 'PJSIP/103a1-00000024'
    -- Executing [h at from-internal:1] Set("PJSIP/103a1-00000024", "CDR(cdr_hangupcause)=16") in new stack
    -- Executing [h at from-internal:2] GotoIf("PJSIP/103a1-00000024", "0?txfax") in new stack
    -- Executing [h at from-internal:3] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
    -- Executing [h at from-internal:4] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
    -- Executing [h at from-internal:5] GotoIf("PJSIP/103a1-00000024", "0?queue") in new stack
    -- Executing [h at from-internal:6] Goto("PJSIP/103a1-00000024", "done") in new stack
    -- Goto (from-internal,h,28)
    -- Executing [h at from-internal:28] NoOp("PJSIP/103a1-00000024", "") in new stack
{code}
All calls after the first will start to hang at the first get var=CDR()
{code}
    -- Executing [*97 at from-internal:1] Goto("PJSIP/103a1-00000025", "vm103,1") in new stack
    -- Goto (from-internal,vm103,1)
    -- Executing [vm103 at from-internal:1] Gosub("PJSIP/103a1-00000025", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
---presumably it is hanging in line 1 of this subroutine which is Set(saverecpath=${CDR(cdr_record_file_location)})
{code}

I can start 3 calls which execute the voicemail dialplan as long as I don't hang any of them up.
When I hang them up, the first call to be hung up finishes the hangup handler but the rest stop after the first line
{code}
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: channel.c:2590 ast_softhangup_nolock: Soft-Hanging (0x80) up channel 'PJSIP/103a1-0000002d'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx_variables.c:381 ast_str_retrieve_variable: Result of 'HANGUPCAUSE' is '16'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx.c:2914 pbx_extension_helper: Launching 'Set'
    -- Executing [h at from-internal:1] Set("PJSIP/103a1-0000002d", "CDR(cdr_hangupcause)=16") in new stack
--the next lines are not not in the verbose debug log for the other calls being hung up, I think they are still stuck in the first line or maybe it can't finalise the CDR--
[Jan 25 00:13:11] DEBUG[39]: cdr.c:1296 cdr_object_finalize: Finalized CDR for PJSIP/103a1-0000002d - start 1516839184.775040 answer 1516839184.804170 end 1516839191.362747 dispo ANSWERED
same => n,GotoIf($["${CDR(cdr_event_type)}" = "2"]?txfax)
{code}
All of this seems to suggest that the first instance of a set CDR()= blocks all gets and sets thereafter.
{code}exten => h,1,Set(CDR(cdr_hangupcause)=${HANGUPCAUSE}){code}

I recompiled asterisk with BETTER_BACKTRACES and DEBUG_THREADS, but I still can not get ast_coredumper to write anything other than headers to the 4 files it creates.
There I can see in the verbose debug log that is any different from things that work, all I see is missing lines because dialplan stops working.

I can try to get some logs for you and figure out how to dump the core, but I think the quick fix for us right now might be to implement an HTTP API instead of ODBC for our CDR and dialplan func_odbc queries.

> Dialplan deadlock when connection to external SQL server is lost
> ----------------------------------------------------------------
>
>                 Key: ASTERISK-27615
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27615
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: CDR/cdr_adaptive_odbc, Functions/func_cdr
>    Affects Versions: 14.6.0
>            Reporter: Jared Hull
>            Assignee: Unassigned
>            Severity: Critical
>
> We have a cluster of SQL servers for CDR and realtime states that are used in dialplan. Recently we had a single SQL server lose network connectivity, and all Asterisk instances which used this server as their primary started to hang in dialplan.
> If I stop the SQL service while leaving the server pingable, Asterisk will continue to work and simply return a few errors when the CDR is committed to the database.
> {code}
> res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
> cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
> {code}
> If I 'service network stop' on the SQL server to simulate network failure, asterisk stops executing dialplan related to func_odbc and cdr_adaptive_odbc. It is as if it still thinks the SQL connection is there, and refuses to failover to another DSN in the case of func_odbc. cdr_adaptive_odbc doesn't even have failover connections (this would be a very useful feature) so I don't know what can be done about that, other than to skip CDR and throw an error.
> Dialplan to reproduce:
> {code}
> exten => 101,1,noop(${CDR(anything)})
> exten => 102,1,noop(${ODBC_blacklist_global(42)})
> {code}
> Example of cdr_adaptive_odbc.conf entry:
> {code}
> [default]
> connection=dev-dallas-sql1
> table=cdr_event_log
> {code}
> Example of func_odbc.conf entry:
> {code}
> [blacklist_global]
> dsn=dev-dallas-sql1,dev-dallas-sql2,dev-dallas-sql3
> readsql=SELECT COUNT(*) FROM blacklist_global WHERE cid_number='${SQL_ESC(${ARG1})}'
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)