[asterisk-bugs] [JIRA] (ASTERISK-28888) res_corosync: causes asterisk crash in huge distributed environment.

Mon Jun 22 13:12:25 CDT 2020

    [ https://issues.asterisk.org/jira/browse/ASTERISK-28888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=251200#comment-251200 ] 

Friendly Automation commented on ASTERISK-28888:
------------------------------------------------

Change 14602 merged by Kevin Harwell:
res_corosync: Fix crash in huge distributed environment.

https://gerrit.asterisk.org/c/asterisk/+/14602

> res_corosync: causes asterisk crash in huge distributed environment.
> --------------------------------------------------------------------
>
>                 Key: ASTERISK-28888
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28888
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_corosync
>    Affects Versions: 13.22.0
>         Environment: FreePBX 14
>            Reporter: Università di Bologna - CESIA VoIP
>            Assignee: Unassigned
>            Severity: Minor
>              Labels: patch
>         Attachments: res_corosync.diff
>
>
> The VoIP infrastructure of the University of Bologna is a distributed system based on FreePBX. It currently consists of 8 FreePBX servers running a custom, internally developed module that implements high availability and keeps the PBXes synchronized. We have about 5300 SIP identities, and this number will grow to 8000 in the coming months.
> We use the res_corosync Asterisk module to synchronize device states and MWI states across the PBXes, but because of the large number of SIP identities in our system, we ran into several problems.
> We developed a patch for the res_corosync module with the changes needed to make it work in a huge distributed environment:
> 1) Fix memory leaks
> Added code to release the ast_events extracted from corosync and stasis messages.
> 2) Clean the stasis cache when a member of the corosync cluster leaves the group
> Added code that removes, from the stasis cache of the members remaining in the group, all messages carrying the EID of the member that left.
> If the device states of the departed member remain in the stasis cache of the other members, they are never updated again, and high-priority cached values such as BUSY take precedence over the current device states.
> 3) Stop corosync event propagation when the node is not joined to the group
> Updated the dispatch_thread_handler code to detect when Asterisk is not joined to the corosync group, and added conditions to the publish_event_to_corosync code so that corosync messages are sent only when joined.
> When a node is not joined, its corosync daemon cannot send messages: the cpg_mcast_joined function appends new messages to the FIFO buffer until it is full and then blocks indefinitely.
> In this scenario, if the stasis_message_cb callback, registered by res_corosync to handle stasis messages, tries to send a corosync message, the stasis thread-pool thread is blocked until the node joins the corosync cluster.
> This is still a work in progress, as we have not solved all the issues. In a huge distributed environment like ours, some problems still occur occasionally:
> 1) When delivery of a device state to the corosync group fails without a cluster membership change, that device state is not propagated to the other PBXes until it changes again or a node joins the cluster.
> 2) The cpg_mcast_joined function of the corosync library blocks the calling thread until the message is delivered to the local corosync daemon. Under some circumstances, the local corosync daemon is unable to receive the message from the corosync library, and the calling thread blocks indefinitely. The cpg_mcast_joined call is inside a critical section guarded by a lock, and the same lock protects the code in res_corosync that reinitializes the corosync library's connection to the corosync daemon. As a result, while a thread is blocked inside cpg_mcast_joined, res_corosync cannot detect corosync daemon failures or reinitialize the connection. It is also impossible to unload the res_corosync module, because the blocked thread keeps the module's shared library locked.
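
The "send only when joined" rule from change 3 and the locking concern in open issue 2 can be illustrated with a minimal sketch. This is not the res_corosync code or the attached patch: cluster_link, cluster_link_publish and fake_send are hypothetical names, and fake_send merely stands in for a wrapper around cpg_mcast_joined. The sketch assumes the joined flag is the only state the lock has to protect on the publish path, so the lock can be released before the potentially blocking send.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport hook; in a real module this would wrap the
 * corosync send call. Returns 0 on success, non-zero on failure. */
typedef int (*send_fn)(const void *buf, size_t len);

enum publish_result {
    PUBLISH_SENT,
    PUBLISH_SKIPPED_NOT_JOINED,
    PUBLISH_FAILED
};

struct cluster_link {
    pthread_mutex_t lock;   /* protects joined and skipped only */
    bool joined;            /* updated by the (hypothetical) dispatch thread */
    unsigned long skipped;  /* events not sent because the node was not joined */
    send_fn send;
};

void cluster_link_init(struct cluster_link *link, send_fn send)
{
    pthread_mutex_init(&link->lock, NULL);
    link->joined = false;
    link->skipped = 0;
    link->send = send;
}

/* Called when group membership changes. */
void cluster_link_set_joined(struct cluster_link *link, bool joined)
{
    pthread_mutex_lock(&link->lock);
    link->joined = joined;
    pthread_mutex_unlock(&link->lock);
}

enum publish_result cluster_link_publish(struct cluster_link *link,
                                         const void *buf, size_t len)
{
    bool joined;

    /* Hold the lock only long enough to read the membership state. */
    pthread_mutex_lock(&link->lock);
    joined = link->joined;
    if (!joined) {
        link->skipped++;
    }
    pthread_mutex_unlock(&link->lock);

    if (!joined) {
        return PUBLISH_SKIPPED_NOT_JOINED;
    }

    /* The lock is released here, so a stalled send does not prevent
     * another thread from taking the lock to update the state. */
    return link->send(buf, len) == 0 ? PUBLISH_SENT : PUBLISH_FAILED;
}

/* Stub transport used only to exercise the sketch. */
int fake_send_calls = 0;

int fake_send(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    fake_send_calls++;
    return 0;
}
```

With a link initialized through cluster_link_init(&link, fake_send), a publish before cluster_link_set_joined(&link, true) is skipped and counted, and a publish afterwards reaches the stub. The sketch only keeps the lock out of the blocking call; the calling thread can still stall inside the send itself, and a node can leave the group between the check and the send, so it does not address the indefinite block described in issue 2.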

--
This message was sent by Atlassian JIRA
(v6.2#6252)