[asterisk-bugs] [JIRA] (ASTERISK-28888) res_corosync: causes asterisk crash in huge distributed environment.

Asterisk Team (JIRA) noreply at issues.asterisk.org
Tue May 12 09:34:25 CDT 2020


     [ https://issues.asterisk.org/jira/browse/ASTERISK-28888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asterisk Team updated ASTERISK-28888:
-------------------------------------

    Assignee: Asterisk Team  (was: Università di Bologna - CESIA VoIP)
      Status: Triage  (was: Waiting for Feedback)

> res_corosync: causes asterisk crash in huge distributed environment.
> --------------------------------------------------------------------
>
>                 Key: ASTERISK-28888
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28888
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_corosync
>    Affects Versions: 13.22.0
>         Environment: FreePBX 14
>            Reporter: Università di Bologna - CESIA VoIP
>            Assignee: Asterisk Team
>            Severity: Minor
>         Attachments: res_corosync.diff
>
>
> The VoIP infrastructure of the University of Bologna is a distributed system based on FreePBX. It currently consists of 8 FreePBX servers running a custom, internally developed module that implements high availability and keeps the PBXes synchronized. We have about 5300 SIP identities, and that number will grow to 8000 in the coming months.
> We use the res_corosync Asterisk module to synchronize device states and MWI states across the PBXes, but because of the large number of SIP identities in our system we ran into some problems.
> We developed a patch for the res_corosync module with the changes needed to make it work in a large distributed environment.
> 1) Fix memory leaks
>    Added code to release the ast_events extracted from corosync and stasis messages.
> 2) Clean the stasis cache when a member of the corosync cluster leaves the group
>    Added code that, on the members remaining in the group, removes from the stasis cache all messages carrying the EID of the member that left.
>    If the device states of the departed member stay in the stasis cache of the other members, they are never updated again, and high-priority cached values such as BUSY take precedence over the current device states.
> 3) Stop corosync event propagation when the node is not joined to the group
>    Updated the dispatch_thread_handler code to detect when Asterisk is not joined to the corosync group, and added conditions to publish_event_to_corosync so that corosync messages are sent only when the node is joined.
>    When a node is not joined, its corosync daemon cannot send messages: the cpg_mcast_joined function appends new messages to the FIFO buffer until it is full and then blocks indefinitely.
>    In this scenario, if the stasis_message_cb callback, which res_corosync registers to handle stasis messages, tries to send a corosync message, the thread from the stasis thread pool stays blocked until the node joins the corosync cluster.
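
The cache cleanup described in item 2 can be illustrated with a small, self-contained C sketch. It is not taken from res_corosync.diff and does not use the Asterisk stasis API: `struct node_eid`, `struct cached_state`, and `evict_by_eid` are hypothetical names, and a flat array stands in for the stasis cache. It only shows the idea of dropping every cached entry whose origin EID matches the member that left.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a 6-byte entity ID (EID). */
struct node_eid {
    unsigned char bytes[6];
};

/* Hypothetical cached device state: who published it and what it said. */
struct cached_state {
    struct node_eid origin;
    char device[32];
    char state[16];
};

/*
 * Remove every entry published by the node `left`, compacting the
 * array in place and preserving the order of the surviving entries.
 * Returns the number of entries that remain.
 */
size_t evict_by_eid(struct cached_state *cache, size_t count,
                    const struct node_eid *left)
{
    size_t kept = 0;

    assert(cache != NULL && left != NULL);

    for (size_t i = 0; i < count; i++) {
        if (memcmp(cache[i].origin.bytes, left->bytes,
                   sizeof(left->bytes)) == 0) {
            /* Stale entry: its publisher is gone and will never update it. */
            continue;
        }
        if (kept != i) {
            cache[kept] = cache[i];
        }
        kept++;
    }
    return kept;
}
```

In this model, a cluster "member left" notification would call `evict_by_eid` with the departed node's EID, so a leftover BUSY entry from that node can no longer outrank the current state reported by the remaining nodes.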

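The guard described in item 3 can be sketched in the same spirit. This is a minimal model under stated assumptions, not the code from the patch: `cluster_joined`, `set_cluster_joined`, `publish_if_joined`, and `send_to_cluster` are hypothetical names, and `send_to_cluster` merely counts calls where the real module would hand the message to corosync. The point is that the publishing path checks a "joined" flag first and drops the event instead of calling a send function that could block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical flag: the dispatch thread would set it when the node
 * joins the group and clear it when the node leaves. Static storage,
 * so it starts out false.
 */
static atomic_bool cluster_joined;

/* Hypothetical counter standing in for messages handed to the cluster. */
static atomic_size_t sent_count;

/* Stand-in for the real multicast send; here it only counts calls. */
static int send_to_cluster(const void *payload, size_t len)
{
    (void)payload;
    (void)len;
    atomic_fetch_add(&sent_count, 1);
    return 0;
}

/* Called from the (hypothetical) dispatch thread on membership changes. */
void set_cluster_joined(bool joined)
{
    atomic_store(&cluster_joined, joined);
}

/*
 * Publish only while joined. Returns true if the message was handed to
 * the cluster, false if it was dropped because the node is not joined.
 * Dropping keeps the calling thread-pool thread from blocking.
 */
bool publish_if_joined(const void *payload, size_t len)
{
    if (!atomic_load(&cluster_joined)) {
        return false;
    }
    return send_to_cluster(payload, len) == 0;
}
```

A flag check like this is inherently racy (the node can leave right after the check), so it narrows the window in which a send can block rather than closing it completely; whether that trade-off matches the attached patch is not something this sketch claims.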




