<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 31/01/12 12:58, Alex Villac&iacute;s Lasso wrote:

    <blockquote cite="mid:4F282BAD.8070903@palosanto.com" type="cite">


      <div class="moz-text-flowed" style="font-family: -moz-fixed;

        font-size: 14px;" lang="x-western">I am having problems with a

        deadlock in Asterisk 1.8.9.0. <br>

        <br>

        The system is an x86_64 machine that is being used as a

        callcenter. The agents log in via the AgentLogin application,

        and each Agent/XXXX channel is assigned to one or more queues. A

        custom separate process generates calls into the queues for the

        agents to answer. The calls all go out through a SIP trunk, and

        all of the agent extensions are SIP. After an hour or so,

        asterisk deadlocks. Any attempt to run "agent show" or "agent

        show online" through the console hangs. Also, AMI events seem to

        stop. However, the users seem to be still connected, only they

        do not receive calls anymore (the custom process waits forever

        for the Originate response). The deadlock is apparently

        spontaneous - there is no explicit action taken by the

        administrator that seems to induce the issue. I will try to make

        sense of the attached traces, but I hope someone on the list

        could provide a clue on what to look for. <br>

        <br>

        Backtraces attached to


        <a moz-do-not-send="true"

          href="https://issues.asterisk.org/jira/browse/ASTERISK-19285">https://issues.asterisk.org/jira/browse/ASTERISK-19285</a></div>

    </blockquote>

    I have marked this bug as a regression because the patch that is

    supposed to fix <a

      href="https://issues.asterisk.org/jira/browse/ASTERISK-18092">https://issues.asterisk.org/jira/browse/ASTERISK-18092</a>

    is probably the one that introduced this bug.<br>

    <br>

    This is what I believe is happening:

    <div class="action-body flooded">

      <p>The user is running a script that periodically invokes the AMI

        action "Agents", which is handled by action_agents() in

        channels/chan_agent.c:1499. This function traverses the agent

        list, and for each one first takes a lock on struct agent_pvt *p

        (chan_agent.c:1516), then attempts to take a lock on p-&gt;owner

        (a channel of type Agent, I think) at chan_agent.c:1534, in

        order to check whether this is a bridged channel. This second

        lock is the one that was introduced by the patch that "fixes" <a

          href="https://issues.asterisk.org/jira/browse/ASTERISK-18092"

          title="asterisk segfault libpthread-2.9.so"><del>ASTERISK-18092</del></a>.</p>

      <p>Meanwhile, in another thread, some frames need to be written to

        the Agent/xxxx channel, at ast_write() in main/channel.c:4767.

        In channel.c:4774, a lock is taken on the channel (which happens

        to be the one at p-&gt;owner), and then the tech-specific write

        method is invoked at channel.c:5032. For Agent channels, this

        method is agent_write() at channels/chan_agent.c:691. This

        method extracts tech_pvt from the channel (which happens to be

        an agent_pvt, the one picked up in the other thread at line

        1516), then attempts to take a lock on it. Each thread now holds

        the lock that the other one is waiting for, hence the deadlock.</p>
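      <p>The two lock orders described above can be reproduced outside of
        Asterisk. What follows is only a minimal sketch using plain POSIX
        mutexes: pvt_lock and channel_lock are hypothetical stand-ins for
        the agent_pvt lock and the p-&gt;owner channel lock, and run_path(),
        rendezvous() and lock_order_inversion_detected() are names invented
        for this illustration. Each thread takes its first lock, waits until
        the other thread holds its own first lock, and then probes the
        second lock with pthread_mutex_trylock() instead of blocking, so the
        program reports the inversion rather than hanging.</p>

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the two locks in the report (not Asterisk code). */
static pthread_mutex_t pvt_lock = PTHREAD_MUTEX_INITIALIZER;     /* agent_pvt lock */
static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER; /* p->owner lock  */

/* Small rendezvous helper so both threads reach the same point together. */
static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t sync_cond = PTHREAD_COND_INITIALIZER;
static int arrived = 0;

/* Count this arrival, then wait until `target` arrivals have been counted. */
static void rendezvous(int target)
{
    pthread_mutex_lock(&sync_lock);
    arrived++;
    pthread_cond_broadcast(&sync_cond);
    while (arrived < target) {
        pthread_cond_wait(&sync_cond, &sync_lock);
    }
    pthread_mutex_unlock(&sync_lock);
}

struct path {
    pthread_mutex_t *first;   /* lock taken first on this code path  */
    pthread_mutex_t *second;  /* lock taken second on this code path */
    int second_was_busy;      /* set to 1 if the second lock was held */
};

static void *run_path(void *arg)
{
    struct path *p = arg;

    pthread_mutex_lock(p->first);
    rendezvous(2);  /* both threads now hold their first lock */

    /* A blocking pthread_mutex_lock() here would never return. */
    p->second_was_busy = (pthread_mutex_trylock(p->second) == EBUSY);

    rendezvous(4);  /* keep the first lock until both threads have probed */
    if (!p->second_was_busy) {
        pthread_mutex_unlock(p->second);
    }
    pthread_mutex_unlock(p->first);
    return NULL;
}

/* Returns 1 when each path found its second lock held by the other path. */
int lock_order_inversion_detected(void)
{
    struct path agents_action = { &pvt_lock, &channel_lock, 0 }; /* like action_agents() */
    struct path frame_write = { &channel_lock, &pvt_lock, 0 };   /* like ast_write()     */
    pthread_t a, b;
    int rc;

    arrived = 0;
    rc = pthread_create(&a, NULL, run_path, &agents_action);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, run_path, &frame_write);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    return agents_action.second_was_busy && frame_write.second_was_busy;
}
```

      <p>Calling lock_order_inversion_detected() from a small main() should
        return 1 here, which corresponds to the situation in the attached
        backtraces: neither thread can obtain its second lock while the
        other one keeps its first.</p>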

      <p>I was about to perform what amounts to a revert of the fix for

        <a href="https://issues.asterisk.org/jira/browse/ASTERISK-18092"

          title="asterisk segfault libpthread-2.9.so"><del>ASTERISK-18092</del></a>,

        but then I looked in the CHANGELOG and realized this other

        issue. Also, I found an inconsistency in the handling of the

        Agents action as compared to the commands "agent show" and

        "agent show online". If a thread that holds the lock on agent_pvt

        must also take the lock on agent_pvt-&gt;owner, then

        agents_show() at line 1702 and agents_show_online() at line 1771

        must take that lock as well in order to be consistent with the

        "fix" for action_agents(). On the other hand, my original plan

        was to revert the change in action_agents() so that the function

        is consistent with agents_show() and agents_show_online(),

        neither of which takes the lock.</p>

    </div>

    So, which lock order is the correct one? Should the code in

    chan_agent.c release the lock on agent_pvt before taking the lock on

    the owner channel? Should the ast_channel_lock() at

    chan_agent.c:1534 be replaced by a call to ast_channel_trylock(), as

    used by main/channel.c? Other ideas?<br>
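    <p>To make the trylock question concrete, here is a rough sketch of the
      back-off pattern I have in mind, again with plain POSIX mutexes
      rather than the real Asterisk locks. Everything in it is an
      assumption made for illustration: pvt_lock and channel_lock stand in
      for agent_pvt and p-&gt;owner, and report_path(), write_path() and
      run_trylock_demo() are invented names. The reporting path still takes
      the pvt lock first, but it never blocks on the channel lock while
      holding it; if the channel is busy, it drops the pvt lock, yields and
      retries.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-ins, not the real Asterisk objects. */
static pthread_mutex_t pvt_lock = PTHREAD_MUTEX_INITIALIZER;     /* agent_pvt */
static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER; /* p->owner  */
static long work_done = 0; /* only modified while both locks are held */

/* Reporting path: pvt first, then channel, without ever blocking on the
 * channel lock while the pvt lock is held. */
static void lock_pvt_then_channel(void)
{
    pthread_mutex_lock(&pvt_lock);
    while (pthread_mutex_trylock(&channel_lock) != 0) {
        /* Back off so the writer can take pvt_lock and finish its work. */
        pthread_mutex_unlock(&pvt_lock);
        sched_yield();
        pthread_mutex_lock(&pvt_lock);
        /* Real code would have to re-check p->owner at this point, since
         * it may have changed while pvt_lock was released. */
    }
}

static void *report_path(void *arg)
{
    long i, n = *(long *)arg;

    for (i = 0; i < n; i++) {
        lock_pvt_then_channel();
        work_done++;
        pthread_mutex_unlock(&channel_lock);
        pthread_mutex_unlock(&pvt_lock);
    }
    return NULL;
}

/* Writer path: channel first, then pvt, with ordinary blocking locks. */
static void *write_path(void *arg)
{
    long i, n = *(long *)arg;

    for (i = 0; i < n; i++) {
        pthread_mutex_lock(&channel_lock);
        pthread_mutex_lock(&pvt_lock);
        work_done++;
        pthread_mutex_unlock(&pvt_lock);
        pthread_mutex_unlock(&channel_lock);
    }
    return NULL;
}

/* Runs both paths concurrently and returns the total number of updates. */
long run_trylock_demo(long iterations)
{
    pthread_t reporter, writer;
    int rc;

    work_done = 0;
    rc = pthread_create(&reporter, NULL, report_path, &iterations);
    assert(rc == 0);
    rc = pthread_create(&writer, NULL, write_path, &iterations);
    assert(rc == 0);
    (void)rc;
    pthread_join(reporter, NULL);
    pthread_join(writer, NULL);
    return work_done;
}
```

    <p>With this arrangement both threads run to completion even though
      they take the locks in opposite orders, because only one of them ever
      waits while holding a lock. The cost is that the reporting path may
      spin for a while under load and must revalidate whatever it read
      before backing off.</p>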

  </body>

</html>