[asterisk-dev] dahdi_device representation

Sun Aug 29 00:59:21 CDT 2010

On 8/28/10 7:14 AM, Oron Peled wrote:
> On Saturday, 28 August 2010 05:04:06 Tilghman Lesher wrote:
>> On Friday 27 August 2010 17:54:45 Oron Peled wrote:
>>> During Shaun Ruffell's adventures with DAHDI persistent channel assignments
>>> he started implementing a very important feature (IMO) -- a representation
>>> of dahdi_device that represents a collection of spans.
>>>
>>> Lack of this representation caused hardware attributes to be
>>> duplicated in the spans while in reality they should be represented
>>> in the dahdi_device (e.g., location).
>>>
>>> I would like to use this opportunity to present a relevant issue that
>>> may IMO affect the design:
>>>
>>> 1. Historically, chan_dahdi was not made for hot-pluggable
>>>     devices.
>>>
>>> 2. As a result, after a successful open(), chan_dahdi ignores
>>>     read()/write() errors (except for the special errno used to pass events).
>>>
>>> 3. This means that if a device is removed from under chan_dahdi's feet, it
>>>     goes into an infinite tight loop of failed read() calls, which usually
>>>     makes the host (except for the kernel) unresponsive after a few seconds
>>>     because asterisk usually runs at real-time priority.
>>>
>>> 4. Since Astribanks were always hot-pluggable, we "solved" this problem
>>>     by employing various measures in our xpp drivers:
>>>     - When a device is removed, we *keep* its data structure intact and
>>>        make a note to ourselves that it's disconnected.
>>>     - We send a red alarm to asterisk for disconnected devices, trying
>>>        to squelch some of the "noise".
>>>     - We ignore asterisk calls for disconnected devices.
>>>     - We added a "REMOVED" event to asterisk, politely asking it to remove
>>>        a span with all its channels.
>>>     - We refcount the open/close so if/when asterisk is nice and actually
>>>        closes all channels, we can actually release the data structures.
>>>
>>>     BTW: only lately (during dial-byname development) we managed to fix
>>>             asterisk so removing a digital span would also close its dchan.
>>>
>>> 5. Obviously, keeping "ghost" devices around so we don't surprise asterisk
>>>      is not a very good design, but we didn't see any alternatives at the
>>> time.
>>>
>>> If chan_dahdi is not made aware of driver errors (e.g., -ENODEV), similar
>>> ugly techniques would be needed for hot-plug implementation at the DAHDI
>>> level. This has some design consequences for the sysfs object layout and
>>> therefore should be thought about early.
>>>
>>> So the question is short:
>>>     Should DAHDI account for and work around chan_dahdi's ignorance?
>>>     Or should chan_dahdi be fixed first?
>>
>> Yes, DAHDI will need to work around this, since we cannot ensure that each
>> Asterisk installation will upgrade the userland piece to a version which is
>> sufficient to work around the problem.  One question, though.  If this is
>> fixed in both locations, what method would you prefer to communicate that
>> chan_dahdi has been fixed, and DAHDI doesn't need to employ the
>> workaround?  Or would you prefer to simply keep the workaround active in DAHDI
>> regardless of whether it is necessary for chan_dahdi?  Perhaps it would be
>> sufficient to detect the poor behavior (multiple successive read()s which
>> fail) and employ the workaround only in that case.
>
> 0. Keeping production systems working is a given. But let's look at
>     some considerations.
>
> 1. First, regardless of the workarounds we may implement in DAHDI,
>     this is a bug that exists in all asterisk installations and should be
>     fixed anyway. Here is one manifestation of this bug:
>        https://issues.asterisk.org/view.php?id=17669
>
> 2. If the fix in chan_dahdi is small/simple (haven't looked at the code
>     yet) then there's no reason not to apply it for all maintained
>     asterisk versions (including 1.4.x, 1.6.x). This may significantly
>     reduce the time needed to maintain an ugly solution (e.g., from
>     infinity to 1 year ;-)
>
> 3. I think the most prominent effect of this workaround is that it changes
>     the lifecycle of sysfs data structures (the inability to free the device
>     data structure on time). So I think we will not gain anything significant
>     by trying to detect a buggy asterisk at run time.
>

Would implementing something along the lines of what Tilghman suggested
suffice?  If the goal is to prevent an unresponsive system, could we
sleep for 5 milliseconds before returning ENODEV, so that the failed
read() loop is no longer tight, without affecting the lifetime of the
sysfs objects?

Something like the second attachment (just added) on
<https://issues.asterisk.org/view.php?id=17669>?
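
For the userland half of the question, here is a rough sketch of what
"being aware of driver errors" might look like. It is not the attachment
on the issue and it is not chan_dahdi code: every name in it is
hypothetical, the reader is an injectable function pointer so the logic
can be exercised without a real DAHDI device, and the 5 ms pause and the
error limit are arbitrary illustrative values. The idea is simply that
ENODEV ends the loop at once, and any other repeated failure is slowed
down and eventually given up on instead of spinning.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* All names here are hypothetical; nothing is taken from chan_dahdi. */
enum chan_read_status { CHAN_READ_OK, CHAN_READ_RETRY, CHAN_READ_GONE };

/* Same shape as read(2), so read itself can be plugged in. */
typedef ssize_t (*read_fn_t)(int fd, void *buf, size_t len);

#define MAX_CONSECUTIVE_ERRORS 3    /* assumed limit, for illustration */

struct chan_reader {
    int fd;
    read_fn_t read_fn;
    unsigned int consecutive_errors;
};

/* Pause for 5 ms so a failing read can never become a tight loop. */
void backoff_5ms(void)
{
    struct timespec ts = { 0, 5L * 1000L * 1000L };
    nanosleep(&ts, NULL);
}

/* One read attempt; the caller loops only while this returns
 * CHAN_READ_OK or CHAN_READ_RETRY. */
enum chan_read_status chan_read_once(struct chan_reader *r, void *buf,
                                     size_t len, ssize_t *got)
{
    assert(r != NULL && r->read_fn != NULL && got != NULL);

    ssize_t n = r->read_fn(r->fd, buf, len);
    if (n >= 0) {
        r->consecutive_errors = 0;
        *got = n;
        return CHAN_READ_OK;
    }

    *got = -1;
    if (errno == ENODEV)
        return CHAN_READ_GONE;      /* device removed: stop reading */
    if (errno == EINTR || errno == EAGAIN)
        return CHAN_READ_RETRY;     /* transient, not counted as a failure */

    /* Any other error: give up after a few in a row, otherwise back off. */
    if (++r->consecutive_errors >= MAX_CONSECUTIVE_ERRORS)
        return CHAN_READ_GONE;
    backoff_5ms();
    return CHAN_READ_RETRY;
}

/* Stand-ins for a real descriptor, used only to exercise the logic. */
int fake_errno = EIO;

ssize_t fake_read_fail(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    errno = fake_errno;
    return -1;
}

ssize_t fake_read_ok(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf;
    return (ssize_t)len;
}
```

With a real channel, read_fn would be read(2) and the caller would tear
the channel down when CHAN_READ_GONE comes back. That only helps on
installations that have the fixed userland, which is why the driver-side
delay is still interesting.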

-- 
Shaun Ruffell
Digium, Inc. | Linux Kernel Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: www.digium.com & www.asterisk.org