[asterisk-dev] dahdi_device representation
Shaun Ruffell
sruffell at digium.com
Sun Aug 29 00:59:21 CDT 2010
On 8/28/10 7:14 AM, Oron Peled wrote:
> On Saturday, 28 August 2010 05:04:06 Tilghman Lesher wrote:
>> On Friday 27 August 2010 17:54:45 Oron Peled wrote:
>>> During Shaun Ruffell's adventures with DAHDI persistent channel assignments,
>>> he started implementing a very important feature (IMO) -- a representation
>>> of dahdi_device that represents a collection of spans.
>>>
>>> Lack of this representation caused hardware attributes to be
>>> duplicated in the spans while in reality they should be represented
>>> in the dahdi_device (e.g: location)
>>>
>>> I would like to use this opportunity to present a relevant issue that
>>> may IMO affect the design:
>>>
>>> 1. Historically, chan_dahdi was not made for hot-pluggable
>>> devices.
>>>
>>> 2. As a result, after a successful open(), chan_dahdi ignores
>>> read()/write() errors (except for the special errno used to pass events).
>>>
>>> 3. This means that if a device is removed from under chan_dahdi's feet, it
>>> goes into an infinite tight loop of failed read() calls, which usually makes
>>> the host unresponsive after a few seconds (except for the kernel)
>>> because asterisk usually runs at real-time priority.
>>>
>>> 4. Since Astribanks were always hot-pluggable, we "solved" this problem
>>> by employing various measures in our xpp drivers:
>>> - When a device is removed, we *keep* its data structure intact and
>>> make a note to ourselves that it's disconnected.
>>> - We send a red alarm to asterisk for disconnected devices, trying
>>> to squelch some of the "noise".
>>> - We ignore asterisk calls for disconnected devices.
>>> - We added a "REMOVED" event to asterisk, politely asking it to remove
>>> a span with all its channels.
>>> - We refcount the open/close so if/when asterisk is nice and actually
>>> closes all channels, we can actually release the data structures.
>>>
>>> BTW: only lately (during dial-byname development) we managed to fix
>>> asterisk so removing a digital span would also close its dchan.
>>>
>>> 5. Obviously, keeping "ghost" devices around so we don't surprise asterisk
>>> is not a very good design, but we didn't see any alternatives at the
>>> time.
>>>
>>> If chan_dahdi is not made aware of driver errors (e.g: -ENODEV), similar
>>> ugly techniques would be needed for hot-plug implementation at the DAHDI
>>> level. This has some design consequences for the sysfs object layout and
>>> therefore should be thought about early.
>>>
>>> So the question is short:
>>> Should DAHDI account for and work around chan_dahdi's ignorance?
>>> Or should chan_dahdi be fixed first?
>>
>> Yes, DAHDI will need to work around this, since we cannot ensure that each
>> Asterisk installation will upgrade the userland piece to a version which is
>> sufficient to work around the problem. One question, though. If this is
>> fixed in both locations, what method would you prefer to communicate that
>> chan_dahdi has been fixed, and DAHDI doesn't need to employ the
>> workaround? Or would you prefer to simply keep the workaround active in DAHDI
>> regardless of whether it is necessary for chan_dahdi? Perhaps it would be
>> sufficient to detect the poor behavior (multiple successive read()s which
>> fail) and employ the workaround only in that case.
>
> 0. Keeping production systems working is a given. But let's look at
> some considerations.
>
> 1. First, regardless of the workarounds we may implement in DAHDI,
> this is a bug that exists in all asterisk installations and should be
> fixed anyway. Here is one manifestation of this bug:
> https://issues.asterisk.org/view.php?id=17669
>
> 2. If the fix in chan_dahdi is small/simple (haven't looked at the code
> yet) then there's no reason not to apply it to all maintained
> asterisk versions (including 1.4.x, 1.6.x). This may significantly
> reduce the time needed to maintain an ugly solution (e.g: from
> infinity to 1 year ;-)
>
> 3. I think the most prominent effect of this workaround is that it changes
> the lifecycle of sysfs data structures (the inability to free the device
> data structure on time). So I think we will not gain anything significant
> by trying to detect a buggy asterisk at run time.
>
Would implementing something along the lines of what Tilghman suggested
suffice? If the goal is to prevent an unresponsive system, could we
sleep for 5 milliseconds before returning ENODEV in order to keep the
system responsive without affecting the lifetime of the sysfs objects?
Something like the second attachment (just added) on
https://issues.asterisk.org/view.php?id=17669?
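
For the userland half of the discussion, here is a minimal sketch of a read
path that stops on ENODEV instead of spinning, and that also gives up after
several successive failed reads, along the lines of the detection Tilghman
mentioned. This is not chan_dahdi code and not the attachment on the issue:
every name below (chan_reader, chan_read_once, the stub readers) and the
failure threshold are assumptions made for illustration, and the special
errno used to pass events is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical names throughout; none of this is taken from chan_dahdi. */
enum chan_state { CHAN_OK, CHAN_RETRY, CHAN_GONE };

#define MAX_SUCCESSIVE_FAILURES 10   /* assumed threshold, not from the thread */

struct chan_reader {
    int fd;
    int failures;                    /* successive failed reads so far */
};

/* read(2)-shaped callback so the logic can be exercised without hardware. */
typedef ssize_t (*read_fn)(int fd, void *buf, size_t len);

/* Perform one read and classify the result instead of ignoring errors. */
static enum chan_state chan_read_once(struct chan_reader *r, read_fn do_read,
                                      void *buf, size_t len)
{
    assert(r != NULL && do_read != NULL);

    ssize_t n = do_read(r->fd, buf, len);
    if (n >= 0) {
        r->failures = 0;             /* any success resets the counter */
        return CHAN_OK;
    }
    if (errno == ENODEV)
        return CHAN_GONE;            /* device removed: stop reading, close fd */
    if (errno == EINTR || errno == EAGAIN)
        return CHAN_RETRY;           /* transient, not counted as a failure */
    if (++r->failures >= MAX_SUCCESSIVE_FAILURES)
        return CHAN_GONE;            /* persistent errors: treat as removed */
    return CHAN_RETRY;
}

/* Stub readers standing in for a healthy, a removed, and a failing device. */
static ssize_t stub_read_ok(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf;
    return (ssize_t)len;
}

static ssize_t stub_read_removed(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    errno = ENODEV;
    return -1;
}

static ssize_t stub_read_eio(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    errno = EIO;
    return -1;
}
```

A caller would close the descriptor and tear down the channel on CHAN_GONE
rather than calling read() again. With that in place, a driver-side delay
before returning ENODEV would only matter for installations that still run
an unfixed chan_dahdi.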
--
Shaun Ruffell
Digium, Inc. | Linux Kernel Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: www.digium.com & www.asterisk.org