[asterisk-dev] dahdi_device representation

Tilghman Lesher tlesher at digium.com
Sat Aug 28 09:21:17 CDT 2010


On Saturday 28 August 2010 07:14:38 Oron Peled wrote:
> On Saturday, 28 August 2010 05:04:06 Tilghman Lesher wrote:
> > On Friday 27 August 2010 17:54:45 Oron Peled wrote:
> > > During Shaun Ruffell's adventures with DAHDI persistent channel
> > > assignments he started implementing a very important feature (IMO) -- a
> > > representation of dahdi_device that represents a collection of spans.
> > >
> > > Lack of this representation caused hardware attributes to be
> > > duplicated in the spans while in reality they should be represented
> > > in the dahdi_device (e.g., location).
> > >
> > > I would like to use this opportunity to present a relevant issue that
> > > may IMO affect the design:
> > >
> > > 1. Historically, chan_dahdi was not made for hot-pluggable
> > >    devices.
> > >
> > > 2. As a result, after a successful open(), chan_dahdi ignores
> > >    read()/write() errors (except for the special errno used to pass
> > >    events).
> > >
> > > 3. This means that if a device is removed from under chan_dahdi's feet,
> > >    it goes into an infinite tight loop of failed read() calls, which
> > >    usually makes the host unresponsive within a few seconds (except for
> > >    the kernel) because asterisk usually runs at real-time priority.
> > >
> > > 4. Since Astribanks were always hot-pluggable, we "solved" this problem
> > >    by employing various measures in our xpp drivers:
> > >    - When a device is removed, we *keep* its data structure intact and
> > >      make a note to ourselves that it's disconnected.
> > >    - We send a red alarm to asterisk for disconnected devices, trying
> > >      to squelch some of the "noise".
> > >    - We ignore asterisk calls for disconnected devices.
> > >    - We added a "REMOVED" event to asterisk, politely asking it to
> > >      remove a span with all its channels.
> > >    - We refcount the open/close so if/when asterisk is nice and
> > >      actually closes all channels, we can actually release the data
> > >      structures.
> > >
> > >    BTW: only lately (during dial-byname development) did we manage to
> > >    fix asterisk so that removing a digital span would also close its
> > >    dchan.
> > >
> > > 5. Obviously, keeping "ghost" devices around so we don't surprise
> > >    asterisk is not a very good design, but we didn't see any
> > >    alternatives at the time.
> > >
> > > If chan_dahdi is not made aware of driver errors (e.g., -ENODEV),
> > > similar ugly techniques would be needed for a hot-plug implementation
> > > at the DAHDI level. This has some design consequences for the sysfs
> > > object layout and therefore should be thought about early.
> > >
> > > So the question is short:
> > >    Should DAHDI account for and work around chan_dahdi's ignorance?
> > >    Or should chan_dahdi be fixed first?
> >
> > Yes, DAHDI will need to work around this, since we cannot ensure that
> > each Asterisk installation will upgrade the userland piece to a version
> > which is sufficient to work around the problem.  One question, though. 
> > If this is fixed in both locations, what method would you prefer to
> > communicate that chan_dahdi has been fixed, and DAHDI doesn't need to
> > employ the work around?  Or would you prefer to simply keep the
> > workaround active in DAHDI regardless of whether it is necessary for
> > chan_dahdi?  Perhaps it would be sufficient to detect the poor behavior
> > (multiple successive read()s which fail) and employ the workaround only
> > in that case.
>
> 0. Keeping production systems working is a given. But let's look at
>    some considerations.
>
> 1. First, regardless of the workarounds we may implement in DAHDI,
>    this is a bug that exists in all asterisk installations and should be
>    fixed anyway. Here is one manifestation of this bug:
>       https://issues.asterisk.org/view.php?id=17669

We do not disagree here.  My preference would be for all users to upgrade to
the latest branch and keep up with bugfixes.  However, our users are on their
own schedules, and some of them prefer to run older releases because they have
adapted to, or even come to depend upon, the behavior of certain bugs.

> 2. If the fix in chan_dahdi is small/simple (haven't looked at the code
>    yet) then there's no reason not to apply it for all maintained
>    asterisk versions (including 1.4.x, 1.6.x). This may significantly
>    reduce the time needed to maintain an ugly solution (e.g., from
>    infinity to 1 year ;-)

Yes, this is not in question.  What is in question is how long people will
continue to run older versions on purpose.  It will likely continue beyond the
point where we stop supporting a particular branch for security issues, which,
even for the 1.4 branch, is going to be a span of several years.

> 3. I think the most prominent effect of this workaround is that it changes
>    the lifecycle of sysfs data structures (the inability to free the device
>    data structure on time). So I think we will not gain anything
>    significant by trying to detect a buggy asterisk at run time.
>
> 4. It would be nice if we can delineate this code with #ifdefs so we can
>    remove it later, but I'm not sure if that would be easy (again, due
>    to its structural nature).
>
> 5. The idea of run-time checks may be useful in a later phase when
>    we want to urge users to update their installation to a non-buggy
>    version.

All versions are buggy, but some versions are less buggy than others.  Persons
who are depending upon certain buggy behavior for their own purposes will
generally refuse to upgrade to versions which no longer contain their
particular adopted bugs.  This isn't a problem that we can wave a magic wand
over and fix permanently by merely fixing the userland piece.
Because users are more apt to upgrade kernels and their corresponding kernel
drivers than their customized applications, we have to ensure that our kernel
driver revisions don't break the older userland pieces.
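
To make the "detect the poor behavior" idea from earlier in the thread a bit
more concrete, here is a rough sketch of the bookkeeping it would need.  This
is plain C for illustration only, not DAHDI code: the struct, the function
name, and the threshold of 16 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold; not a value taken from DAHDI. */
#define FAILED_READ_THRESHOLD 16

/* Per-open-channel bookkeeping (hypothetical layout). */
struct chan_read_state {
    unsigned int consecutive_failures; /* failed read()s since last success */
    bool workaround_active;            /* keep the "ghost" device around */
};

/*
 * Record the outcome of one read() on a channel whose device has gone away.
 * Returns true once the opener looks like it is spinning on failed reads,
 * i.e. it behaves like an unfixed chan_dahdi. The flag is sticky: a later
 * successful read resets the counter but does not clear the workaround.
 */
static bool note_read_result(struct chan_read_state *st, bool read_ok)
{
    if (read_ok) {
        st->consecutive_failures = 0;
        return st->workaround_active;
    }
    if (st->consecutive_failures < FAILED_READ_THRESHOLD)
        st->consecutive_failures++;
    if (st->consecutive_failures >= FAILED_READ_THRESHOLD)
        st->workaround_active = true;
    return st->workaround_active;
}
```

A fixed userland would stop reading after the first error and never reach the
threshold, so the workaround would stay off for it.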

By all means, please fix the userland piece going forward.  But we cannot
depend upon the userland getting upgraded when the kernel driver is upgraded.
So the kernel driver MUST be able to handle both scenarios for a very long
time.
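
For the userland side, the shape of the fix is roughly the following.  Again,
this is only a sketch under assumptions and not the actual chan_dahdi code:
the enum, the function name, and EVENT_ERRNO_PLACEHOLDER (standing in for the
special errno used to pass events) are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Placeholder for the special errno the driver uses to signal a pending
 * event. The real value comes from the DAHDI headers, not from here.
 */
#define EVENT_ERRNO_PLACEHOLDER 500

enum read_action {
    READ_RETRY,       /* transient condition: call read() again */
    READ_FETCH_EVENT, /* driver has an event queued for us */
    READ_HANGUP       /* device is gone: stop reading, close the channel */
};

/* Decide what to do after read() returned -1 with the given errno. */
static enum read_action classify_read_errno(int err)
{
    switch (err) {
    case EINTR:
    case EAGAIN:
        return READ_RETRY;
    case EVENT_ERRNO_PLACEHOLDER:
        return READ_FETCH_EVENT;
    case ENODEV:
    default:
        /* Treat anything unexpected as fatal instead of looping on it. */
        return READ_HANGUP;
    }
}
```

The point of defaulting to READ_HANGUP is that an unknown error can no longer
turn into a tight loop at real-time priority.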

-- 
Tilghman Lesher
Digium, Inc. | Senior Software Developer
twitter: Corydon76 | IRC: Corydon76-dig (Freenode)
Check us out at: www.digium.com & www.asterisk.org


