[asterisk-dev] What happened with the latest round of releases: or, "whoops"

Matthew Jordan mjordan at digium.com
Fri Jun 13 09:42:20 CDT 2014


On Fri, Jun 13, 2014 at 4:41 AM, Steven Howes <steve-lists at geekinter.net>
wrote:

> On 13 Jun 2014, at 08:12, Matthew Jordan <mjordan at digium.com> wrote:
> > Apologies if this e-mail gets a bit rambling; by the time I send this it
> will be past 2 AM here in the US and we've been scrambling to fix the
> regression caused by r415972 without reintroducing the vulnerability it
> fixed for the past 9 hours or so.
> >
> > Clearly, there are things we should have done better to catch this
> before the security releases went out yesterday. The regression was serious
> enough that plenty of tests in the Test Suite caught the error - in fact,
> development of a test on a local dev machine was how we discovered that the
> regression had occurred.
>
> I’ve not been directly involved with the whole commit/testing procedure,
> so excuse me if I’m misreading anything..
>
> If it fails the tests, how was it released? I understand the whole reduced
> transparency/communications thing, it’s an unfortunate necessity of dealing
> with security issues. I can’t see how that excludes the testing carried out
> by the Test Suite though?
>
> Kind regards,
>
>
Disregarding local test suite runs, a few things happened here:

(1) Four security patches were made at roughly the same time.
Unfortunately, the patch with the issue was the last one to get committed -
and by the time that occurred, there were a large number of jobs scheduled
in front of it.

(2) The order of execution of jobs in Bamboo is the following:
     (a) Basic build (simple compile test) on first available build agent =>
     (b) Full build (multiple compile options, e.g., parallel builds) on
all different flavors of build agent =>
     (c) Unit test run =>
     (d) Channel driver tests in the Test Suite =>
     (e) ARI tests in the Test Suite
    Nightly, a full run of the test suite takes place.

    This issue would have been caught by step (d) - but each of the
previous steps takes a while to complete (Asterisk doesn't compile quickly),
and a test suite run takes a long time even with the reduced sets of tests in
steps (d) and (e). Each merge into a branch kicks this whole process off, and
there were at least 7 iterations of it queued in front of the offending patch
(a rough timing sketch is included below, after the third point). Which leads
to point #3:

(3) The merge process on the offending patch was slowed down due to merge
conflicts between branches. The merging of the patch into all branches
wasn't complete until nearly 3 PM, which meant we had very little time to
get the releases out - generally, we strive hard to get the security
releases out the door as early as possible, so system administrators have
time that day to upgrade their systems if they are affected.
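To put rough numbers on point #2: the sketch below is a back-of-envelope
estimate only. The per-stage durations and the queue depth are assumptions for
illustration (not measurements of our Bamboo agents), and it treats everything
as strictly serial, so it is closer to a worst case than to how the agents
actually overlap work.

# Worst-case serial estimate of how long it takes before the channel
# driver tests (step d) report on a commit, given earlier merges queued
# ahead of it. All durations below are assumed for illustration only.

STAGE_MINUTES = {
    "basic build": 10,           # (a) simple compile test
    "full build": 30,            # (b) multiple compile options, all agents
    "unit tests": 10,            # (c)
    "channel driver tests": 45,  # (d) where this regression would surface
    "ari tests": 20,             # (e)
}

QUEUED_MERGES_AHEAD = 7  # iterations already scheduled in front of the patch

def minutes_until_step_d():
    """Minutes of pipeline time before step (d) finishes for our commit."""
    # Assume everything queued ahead of us runs its whole pipeline first...
    ahead = QUEUED_MERGES_AHEAD * sum(STAGE_MINUTES.values())
    # ...then our own commit runs each stage in order, up through step (d).
    ours = 0
    for name, minutes in STAGE_MINUTES.items():
        ours += minutes
        if name == "channel driver tests":
            break
    return ahead + ours

if __name__ == "__main__":
    total = minutes_until_step_d()
    print("~%d minutes (~%.1f hours) before step (d) reports" % (total, total / 60))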

All of that aside, there are a few things (again, beyond running the test
suite locally) that could be done to improve the situation:

(a) Add a 'smoke test' to the Test Suite that gets run in either the Basic
Build or Full Build step. It would do some very simple things: originate a
call over AMI with a Local channel, use a SIP channel to connect to another
instance of Asterisk, pass media/DTMF, bounce back to the test using AGI, and
maybe a few other things. Such a test would hit a lot of our normal 'hot
spots' and - if run early enough in the cycle - would flag problems to
developers much sooner than the current process does (a rough sketch of the
AMI side of such a test is included below, after point (b)).

(b) Throw some more hardware at the problem. Right now, we have a single
32-bit/64-bit CentOS 6 machine - we could easily double that up, which
would get results faster.
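
For (a), here is a minimal sketch of just the AMI side of such a smoke test:
log in to the manager interface and originate a call on a Local channel into
a test context. The host, port, credentials, and dialplan target are
placeholders, and a real test would use the Test Suite's own harness rather
than raw sockets, then add the SIP, media/DTMF, and AGI legs described above.

import socket

AMI_HOST, AMI_PORT = "127.0.0.1", 5038
USERNAME, SECRET = "smoketest", "smoketest"  # placeholder manager.conf account

def send_action(sock, action, **headers):
    """Send one AMI action and return the raw response block."""
    lines = ["Action: %s" % action] + ["%s: %s" % (k, v) for k, v in headers.items()]
    sock.sendall(("\r\n".join(lines) + "\r\n\r\n").encode())
    response = b""
    while b"\r\n\r\n" not in response:
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk
    return response.decode(errors="replace")

def main():
    sock = socket.create_connection((AMI_HOST, AMI_PORT), timeout=10)
    sock.recv(4096)  # consume the "Asterisk Call Manager/..." banner

    print(send_action(sock, "Login", Username=USERNAME, Secret=SECRET))

    # Originate on a Local channel into a placeholder 'smoketest' context;
    # the dialplan behind it would dial a second Asterisk over SIP, pass
    # media/DTMF, and bounce back to the test through an AGI script.
    print(send_action(sock, "Originate",
                      Channel="Local/start@smoketest",
                      Context="smoketest", Exten="check", Priority=1,
                      Async="true"))

    send_action(sock, "Logoff")
    sock.close()

if __name__ == "__main__":
    main()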

-- 
Matthew Jordan
Digium, Inc. | Engineering Manager
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: http://digium.com & http://asterisk.org