[asterisk-dev] Better pattern matching

Thu Aug 2 08:13:51 CDT 2007

On Wed, 2007-08-01 at 18:42 -0500, John Lange wrote: 
> Steve, first let me say that your work at measuring the performance
> issues surrounding the existing system in Asterisk is quite amazing.
> Good work.
> 
> At first blush the mgcp RFC would seem like the natural choice because
> it is already implemented in most SIP devices, it's much easier to learn
> than regexp, and as such it is something that many VOIP admins would
> already be familiar with.
> 
> Unfortunately, as you mentioned it does not cover alpha so it is still
> not a comprehensive answer but given unlimited time and resources I
> think it should be one of the options for pattern matching. However the
> priority should still be to do regexp first. For one thing PCRE already
> exists as a library that can be used so it should be easier implement.
> 
> Given some of your comments below I would just like to reiterate that I
> don't think _anything_ should be done to the current system. New systems
> of pattern matching should be added using different designations to
> trigger different types of matching.
> 
> So for example; the existing system uses underscore:
> 
> exten => _1NXXNXXXXXX,1, ...
> 
> mgcp could use brackets:
> 
> exten => (0T| 00T|1xxxxxxxxxx.T),1, ...
> 
> regexp could use its traditional slashs:
> 
> exten => /1[2-9][0-9]+/,1, ...
> 
> etc.

I like the idea of using different notations to allow different
pattern matching algorithms.

But, if we want Asterisk to step up into a higher volume use arena,
we absolutely have to re-do the extension pattern matching algorithm
to avoid linear searching. There are ramifications, as I said before,
to things like realtime, where multiple database reads need to be done
to select a proper extension.

I was about to go on and on about scoring matches, but the regex libs
will only give us a set of patterns that match the given string. These
patterns can be assigned scores when they are compiled. I think. I'll
have to check on that.
At least, I hope the scoring is based on the patterns, not on the string
they match. 

> 
> > In order to speed up the algorithm, and  make response more flat in
> > asterisk, no matter the size or composition of the dialplan, I
> > resorted to using hash tables
> 
> Does this imply that this is code that is already in Asterisk?

It's in a branch (fast-ast), and needs some work to be useful in
Asterisk in general. What I demonstrated is speedups (really
anti-slowdowns) in general in-mem dial-plan usage. But, it can't be used
with realtime, where the priorities are stored in a db. We've tossed
around how we could speed up realtime, but it involves a different
usage/ API for realtime. So, until we re-do realtime, we have to keep
the current algorithm around.

I think we could implement a reasonable subset of regex in the current
environment, and not have to completely re-engineer the matchers. To
avoid changing the current pattern syntax, we could use your suggestion
for /pattern/ type notation. That way, we get the best of both worlds,
some more powerful pattern matching, without sacrificing the speed of a
simpler tree-based pattern representation. 

T is not really possible in the current regime; the pattern matchers are
called by code that already factors timeouts. So even the simpler mgcp
syntax would have to be subset'd.

Here's what I THINK I could provide in the way of a regex-subset, and
still use the tree algorithm to speed up the match:

First of all, regexps implicitly begin with ^ (start of line), and end
with $ (end of line). I don't picture just looking for patterns
somewhere in the string.

[xyz] character class (simple)  (also called "bracket expression")
[a-zA-Z] character class (range(s))
[^xyz]   anti-char class
[-xyz] include the - in the class
[^-xyz] anything but the -,x,y,z
(I'm not wild about providing []], [^]],  [=x=], [:whatever:] or [.xyz.]
notation... I doubt it necessary)

I could allow the use of *, +, ? for single chars and be OK.
a*        0 or more a's
a+        1 or more a's
a?        0 or 1 a's.
[0-9]+    1 or more digits.
[^0-9]*   0 or more non-digits

...because we just advance to the next char if it's not there, and 0 are
allowed, and we iterate in place when more than 1 are allowed.

The OR-notation could be acceptable:
nancy|jim|james  will match "nancy", "jim", or "james".

. would be ok to mean any one char; it'd stand for a char class with
everything in it.

We could use {3,}, {,3}, {3}, and {3,5} notation also; when we iterate
on a repetition, we can keep counts and terminate the match the moment
the count is violated.

Paren grouping is a bit problematic. I'd prefer to not allow parens in
the regex.

Trailing context: stuff like .+ABC  or .+X[yz]+ probably won't work like
you hoped. If you are looking for something that ends in ABC, hopefully,
the char in front of that is something you can match against, or A won't
appear anywhere else, so you can say [^A]+ABC.  Trying to do trailing
context is kinda expensive,
you need to recursively call the pattern matcher on the remaining part
of the string. Ouch. Yes! I'm concerned about CPU cycles! When you have
a 1000 calls coming in every minute, we can't afford to be recursively
searching for a pattern. In most cases with extensions, you know how
long the string will be; if you want only something ending with xyz,
then say .{7}xyz and be done with it.

Scoring: I could use a similar algorithm to what I'm using in the tree
algorithm, just fit in the extra patterns. In general, the more general
the pattern, the less "best" it is. So, a direct match char is best, the
bigger the char range, the lesser its score, etc. IIRC, 3072504105 would
beat [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] would beat
[0-9]{10} would beat [0-9]+

So, the question devolves to: what features of regex would you really
like to have, given the above? What have I thrown out that you would
really like?

This stuff, added to the tree matcher I have, will mercilessly
complicate it, but I think I could make it work.

murf

> 
> John
> 
> On Wed, 2007-08-01 at 17:03 -0600, Steve Murphy wrote:
> > On Wed, 2007-08-01 at 15:15 -0400, Clod Patry wrote:
> > > you probably are searching for RFC 2705 ?
> > > 
> > > That would be great if Asterisk could support that kind of
> > > digitmaps/patterns.
> > > 
> > 
> > I've got this rfc in front of me now... it's the mgcp spec.
> > 
> > DigitMap = DigitString  / "(" DigitStringList ")"
> > DigitStringList = DigitString 0*( "|" DigitString )
> > DigitString = 1*(DigitStringElement)
> > DigitStringElement = DigitPosition ["."]
> > DigitPosition = DigitMapLetter / DigitMapRange
> > DigitMapLetter = DIGIT / "#" / "*" / "A" / "B" / "C" / "D" / "T"
> > DigitMapRange =  "x" / "[" 1*DigitLetter "]"
> > DigitLetter ::= *((DIGIT "-" DIGIT ) / DigitMapLetter)
> > 
> > Example:
> >      (0T| 00T|[1-7]xxx|8xxxxxxx|#xxxxxxx|*xx|91xxxxxxxxxx|9011x.T)
> > 
> > The use of '|' to "or" together multiple choices would just be shorthand
> > for more extensions. This would not affect my algorithm, but will
> > slowdown the original algorithm, by adding another choice to the list,
> > to test.
> > 
> > It's just shorthand for defining several extensions with the same
> > contents.
> > 
> > The use of the Timeout (T) stuff-- uh, I'd rather not think about that
> > kinda thing.
> > 
> > The mgcp spec doesn't really cover the full alphanumeric range that
> > you'd like to cover (Jared's Wish).
> > 
> > The trouble with doing full-alphanumeric pattern matching, is that we
> > already use X and Z for matching digits.... otherwise, we could allow
> > stuff like _XerceZ[a-zA-Z0-9]. Maybe we could trade X for @ and Z for %
> > or somesuch and get back those letters! And the - in character groups --
> > we need to allow similar syntax like _XerceZ[-a-ZA-Z0-9], to allow
> > dashes.
> > 
> > Next,  variable length stuff. We could use some sort of notation to
> > allow wildcarding any length of string if we have a clear idea of what
> > ends it.
> > The example of _0X.#, where any number of non-# chars followed by #
> > could be workable in both the old and new pattern matchers, without
> > resorting to regex's.
> > Even _0X.#1X could be possible. If one char can come after a variable
> > len 
> > string, then why not more?
> > 
> > Let's see... (123)* means 0 or more 123 patterns. This would be
> > difficult, as it forms loops in the matcher. Again, a matter for a
> > state-machine matcher. Same 
> > occurs with (123)+, 1 or more 123 patterns.
> > 
> > Repetition notation like (123){0,5} (0 to 5 repetition of 123) explode
> > out to
> > a choice of fixed patterns in the matcher: this would be equiv to 
> > nothing | 123 | 123123 | 123123123 | 123123123123 | 123123123123123 
> > 
> > The current algorithm would have to be re-engineered for some of this,
> > and so would the speedup algorithm, but at least some stuff could be
> > added that wouldn't require resorting to regex packages.
> > 
> > murf
> > 
> > 
> > > 
> > > On 8/1/07, Jared Smith <jsmith at digium.com> wrote:
> > >         On Wed, 2007-08-01 at 11:44 -0500, John Lange wrote: 
> > >         > Searching the archives I can see that better pattern
> > >         matching has been
> > >         > discussed a number of times.
> > >         >
> > >         > There have even been patches submitted and discussions about
> > >         how there
> > >         > are regexp libraries available under BSD style license that
> > >         could be 
> > >         > incorporated into Asterisk.
> > >         >
> > >         > Yet, nothing ever makes it into the released code.
> > >         
> > >         Yes, I've been one of the people who have been more vocal than
> > >         most
> > >         about this.  I really think it's a good idea to be a little
> > >         more 
> > >         flexible in our pattern matching ability.  That being said, I
> > >         can see
> > >         where the Asterisk developers are coming from too -- they're
> > >         worried
> > >         that this will severely impact the time it takes for Asterisk
> > >         to do
> > >         pattern matching.  The current pattern matching syntax works
> > >         fairly well
> > >         for numeric extensions, but is pretty difficult work with if
> > >         you trying
> > >         to do pattern matching on names or alphanumeric
> > >         extensions.  For
> > >         example, try coming up with a pattern match that'll match the
> > >         name Nancy 
> > >         (either upper or lower case) followed by two digits.  It
> > >         becomes even
> > >         more difficult when dealing with letters that fall outside the
> > >         English
> > >         alphabet.
> > >         
> > >         While I personally think regular expressions are a pretty cool
> > >         idea, I 
> > >         know not everyone is sold on them.  If I remember correctly,
> > >         there's
> > >         even an RFC with a very complete pattern matching syntax
> > >         defined (which
> > >         I can't find right now, of course).
> > >         
> > >         Hopefully, one of the Asterisk developers will stumble across
> > >         this post 
> > >         and take pity on us and help us out!
> > >         
> > >         
> > >         --
> > >         Jared Smith
> > >         Community Relations Manager
> > >         Digium, Inc.
> > >         
> > >  
-- 
Steve Murphy
Software Developer
Digium
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3227 bytes
Desc: not available
Url : http://lists.digium.com/pipermail/asterisk-dev/attachments/20070802/d07ca379/attachment-0001.bin