[asterisk-dev] Better pattern matching

Thu Aug 2 11:55:10 CDT 2007

On Thu, 2007-08-02 at 07:13 -0600, Steve Murphy wrote:

> I like the idea of using different notations to allow different
> pattern matching algorithms.

This seem like the way to go. BTW, reading forward in the thread:

Leif Madsen wrote:
> 1) If we add new pattern matching functionality, it'd be a good idea
> to have an option of turning it on and off.

Using a different notation negates the need for an option to turn it off
and on.

But most importantly, it ensures 100% backwards compatibility.

> But, if we want Asterisk to step up into a higher volume use arena,
> we absolutely have to re-do the extension pattern matching algorithm

>  until we re-do realtime, we have to keep the current algorithm around.

> To avoid changing the current pattern syntax, we could use your suggestion
> for /pattern/ type notation. That way, we get the best of both worlds,
> some more powerful pattern matching, without sacrificing the speed of a
> simpler tree-based pattern representation. 

(Please forgive my merciless paraphrasing of your comments). I agree;
Implementing your new system (with some regex-inspired enhancements)
using a new notation would seem to be the best of both worlds.

As opposed to enhancing the existing system which would make many people
very nervous or implementing the full regexp library which has other
problems as I discuss further down.

> Here's what I THINK I could provide in the way of a regex-subset, and
> still use the tree algorithm to speed up the match:

There is a lot to digest here but here is my first pass at it. I reserve
the right to completely change my mind on some of these as I give it
more thought.

> First of all, regexps implicitly begin with ^ (start of line), and end
> with $ (end of line). I don't picture just looking for patterns
> somewhere in the string.

I'm a bit confused by that statement. Are you saying that the
expression:

exten => /1204/,1, ...

would be implicitly converted to

exten => /^1204$/,1, ...

? The implicit "^" might make sense, but not the "$". I don't much like
the idea of the implicit binding of the patterns to the beginning and/or
end of the match so perhaps you could elaborate why you think that's
good?

> [xyz] character class (simple)  (also called "bracket expression")
> [a-zA-Z] character class (range(s))
> [^xyz]   anti-char class
> [-xyz] include the - in the class
> [^-xyz] anything but the -,x,y,z

All good.

> (I'm not wild about providing []], [^]],

>  [:whatever:],  [=x=], or [.xyz.]

Agreed. Those are the Perl and POSIX character classes. I don't see a
reason for any of those. I don't even know what [=x=] does! In any case
almost none of them (other than [:digit:] ) make sense in a dialplan.

> I could allow the use of *, +, ? for single chars and be OK.
> a*        0 or more a's
> a+        1 or more a's
> a?        0 or 1 a's.
> [0-9]+    1 or more digits.
> [^0-9]*   0 or more non-digits

> The OR-notation could be acceptable:
> nancy|jim|james  will match "nancy", "jim", or "james".

> . would be ok to mean any one char; it'd stand for a char class with
> everything in it.

All good.

> We could use {3,}, {,3}, {3}, and {3,5} notation also; when we iterate
> on a repetition, we can keep counts and terminate the match the moment
> the count is violated.

Also good. Quite essential actually.

> Paren grouping is a bit problematic. I'd prefer to not allow parens in
> the regex.

There is no reason to have grouping since you aren't returning matches
to variables, your just matching dial strings.

However, some people use them for the OR-notation but square brackets do
the same thing:

e.g. /1[306|204]555[0-9]{4}/

This should match: 1204555XXXX or 1306555XXXX.

> Trailing context: stuff like .+ABC  or .+X[yz]+ probably won't work like
> you hoped. If you are looking for something that ends in ABC, hopefully,
> the char in front of that is something you can match against, or A won't
> appear anywhere else, so you can say [^A]+ABC.  Trying to do trailing
> context is kinda expensive,

> you need to recursively call the pattern matcher on the remaining part
> of the string. Ouch. Yes! I'm concerned about CPU cycles! When you have
> a 1000 calls coming in every minute, we can't afford to be recursively
> searching for a pattern. In most cases with extensions, you know how
> long the string will be; if you want only something ending with xyz,
> then say .{7}xyz and be done with it.

I actually have a real-world example that inspired this entire thread
which needs this.

Given the scenario where you need a caller to enter an unknown number of
digits terminated by the "#" key, you need trailing matching and of
string binding. e.g.

/[0-9]*#$/

On the general topic of expensive queries; Just like programming regular
expressions in any language, I think it should be the responsibility of
the dialplan programmer to optimize their queries.

It seems to me that you shouldn't remove potentially powerful
functionality because it might be CPU intensive if used. Rather it
should simply be documented as to how to optimize. For one thing, the
vast majority of Asterisk users are not putting thousands of calls
through their system so even the most expensive matching would have zero
impact.

> Scoring: I could use a similar algorithm to what I'm using in the tree
> algorithm, just fit in the extra patterns. In general, the more general
> the pattern, the less "best" it is. So, a direct match char is best, the
> bigger the char range, the lesser its score, etc. IIRC, 3072504105 would
> beat [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] would beat
> [0-9]{10} would beat [0-9]+

In general I agree with what your saying but I don't know if I agree
with the specific example. In what way is:

[0-9][0-9][0-9]

more specific than:

[0-9]{3}

? Those are exactly equivalent in my books. Actually I'm afraid I don't
really understand why you need scoring. The extension should match a
pattern only once it becomes completely unambiguous. For example:

(a) exten => /[0-9]{3}/
(b) exten => /1[0-9]{2}/

Should NOT be a valid dialplan since the both expressions would match
"1204". I'm assuming in your scoring system, the "b" expression would
score higher than "a" for any 1XXX expression ? I don't like this. Again
I think it should be up to the dialplan designer to do the job right. In
this case I believe it would be:

(a) exten => /[02-9]{3}/
(b) exten => /1[0-9]{2}/

> So, the question devolves to: what features of regex would you really
> like to have, given the above? What have I thrown out that you would
> really like?

After revisiting regex and all their power I am in agreement that
implementing the full library doesn't make much sense. There is simply
way to much of the regex stuff that doesn't have any use in a dial plan.
Things like white space handling, line breaks etc.

One thing I would definitely like to see in is the case-insensitive
flag. As in Perl; e.g.

exten => /nancy/i,1, ...

And perhaps \d for digit and \D for non-digit? It's a bit faster to type
than [0-9] every time.

> This stuff, added to the tree matcher I have, will mercilessly
> complicate it, but I think I could make it work.

Tree Optimized and Enhanced pattern matching.

a.k.a. TOE Pattern Matching, or the TOE project. ;)

John Lange