Nov 17, 2013

How and when I use constants in regular expressions

Premise

In my product, commonly used values are declared in MyAPP::Constants. One benefit of declaring constants is that giving names to otherwise ambiguous values improves readability. A sample follows:
package MyAPP::Constants;
use strict;
use warnings;
use utf8;
use parent qw/Exporter/;

our @EXPORT;
use Exporter::Constants (
    \@EXPORT => {
        # campaign types to be stored in DB
        CAMPAIGN_TYPE_HALLOWEEN    => 1,
        CAMPAIGN_TYPE_THANKSGIVING => 2,
        CAMPAIGN_TYPE_CHRISTMAS    => 3,
    },
);
For validation I use Data::FormValidator, and the validation rules are defined in MyAPP::Validator::Constraints as shown below.
package MyAPP::Validator::Constraints;
use strict;
use warnings;
use utf8;

use parent 'Exporter';
use Module::Functions;
use MyAPP::Constants;

# all public methods are exported
our @EXPORT = Module::Functions::get_public_functions();

sub VALID_BOOL () { qr/\A (0|1) \z/x }
sub VALID_CAMPAIGN_TYPE () { qr/\A [123] \z/x } # see MyAPP::Constants for CAMPAIGN_TYPE_*

How and when to use constants in regular expressions

Although constants improve readability, the regular expression in VALID_CAMPAIGN_TYPE obviously doesn't benefit from them: [123] says nothing about what each value means. In this case, I believe, we should use the constants inside the regular expression to avoid this confusion, but how?
I used the dereference-of-a-reference trick to do this.
sub VALID_CAMPAIGN_TYPE () {
    qr/\A
        (?:
            ${\( CAMPAIGN_TYPE_HALLOWEEN    )}
          | ${\( CAMPAIGN_TYPE_THANKSGIVING )}
          | ${\( CAMPAIGN_TYPE_CHRISTMAS    )}
        )
    \z/x
}
The syntax looks a bit tricky at first glance, but now you can tell which values are allowed. Most importantly, if I read this piece of code three months later, it will still make sense. Neither I nor my poor co-worker has to wonder: well, what values can or should go here? What does qr/\A[123]\z/ mean? I found a link to the wiki, but it seems outdated and I'm not sure what goes here...
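For anyone unfamiliar with the trick: a constant is really a subroutine, so its name alone is not interpolated into a pattern, while ${ ... } is. \( EXPR ) takes a reference to the constant's value and ${ } dereferences it on the spot. Below is a minimal standalone sketch of that idea. It assumes the core constant pragma as a stand-in for Exporter::Constants, and is_valid_campaign_type is a hypothetical helper made up for illustration, not part of MyAPP.

```perl
use strict;
use warnings;

# Stand-ins for MyAPP::Constants (assumption: same names and values).
use constant {
    CAMPAIGN_TYPE_HALLOWEEN    => 1,
    CAMPAIGN_TYPE_THANKSGIVING => 2,
    CAMPAIGN_TYPE_CHRISTMAS    => 3,
};

# \( ... ) takes a reference to the constant's value and ${ ... }
# dereferences it; that form is interpolated into the pattern.
my $valid_campaign_type = qr/\A
    (?:
        ${\( CAMPAIGN_TYPE_HALLOWEEN    )}
      | ${\( CAMPAIGN_TYPE_THANKSGIVING )}
      | ${\( CAMPAIGN_TYPE_CHRISTMAS    )}
    )
\z/x;

# Hypothetical helper, only for illustration.
sub is_valid_campaign_type {
    my ($value) = @_;
    return $value =~ $valid_campaign_type ? 1 : 0;
}

# Print the verdict for a few sample inputs.
printf "%s => %d\n", $_, is_valid_campaign_type($_) for qw/1 3 4 12/;
```

If a new campaign type is added later, the pattern only needs one more alternative that names the new constant, so the regular expression and the constant list stay readable side by side.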

Benchmark

I used the code below to measure its performance.
#! /usr/bin/env perl
use strict;
use warnings;
use Benchmark qw/:all/;

use constant +{
    STR => 'foo',
    INT => 123,
};

my $input = 'foo';
cmpthese(
    5000000,
    +{
        'plain' => sub {
            $input =~ qr/\A (?: foo ) \z/x;
        },
        'const' => sub {
            $input =~ qr/\A (?: ${\(STR)} ) \z/x;
        },
        'plain_o_modifier' => sub {
            $input =~ qr/\A (?: foo ) \z/xo;
        },
        'const_o_modifier' => sub {
            $input =~ qr/\A (?: ${\(STR)} ) \z/xo;
        },
    }
);

__END__
The result is as expected. The plain regular expressions, with or without the /o modifier, are the fastest, and the one that interpolates a constant with the /o modifier follows about 4% behind. The one that interpolates a constant without the /o modifier is clearly slower, about 21% behind the plain version.
[oklahomer]~% perl benchmark.pl 
                     Rate      const const_o_modifier plain_o_modifier     plain
const            598086/s         --             -17%             -20%      -21%
const_o_modifier 721501/s        21%               --              -4%       -4%
plain_o_modifier 748503/s        25%               4%               --       -1%
plain            753012/s        26%               4%               1%        --
So my conclusions are:
  • use the /o modifier if possible
  • even with the /o modifier, the gain depends on the process's life cycle, because /o compiles the pattern only once per process, so...
    • see if readability is more important
    • or performance has higher priority
Well... I think it performs well enough, so I use constants anyway.
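If recompiling the pattern on every call is the main worry, one possible alternative to the /o modifier is to build the qr// object once and store it in a constant, so the interpolation happens a single time while the file is being compiled. The following is only a sketch under that assumption, again using the core constant pragma in place of Exporter::Constants. It was not part of the benchmark above, so it would need its own measurement.

```perl
use strict;
use warnings;

# Stand-ins for MyAPP::Constants (assumption: same names and values).
use constant {
    CAMPAIGN_TYPE_HALLOWEEN    => 1,
    CAMPAIGN_TYPE_THANKSGIVING => 2,
    CAMPAIGN_TYPE_CHRISTMAS    => 3,
};

# The right-hand side is evaluated once, when this "use" statement is
# compiled, and the resulting regexp object is stored in the constant.
use constant VALID_CAMPAIGN_TYPE => qr/\A
    (?:
        ${\( CAMPAIGN_TYPE_HALLOWEEN    )}
      | ${\( CAMPAIGN_TYPE_THANKSGIVING )}
      | ${\( CAMPAIGN_TYPE_CHRISTMAS    )}
    )
\z/x;

# Report whether each sample value passes the validation pattern.
for my $value (qw/2 5/) {
    my $verdict = ($value =~ VALID_CAMPAIGN_TYPE) ? 'valid' : 'invalid';
    print "$value is $verdict\n";
}
```

The call sites look the same as with the sub VALID_CAMPAIGN_TYPE () { ... } form, so switching between the two should not require changes elsewhere.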