Dec 30, 2013

FQL: How to retrieve "furigana" for Japanese user name

Japanese Writing System

The Japanese language is unique and its origin is still debated. Its writing system is also distinctive in having three scripts:
  • kanji -- ideographic characters borrowed from Chinese
  • hiragana -- phonogramic characters, originally a simplified form of kanji 
  • katakana -- phonogramic characters, originally derived from components of kanji 

Problem with Kanji

Kanji consists of more than 2,000 commonly used characters, while hiragana and katakana each consist of only 46. This large number of characters makes kanji troublesome for most people.
One more tough thing about Japanese kanji is that each character has two or more pronunciations: the Chinese-derived pronunciation -- on-yomi (音読み) -- and one or more native Japanese pronunciations -- kun-yomi (訓読み). So the problem is that when kanji characters are combined and used in a person's name, we can't really tell which pronunciation to use.

Hiragana and Katakana as Reading Aid

With experience, people can tell how to pronounce commonly used person names, but determining each pronunciation programmatically requires a large dictionary and seems almost impossible. In most cases, since hiragana and katakana are phonograms, we use them to indicate the pronunciation. When hiragana or katakana is explicitly used for this purpose, it is called furigana (フリガナ). So most user registration systems require users to input furigana along with their original kanji names.

Retrieving Furigana with FQL

In 2012, I implemented Facebook social login for my service and found it difficult to retrieve the furigana of a registering user's name. It was really frustrating. Using social login should simplify both the user experience and the source code, but if we can't retrieve furigana, we have to make users input it manually, which I think ruins the user experience in the first place. I went through the whole Facebook Graph API and FQL documentation and finally found a way to do it.
There are columns called sort_first_name and sort_last_name on the user table, and these columns return furigana for the first name and last name. The minimum query is as below:
SELECT first_name, sort_first_name, last_name, sort_last_name,name FROM user WHERE uid = 4400758
Depending on your Facebook app settings, you may have to add the locale=ja_JP parameter to the request.
curl -X GET 'https://graph.facebook.com/fql?q=SELECT+first_name%2C+sort_first_name%2C+last_name%2C+sort_last_name%2Cname+FROM+user+WHERE+uid+%3D+44007581&locale=ja_JP'
Be careful when the user hasn't entered a Japanese name: Japanese users who registered back when Facebook was only available in English may not have registered kanji or furigana, and in that case the sort_*_name columns return Latin characters.
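If you need to detect that case, a rough check like the following might work -- just a sketch, where $res stands for the decoded FQL response and furigana, when present, is assumed to come back as katakana:

my $sort_last_name = $res->{data}[0]{sort_last_name} // '';
unless ($sort_last_name =~ /\p{Katakana}/) {
    # no katakana found -- probably a Latin-only profile, so ask the user for furigana
}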

Below is the code I used.
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Facebook::OpenGraph;
use Data::Dumper;
use Data::Recursive::Encode;

my $fb  = Facebook::OpenGraph->new;
my $ret = $fb->fql('SELECT first_name, sort_first_name, last_name, sort_last_name,name FROM user WHERE uid = 44007581');
$ret = Data::Recursive::Encode->encode_utf8($ret || +{});
warn Dumper $ret;
#$VAR1 = {
#          'data' => [
#                      {
#                        'sort_first_name' => 'ゴウ',
#                        'name' => '萩原 豪',
#                        'first_name' => '豪',
#                        'last_name' => '萩原',
#                        'sort_last_name' => 'ハギワラ'
#                      }
#                    ]
#        };

Dec 26, 2013

Teng::Row and data2row

EDIT: 2014-03-21
Added a follow-up article, Ideas on utilizing Teng#new_row_from_hash

Premise

Lately I switched my O/R mapper from Data::ObjectDriver to Teng, and I really enjoy its lightness.
Some O/R mappers provide so many functionalities that they are heavy and their source code is hard to trace. DOD's transparent caching was tricky, for instance. And despite being functionality-rich, when we try to write our own SQL statements we have to take the DB handle out of them and do the rest by ourselves; then we can't make use of the O/R mapper's result objects anymore.
Teng, on the other hand, works nicely as both an O/R mapper and a DBI wrapper. execute() wraps DBI and handles the creation of the statement handle and its execution. By the way, dbi(), which is called in execute(), takes care of the process-id check and reconnection, so it's fork-safe. Other O/R mapper-ish methods for CRUD, including search_by_sql(), call execute(), and my favorite part is that search_by_sql() still creates a Teng::Iterator object, so we can use Teng::Row's functionalities with our own SQL statements.
my $itr = $teng->search_by_sql(q{
    SELECT service.*
    FROM service
    LEFT JOIN service_ranking
    ON service.id = service_ranking.service_id
    ORDER BY service_ranking.rank IS NULL, service_ranking.rank ASC
}, [], 'service');
# $itr is-a Teng::Iterator
When search_by_sql() is called in list context, it returns $itr->all(), so you get an array of MyApp::DB::Row::Service objects. It's really handy that you can execute a complex statement and still make use of Teng::Row. It becomes even more powerful when you extend Teng::Row with MyApp::DB::Row::*. $teng->search_by_sql() will detect the table name from the statement and create the corresponding table row objects. Or you can explicitly set the table name via the third argument, which is 'service' in the previous example.
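For instance, a minimal sketch of the list-context behaviour (the statement here is just an illustration):

my @services = $teng->search_by_sql(q{
    SELECT * FROM service ORDER BY id ASC
}, [], 'service');
# @services holds MyApp::DB::Row::Service objects, the same as $itr->all()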

My Problem and Workaround

Sometimes, when retrieving column values from two or more tables, I get confused about which table name I should set as the third argument. Since the retrieved values don't represent any particular object or table, it's natural that there is no corresponding table row class. In such cases, I called $teng->execute() and tried to create table row objects for each table. The code was something like below.
my $sth = $teng->execute(q{
    SELECT user.id AS user_id, user.name, user_ranking.rank, user_ranking.fluctuation
    FROM user
    LEFT JOIN user_ranking
    ON user.id = user_ranking.user_id
    ORDER BY user_ranking.rank IS NULL, user_ranking.rank ASC
}, []);

my $user_row_class = $teng->schema->get_table('user')->{row_class};
my $ranking_row_class = $teng->schema->get_table('user_ranking')->{row_class};
my @ranked_users;
while (my $hashref = $sth->fetchrow_hashref) {
    my $user_row = $user_row_class->new(+{
        row_data => +{
            id => $hashref->{user_id},
            name => $hashref->{name},
        },
        table_name => 'user',
        teng => $teng,
    });

    my $ranking = $ranking_row_class->new(+{
        row_data => +{
            user_id => $hashref->{user_id},
            rank => $hashref->{rank},
            fluctuation => $hashref->{fluctuation},
        },
        table_name => 'user_ranking',
        teng => $teng,
    });

    # do something with MyApp::DB::Row::User and MyApp::DB::Row::UserRanking objects
    push @ranked_users, +{
        name => $user_row->name,
        rank => $ranking->rank,
        fluctuation => $ranking->fluctuation_str, # fluctuation_str() returns stringified fluctuation: UP, STAY and DOWN 
    };
}

Solution

The previous code worked O.K. for me, but the process was somewhat complex and I didn't really know which arguments I should specify for $row_class->new(). So I looked for a workaround, and what I found was this. $teng->data2row() allows me to do what I did above and, additionally, it sets a dummy query as $row->{sql}, so it should be easier to debug, especially when get_column() doesn't work and the message "Specified column 'some_nongiven_column' not found in row (query: ..... )" is given via croak(). This method, however, has not been merged into the master branch yet.
I'd be more than happy to have this method in the near future.

EDIT: 2014-03-06

This method has now been merged into the master branch as Teng#new_row_from_hash.
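With it, the earlier loop could look roughly like the following -- a minimal sketch, assuming new_row_from_hash takes ($table_name, \%row_data) and builds the row object the same way the workaround above did:

while (my $hashref = $sth->fetchrow_hashref) {
    my $user_row = $teng->new_row_from_hash(user => +{
        id   => $hashref->{user_id},
        name => $hashref->{name},
    });
    my $ranking = $teng->new_row_from_hash(user_ranking => +{
        user_id     => $hashref->{user_id},
        rank        => $hashref->{rank},
        fluctuation => $hashref->{fluctuation},
    });
    # use the MyApp::DB::Row::* objects just like before
}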

Dec 15, 2013

How Furl handles userinfo part of URI

I didn't know until several days ago that Furl handles the userinfo part of a URI. Actually, I had never really had to care about it because a) I used Data::Validate::URI for URI validation and it doesn't allow the username:password@sample.com format, so userinfo couldn't be any part of my product, and b) RFC 3986 clearly discourages the use of username:password in a URI string.
7.5. Sensitive Information
URI producers should not provide a URI that contains a username or password that is intended to be secret. URIs are frequently displayed by browsers, stored in clear text bookmarks, and logged by user agent history and intermediary applications (proxies). A password appearing within the userinfo component is deprecated and should be considered an error (or simply ignored) except in those rare cases where the 'password' parameter is intended to be public.
RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
I was reading Furl::HTTP's code and accidentally found its implementation. It parses the input URI in _parse_url() and, if a username is present, it properly sets the username:password combo in the Authorization header.
if (defined $username) {
     _requires('MIME/Base64.pm', 'Basic auth');
     push @headers, 'Authorization', 'Basic ' . MIME::Base64::encode_base64("${username}:${password}");
}
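So a URI that contains userinfo can be passed to Furl as is. Here is a minimal sketch (the host and credentials are hypothetical):

use Furl;

my $furl = Furl->new;
# the user:secret part becomes a Basic Authorization header, as shown above
my $res  = $furl->get('http://user:secret@api.example.com/resource');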
Then I became curious whether LWP::UserAgent allows the use of userinfo, and I found that it does, too.

Dec 14, 2013

Getting the most out of Data::FormValidator



Introduction

This is my validator. There are many like it, but this one is mine. My validator is my best friend. It is my life. I must master it as I must master my life. marinized side of Oklahomer
Sometimes, when taking over a project, you find insufficient documentation, an ambiguous table schema and loose validation. Suppose there is a table column named "properties_json" and the validation is so loose that you can't tell what values go into it. You probably look up the project wiki, but it's outdated or, even worse, there is no entry about it. Now things are pretty tough for you.
Sometimes I've done this to someone, and sometimes someone has done this to me. From these experiences I acquired some habits:
Tighter validation
makes it clear what values come in and out. If it is tight enough, it works as living documentation that declares the detailed specs even when the wiki or other documents are outdated.
Detailed documentation
POD, wiki, README.mkdn or any other kind. If appropriate, parse *.mkdn and POD and display them on admin pages so everybody sees them and keeps them updated.
More full-line/inline comments
including links to the wiki, quotes from documents and regular comments

This article covers how I use Data::FormValidator to achieve the first habit above.

E pluribus unum / Out of Many

First things first: why Data::FormValidator? For form and/or data validation, there are many modules, including FormValidator::Simple, FormValidator::Lite, Smart::Args and Data::Validator, to name a few. I'm not quite sure which validator is the best, so I benchmarked some form validator modules. The result is as follows:

#!/usr/bin/env perl
use strict;
use warnings;
use Modern::Perl;
use FormValidator::Lite qw/Email/;
use Data::FormValidator;
use Data::FormValidator::Constraints qw/:closures/;
use FormValidator::Simple;
use Benchmark qw/:all/;
use CGI;

say "Perl: $]";
say "FormValidator::Simple: $FormValidator::Simple::VERSION";
say "FormValidator::Lite: $FormValidator::Lite::VERSION";
say "Data::FormValidator: $Data::FormValidator::VERSION";

#Perl: 5.018001
#FormValidator::Simple: 0.29
#FormValidator::Lite: 0.37
#Data::FormValidator: 4.81

#                        Rate FormValidator::Simple Data::FormValidator FormValidator::Lite
#FormValidator::Simple 1034/s                    --                -20%                -66%
#Data::FormValidator   1294/s                   25%                  --                -57%
#FormValidator::Lite   3030/s                  193%                134%                  --

my $q = CGI->new;
$q->param(name => 'oklahomer');
$q->param(mail1 => 'sample@sample.com');
$q->param(mail2 => 'sample@sample.com');

cmpthese(
    10000,
    +{
        "FormValidator::Simple" => sub {
            my $res = FormValidator::Simple->check($q => [
                name  => ['NOT_BLANK', [ 'LENGTH', 5, 10 ]],
                mail1 => [qw/NOT_BLANK EMAIL_LOOSE/],
                mail2 => [qw/NOT_BLANK EMAIL_LOOSE/],
                +{mails => [qw/mail1 mail2/]} => [qw/DUPLICATION/],
            ]);
        },
        "FormValidator::Lite" => sub {
            my $res = FormValidator::Lite->new($q)->check(
                name  => ['NOT_NULL', [ 'LENGTH', 5, 10 ]],
                mail1 => [qw/NOT_NULL EMAIL_LOOSE/],
                mail2 => [qw/NOT_NULL EMAIL_LOOSE/],
                +{mails => [qw/mail1 mail2/]} => [qw/DUPLICATION/],
            );
        },
        "Data::FormValidator" => sub {
            my $res = Data::FormValidator->check($q, +{
                required => [qw/name mail1 mail2/],
                constraint_methods => +{
                    name  => FV_length_between(5, 10),
                    mail1 => email(),
                    mail2 => [email(), FV_eq_with('mail1')], # same idea as the DUPLICATION checks above
                },
            });
        },
    }
);

__END__

I was a bit surprised to find that FormValidator::Simple is the slowest. Among these three, Data::FormValidator is the oldest and FormValidator::Lite is the newest, so I assumed Data::FormValidator would be the slowest, but apparently it isn't. FormValidator::Lite seems pretty fast, but its version is still below 1.00 and it has some experimental methods; Data::FormValidator, on the other hand, has a longer history and is stable. Also, D::FV makes easy things easy and difficult things possible for me. Its user documentation and technical documentation are well written, too. And my product uses Data::FormValidator after all, so there I am.

Basics

Flow

When check() is called, it checks the validation profile and creates a Data::FormValidator::Results object via D::FV::Results#new. In this initialization, _process() is called, and that's where validation is done. In _process(), things are done in this order:

  1. filters are applied
    1. filters
    2. field_filters
    3. field_filter_regexp_map
  2. prepare required params
    1. required
    2. required_regexp
    3. require_some
  3. remove empty fields
  4. check dependencies
    1. dependencies
    2. dependency_groups
  5. add default values to unset fields
    1. defaults_regexp_map
    2. defaults
  6. check required
    1. required
    2. require_some
  7. check constraints
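To make the ordering concrete: because defaults are only applied to unset fields at step 5, an optional field that is missing from the input still comes back with its default value. A minimal sketch (the field names are made up):

use Data::FormValidator;

my $result = Data::FormValidator->check(+{
    page => 2,                   # "rows" is not given at all
}, +{
    required => [qw/page/],
    optional => [qw/rows/],
    defaults => +{ rows => 30 }, # applied at step 5 above
});
# $result->valid('rows') now returns 30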

Validation Profile

Declaration Order

To me, one of the most important things is that this profile tells us which values are required and which optional values can be given; in other words, it tells us what can not go through here. So I write my validation profile in this order:
  1. required, require_some, dependency_groups, dependencies ... these make clear which values are required.
  2. optional ... needless to say, this declares which fields are optional
  3. filters, field_filters ... state how each value should be altered
  4. defaults ... declares default values for optional fields
  5. constraint_methods ... the last line of defense
This way, the important things appear at the top. Required fields go first and optional ones come second, so you can see which field values are expected. Then the filter part follows and tells how to alter the input values. Default values come after the filters because, in _process(), filters are applied before default values are set, so it is easier to understand if filters and defaults are written in this order. Finally, constraint_methods comes last and checks everything.
It looks like something below.

use constant {
    FALSE => 0,
    TRUE  => 1,
};

Data::FormValidator->check($input, {
    required => [qw/user_id user_id_confirm/], # MUST be set
    require_some => {
        # At least 2 of these must be set
        city_or_state_or_zipcode => [ 2, qw/city state zipcode/ ],
    }, 
    dependency_groups => {
        # if one is given, then all are required
        basic_auth => [qw/realm username password/],
    },
    dependencies => {
        # if "oh please send me junk mails" is checked, email must be given
        send_dm => {
            TRUE() => [qw/email email_confirm/],
        }
    },
    optional => [qw/website hobbies/],
    filters => ['trim'],
    field_filters => {
        hobbies => [
            # comma separated hobbies to arrayref
            sub {
                [split q{,}, shift]
            },
            # trim each hobby value after split
            # BEWARE: "filters" is applied before field_filters is applied
            'trim'
        ]
    },
    defaults => {
        page => 1,  # page number
        rows => 30, # max rows per page
    },
    constraint_methods => {
    },
});

Avoid Using *_regexp_map

When declaring a profile, I don't really like to use *_regexp_map. As I described earlier, I expect the profile to make clear what values go in, so I avoid *_regexp_map and prefer to state each field explicitly.
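For comparison, here is a rough sketch of the two styles (the field names are hypothetical); the regexp_map version is shorter but hides which fields are actually affected:

    # what I avoid: which fields get trimmed is implicit
    field_filter_regexp_map => {
        qr/_name$/ => 'trim',
    },
    # what I prefer: each field is spelled out
    field_filters => {
        first_name => 'trim',
        last_name  => 'trim',
    },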

Define Constraint Methods in One Place

MyApp::Validator::Constraints (as in the sample below) or some other place that you can always refer to from each Model.
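Each model can then simply import them; for example (a sketch):

package MyApp::Model::Campaign;
use strict;
use warnings;
use MyApp::Validator::Constraints; # exports VALID_CAMPAIGN_TYPE and friends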

Use Constants in Regular Expressions to Make Rules Readable

As I described in 'How and when I use constants in regular expression,' I use constants in regular expressions to increase readability when I use a regular expression as a constraint method. I also explained that using constants can decrease performance, but validation rules are vital, so I don't think we should hesitate to make use of them.
A sample is as follows.

package MyApp::Constants;
use strict;
use warnings;
use utf8;
use parent 'Exporter';

our @EXPORT;

use Exporter::Constants (
    \@EXPORT => {
        CAMPAIGN_TYPE_HALLOWEEN    => 1,
        CAMPAIGN_TYPE_THANKSGIVING => 2,
        CAMPAIGN_TYPE_CHRISTMAS    => 3,
    }
);

1;

package MyApp::Validator::Constraints;
use strict;
use warnings;
use utf8;
use parent 'Exporter';
use Module::Functions;
use MyApp::Constants;

our @EXPORT = Module::Functions::get_public_functions();
    
sub VALID_CAMPAIGN_TYPE () {
    qr/\A
        (?:
            ${\( CAMPAIGN_TYPE_HALLOWEEN    )}
          | ${\( CAMPAIGN_TYPE_THANKSGIVING )}
          | ${\( CAMPAIGN_TYPE_CHRISTMAS    )}
        )
    \z/xo 
}

1;

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use lib 'lib';
use Data::FormValidator;
use MyApp::Validator::Constraints;

my $result = Data::FormValidator->check(+{
    campaign_type => 2,
}, +{
    required => [qw/campaign_type/],
    optional => [qw//],
    constraint_methods => +{
        campaign_type => VALID_CAMPAIGN_TYPE,
    },
});

use Data::Dumper;
warn Dumper scalar($result->valid);
#$VAR1 = {
#          'campaign_type' => 2
#        };

__END__

Advanced

The profile above is mostly self-explanatory, so I am going to describe some minor features I like to use.

Dealing with Empty Fields

With the default settings, when a user provides an empty string for an optional field, that field and its value are not accessible via $result->valid(). That's O.K. as long as we are working on data creation, but when it comes to data updates, this default behaviour is troublesome. If a user has already set "my_previous_nickname" as his nickname and decides to remove it, an empty string will be given on form submission and the nickname field is not accessible via $result->valid(). Then the code below wouldn't update the nickname column.

my $result = Data::FormValidator->check(+{
    name     => 'Oklahomer',
    nickname => '',
    email    => 'nickname.is.empty@sample.com',
}, +{
    required               => [qw/name email/],
    optional               => [qw/nickname/],
});
my $valid = $result->valid;
#$VAR1 = {
#          'email' => 'nickname.is.empty@sample.com',
#          'name' => 'Oklahomer'
#        };
$teng->update(user => $valid); # nickname is not set so this field stays as is.
The solution is pretty easy. By setting missing_optional_valid => 1 in the profile declaration, empty optional fields become accessible, as in the example below.

#!/usr/bin/env perl
use strict;
use warnings;
use Data::FormValidator;
 
my $result = Data::FormValidator->check(+{
    name     => 'Oklahomer',
    nickname => '',
    email    => 'nickname.is.empty@sample.com',
}, +{
    required               => [qw/name email/],
    optional               => [qw/nickname/],
    missing_optional_valid => 1,
});
use Data::Dumper;
warn Dumper scalar $result->valid;
#$VAR1 = {
#          'nickname' => undef, # undef is returned now
#          'email' => 'nickname.is.empty@sample.com',
#          'name' => 'Oklahomer'
#        };
 
1;

Dealing with Array Elements

Checking Each Element of Array

When an array reference is set as a field value, D::FV applies the constraint method to each element.
So when a user can input comma-separated hobbies as part of their profile, I use field_filters and this feature together. In this case I use the plural form as the key name, so it seems appropriate when $result->valid->{hobbies} becomes an array reference.

my $result = Data::FormValidator->check($input, {
    required => [qw/hobbies/],
    field_filters => {
        hobbies => [
            # comma separated hobbies to arrayref
            sub {
                [split q{,}, shift]
            },
            # trim each hobby value after split
            # BEWARE: "filters" is applied before field_filters is applied
            'trim'
        ]
    },
    constraint_methods => {
        hobbies => ALLOWED_HOBBY_RE, # checks each hobby
    },
}); 

Checking the Number of Elements

If you are using D::FV version 4.80 or later, the FV_num_values and FV_num_values_between constraints, which count the number of list elements, are officially supported.

    constraint_methods => {
        hobbies => [
            FV_num_values_between(1, 5), # user can set 1-5 hobbies
            ALLOWED_HOBBY_RE,            # f-words are denied here
        ],
    },
Of course, these constraint methods are called as many times as there are elements, because constraint methods are applied to each element. It may not be cool, but that is how constraint methods are applied in D::FV.

Validating Based on Multiple Fields

My favorite feature of D::FV is that we can validate a field based on multiple field values -- if the country code is XX then the phone number should be N digits, and if the country code is YY then the phone number should be M digits.

    constraint_methods => +{
        country_code => VALID_COUNTRY_CODE,
        phone_number => +{
            constraint_method => sub {
                 my ($dfv, $country_code, $phone_number) = @_;
                 # check phone_number and country_code combo.
            },
            params => [qw/country_code phone_number/],
        }
    }
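Here is a self-contained sketch of the same idea; the country codes and digit rules below are made up for illustration:

use Data::FormValidator;

my $result = Data::FormValidator->check(+{
    country_code => '81',
    phone_number => '0312345678',
}, +{
    required => [qw/country_code phone_number/],
    constraint_methods => +{
        phone_number => +{
            constraint_method => sub {
                my ($dfv, $country_code, $phone_number) = @_;
                if ($country_code eq '81') {  # hypothetical rule
                    return $phone_number =~ /\A0\d{9,10}\z/ ? 1 : 0;
                }
                if ($country_code eq '1') {   # hypothetical rule
                    return $phone_number =~ /\A\d{10}\z/ ? 1 : 0;
                }
                return 0;
            },
            params => [qw/country_code phone_number/],
        },
    },
});
# $result->valid('phone_number') is set only when the combo passes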

Conclusion

That's how I use D::FV. Declaring the profile correctly makes things easy, and with appropriate use of filters/field_filters, the result is not only the validated input but is also provided in a form that we can easily handle in later code.
My validator and I are the defenders of my data. We are the masters of our data. We are the saviors of my data. So be it, until victory is ours and there is no enemy. marinized side of Oklahomer

Nov 17, 2013

How and when I use constants in regular expression

Premise

In my product, some commonly used values are declared in MyAPP::Constants. One benefit of declaring constants is that, by giving names to ambiguous values, we get better readability. A sample goes as follows:
package MyAPP::Constants;
use strict;
use warnings;
use utf8;
use parent qw/Exporter/;

our @EXPORT;
use Exporter::Constants (
    \@EXPORT => {
        # campaign types to be stored in DB
        CAMPAIGN_TYPE_HALLOWEEN    => 1,
        CAMPAIGN_TYPE_THANKSGIVING => 2,
        CAMPAIGN_TYPE_CHRISTMAS    => 3,
    },
);
For validation I use Data::FormValidator and validation rules are defined in MyAPP::Validator::Constraints like below.
package MyAPP::Validator::Constraints;
use strict;
use warnings;
use utf8;

use parent 'Exporter';
use Module::Functions;
use MyAPP::Constants;

# all public methods are exported
our @EXPORT = Module::Functions::get_public_functions();

sub VALID_BOOL () { qr/\A (0|1) \z/x }
sub VALID_CAMPAIGN_TYPE () { qr/\A [123] \z/x } # see MyAPP::Constants for CAMPAIGN_TYPE_*

How and when to use constants in regular expression

Despite the fact that using constants increases readability, the regular expression in VALID_CAMPAIGN_TYPE obviously doesn't benefit from it. In this case, I believe, we should use the constants inside the regular expression to avoid this chaos, but how?
I used the dereference-a-reference trick to do this.
sub VALID_CAMPAIGN_TYPE () {
    qr/\A
        (?:
            ${\( CAMPAIGN_TYPE_HALLOWEEN    )}
          | ${\( CAMPAIGN_TYPE_THANKSGIVING )}
          | ${\( CAMPAIGN_TYPE_CHRISTMAS    )}
        )
    \z/x
}  
The syntax is a bit tricky at first glance, but now you can tell what values can be set. The most important thing is that if I read this piece of code three months later, it will still make sense. My poor co-workers and I don't have to go like... well, what values can/should go here? What does qr/\A[123]\z/ mean? I found a link to the wiki, but it seems outdated and I'm not sure what goes here...

Benchmark

I used the code below to measure its performance.
#! /usr/bin/env perl
use strict;
use warnings;
use Benchmark qw/:all/;

use constant +{
    STR => 'foo',
    INT => 123,
};

my $input = 'foo';
cmpthese(
    5000000,
    +{
        'plain' => sub {
            $input =~ qr/\A (?: foo ) \z/x;
        },
        'const' => sub {
            $input =~ qr/\A (?: ${\(STR)} ) \z/x;
        },
        'plain_o_modifier' => sub {
            $input =~ qr/\A (?: foo ) \z/xo;
        },
        'const_o_modifier' => sub {
            $input =~ qr/\A (?: ${\(STR)} ) \z/xo;
        },
    }
);

__END__
The result is as expected. Plain regular expressions, with or without the /o modifier, are the fastest, followed by the regexp that uses a constant with the /o modifier. The one that uses a constant without the /o modifier shows a significant drop in performance.
[oklahomer]~% perl benchmark.pl 
                     Rate      const const_o_modifier plain_o_modifier     plain
const            598086/s         --             -17%             -20%      -21%
const_o_modifier 721501/s        21%               --              -4%       -4%
plain_o_modifier 748503/s        25%               4%               --       -1%
plain            753012/s        26%               4%               1%        --
So my conclusion goes:
  • use the /o modifier if possible
  • even with the /o modifier, the actual benefit depends on the process's life cycle because of how /o works, so...
    • see whether readability is more important
    • or whether performance has higher priority
Well... I think it performs well enough, so I use constants anyway.