Jan 21, 2014

Convert English Sentence to Pig Latin with Perl

Every once in a while I hear characters speak Pig Latin in movies. I repeat the line many times in my head and try to work out the original, but it takes time because English is my second language and I'm not used to the transformation. So I came up with this idea: look up Wikipedia and related linguistics articles, then write a Perl module that pig-latinizes the given sentence(s). That should help me understand the rules.
Before launching vim, I searched CPAN for a similar module, and it didn't take long to find what I wanted. Lingua::PigLatin converts a given sentence to Pig Latin with the simple regular expression below.
    s/\b(qu|[cgpstw]h # First syllable, including digraphs
    |[^\W0-9_aeiou])  # Unless it begins with a vowel or number
    ?([a-z]+)/        # Store the rest of the word in a variable
    $1?"$2$1ay"       # move the first syllable and add -ay
    :"$2way"          # unless it should get -way instead 
    /iegx; 
Since my goal was to understand the rules of this game through coding, I read through what this regular expression does. I'm not a regular expression expert, and it was a bit hard to take in all at once, so with the help of Perl Best Practices I restyled it for readability, as below.
    s{\b                   # See if each given word starts with...
        (   qu             # 1. qu (e.g. question => estionquay)
          | [cgpstw]h      # 2. a digraph (ch, gh, ph, sh, th, wh)
          | [^\W0-9_aeiou] # 3. any "word" character other than 0-9, _ and vowels
        )?
        (                  # and is followed by...
            [a-z]+         # one or more letters
        )
    }
    {
        $1       # if the first rule applies
      ? "$2$1ay" # then move the leading part to the end and add -ay,
      : "$2way"; # otherwise just add -way
    }iegx;
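A quick way to check that restyling did not change behavior is to run both forms on the same input. The sketch below assumes two wrapper subs, compact_form and expanded_form, which are hypothetical names of my own and not part of Lingua::PigLatin; the sample sentence is also made up.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the compact form of the expression
# (same pattern, comments and whitespace removed).
sub compact_form {
    my ($text) = @_;
    $text =~ s/\b(qu|[cgpstw]h|[^\W0-9_aeiou])?([a-z]+)/$1 ? "$2$1ay" : "$2way"/ieg;
    return $text;
}

# Hypothetical wrapper around the restyled form.
sub expanded_form {
    my ($text) = @_;
    $text =~ s{\b
        (   qu
          | [cgpstw]h
          | [^\W0-9_aeiou]
        )?
        ( [a-z]+ )
    }
    {
        $1 ? "$2$1ay" : "$2way"
    }iegx;
    return $text;
}

my $sentence = 'The quick brown fox ate 3 apples';

print compact_form($sentence),  "\n";  # eThay ickquay rownbay oxfay ateway 3 applesway
print expanded_form($sentence), "\n";  # eThay ickquay rownbay oxfay ateway 3 applesway
print compact_form($sentence) eq expanded_form($sentence) ? "same\n" : "different\n";
```

Note that "The" becomes "eThay": the match is case-insensitive, but the moved letters keep their original case.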
Now things are clearer. Here is what it does.
First, the \b at the very beginning of the expression anchors the match to the start of each word. It then checks whether the word begins with one of the three patterns below:
  1. "qu"
  2. a digraph such as ch, gh, ph, sh, th or wh, to capture words like channel, shell and what
  3. any single word character other than 0-9, _ and vowels (AEIOU), in practice a single consonant
Second, it checks that what follows is one or more letters.
If both the first and second steps apply, the leading part is moved to the end of the word and -ay is added. If only the second step applies, it just puts -way at the end. If neither applies, the word is left unchanged.
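The three outcomes can also be spelled out as explicit branches. This is only a sketch for illustration: pig_latin_word is a hypothetical helper, not how Lingua::PigLatin is written, and it assumes each argument is a single token that is either all ASCII letters or contains no letters at all, so that \A and \z can stand in for the \b and /g of the real expression.

```perl
use strict;
use warnings;

# Hypothetical helper: handles one token at a time.
# Assumption: the token is all ASCII letters, or has no letters at all.
sub pig_latin_word {
    my ($word) = @_;

    # First and second steps both apply: a leading part, then letters.
    if ($word =~ /\A(qu|[cgpstw]h|[^\W0-9_aeiou])([a-z]+)\z/i) {
        return "$2$1ay";
    }

    # Only the second step applies: letters with no leading part to move.
    if ($word =~ /\A([a-z]+)\z/i) {
        return "${1}way";
    }

    # Neither applies: leave the token unchanged.
    return $word;
}

for my $word (qw(question shell pig apple 42)) {
    printf "%-8s => %s\n", $word, pig_latin_word($word);
}
```

Here question, shell and pig take the first branch (estionquay, ellshay, igpay), apple takes the second (appleway), and 42 falls through unchanged.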

O.K., now I understand how Pig Latin works. But I have a new question: do Americans really do this in their heads while they are just saying whatever comes to mind? I can't believe it...