Parsing domain names out of emails

January 8, 2013

I’ve just written a Perl module for CPAN called Net::Domain::Regex. It parses free-form text and returns a list of domain name matches, each broken down into its components. For example, this script extracts the domains from a saved email:

#!/usr/bin/perl -w
use strict;
use Net::Domain::Regex;
use Data::Dumper;

my $r = Net::Domain::Regex->new;

# Slurp the whole email into a single string
my $email = do {
        local $/;
        open my $fh, '<', '/tmp/google.email' or die "Cannot open /tmp/google.email: $!";
        <$fh>;
};

# Filter out some of the domains from the headers etc.
my @results = grep { !($_->{tld} eq 'com' and $_->{domain} =~ /google|tucows|hostedemail/) } $r->match( $email );

# Count each FQDN, grouped under its domain name
my $domains = {};
for( @results ){
        print $_->{match},"\n";
        $domains->{$_->{domain} . "." . $_->{tld}}->{$_->{match}}++;
}

print "Parsed Domains:",Dumper( $domains ),"\n";

This results in the following output:

pblair@T500:~$ perl bin/extract_domains.pl
google.trakken.com
google.trakken.com
webcindario.com
febf653bd95.webcindario.com
webcindario.com
febf653bd95.webcindario.com
Parsed Domains:$VAR1 = {
          'trakken.com' => {
                             'google.trakken.com' => 2
                           },
          'webcindario.com' => {
                                 'webcindario.com' => 2,
                                 'febf653bd95.webcindario.com' => 2
                               }
        };
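
To make the shape of the data clearer, here is a minimal sketch that runs the same filter and grouping on hand-written records instead of a real email. The records are made up for illustration and assume only the three keys used in the script above (match, domain and tld). The mail.google.com entry and the %by_domain name are my own additions, not output from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written stand-ins for the hashes returned by match(). Only the
# three keys used in the script above are assumed here.
my @matches = (
    { match => 'mail.google.com',             domain => 'google',      tld => 'com' },
    { match => 'google.trakken.com',          domain => 'trakken',     tld => 'com' },
    { match => 'google.trakken.com',          domain => 'trakken',     tld => 'com' },
    { match => 'webcindario.com',             domain => 'webcindario', tld => 'com' },
    { match => 'febf653bd95.webcindario.com', domain => 'webcindario', tld => 'com' },
);

# Drop the .com domains we are not interested in, as the grep above does.
my @results = grep {
    !( $_->{tld} eq 'com' and $_->{domain} =~ /google|tucows|hostedemail/ )
} @matches;

# Group each full match under "domain.tld" and count how often it appears.
my %by_domain;
$by_domain{"$_->{domain}.$_->{tld}"}{ $_->{match} }++ for @results;

# Print one tab-separated line per (domain, FQDN) pair with its count.
for my $domain ( sort keys %by_domain ) {
    for my $fqdn ( sort keys %{ $by_domain{$domain} } ) {
        print "$domain\t$fqdn\t$by_domain{$domain}{$fqdn}\n";
    }
}
```

Note that the filter tests the domain key rather than the full match, which is why google.trakken.com survives: its domain is trakken, not google.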
