Simple Stemming with Perl
Stemming is the process for reducing inflected (or sometimes derived)
words to their stem, base or root form.
– Wikipedia article on
Stemming
Ever used a website that allowed you to tag content? Ever ended up accidently using slightly different tags? Something like graphs and graphing or blog and blogs? (I hope so, otherwise it’s just me…) To spot some of the more obvious overlaps you can stem each of the words and look for a common base. Where one’s found there is the possibility of mistaken duplication. For example if you passed hunts, hunted and hunting through a stemmer each would return ‘hunt’. If you want to try for yourself there are online stemmers available.
As a more concrete example let’s look at the wonderful service del.icio.us. You upload your own bookmarks, tag them with a number of keywords and can then group, sort and search them by your own defined terms. Except I have a habit of tagging articles about similar topics with nearly, but not quite the same tag.
The perl code below shows how easy it is (using Lingua::Stem from CPAN) to run your own data through a stemmer and look for overlaps. There are implementations in most languages (PyStemmer is also very nice) and the wikipedia article is actually a very easy to follow introduction.
#!/usr/bin/perl -w
use strict;
use warnings;
use Lingua::Stem;
use Net::Delicious;
my $del = Net::Delicious->new(
{
user => "username",
pswd => "password"
}
);
my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );
my %stems;
for my $tag ( $del->tags() ) {
my $stemmed = $stemmer->stem( $tag->tag );
push( @{ $stems{$stemmed->[0]} }, $tag->tag );
}
for my $stemmed (sort keys %stems ) {
# we only care about base words with more than one tag associated
next unless ( scalar @{ $stems{$stemmed} } > 1);
print "Possible duplicates -\n";
print " -- ";
print join(" : ", @{ $stems{$stemmed} }), "\n";
}