Print Story Programming Fun Challenge: spam checking
Diary
By edwin (Sun Jun 27, 2004 at 02:01:24 AM EST) (all tags)
Write a program to defeat one of the dirty tricks in the spammer's arsenal.


Like many people, I get sent a lot of spam. I use a Bayesian spam checker to guess which emails are spam and which are from real people. In order to do this, the spam-checking software needs to be able to work out what words are in the document. Hence, spammers have started disguising the incriminating words to make this harder. For example, they might write:
Xanax        as     X.a.n.a.x
Viagra       as     V.14.grA
mortgage     as     m or tg agé

(The last one is interesting - it means you can't rely on whitespace delimiting words.)

Your task, should you choose to accept it, is to attempt to reverse the process: write a program which reveals the actual words. This could be (and doubtless is) used as a component in a smap-killing program. Obviously it is not possible to do this perfectly, so your program will need to take a heuristic approach.

Details: Your program will be passed one command-line argument: a file name containing a dictionary of English words, one-per-line. This will be in the same format as the hoary Unix /usr/dict/words file, and will in fact probably be that file with some extra spam-related words added.

Your program should read lines containing spam from standard input and write your version of the correct output to standard out. Each line of input will contain 0 or more words and bits of punctuation. It won't contain HTML tags or other markup. It'll be in some ill-defined 8-bit encoding. When you reach end-of-file, quit.

You should assume that the output will be used by another program, not a human, so you don't need to worry about preserving capitalization or punctuation.

Judging: I will be looking for safety, accuracy, elegance, and speed, in roughly that order.

  • Safety: if your program crashes or I can see a buffer overflow, it's disqualified.
  • Accuracy: the more your output corresponds to my notion of the words that are present, the better. If I get time, I may also actually hook your program up to a spam detector and look at how much you improve or degrade the false-accept and false-reject rates.
  • Elegance: clean, comprehensible, "sweet" programs are preferred over brute-force spaghetti code.
  • Speed: I should be able to run this on 1MB of input without having to wait more than 30 seconds.

Entering: Post a top-level comment announcing your code (don't describe it there, though!) and then a reply to it that contains the code itself, along with a description of the algorithm used and instructions for making the program run (remember to format your code for posting using tmoertel's code formatting program as described in How to post code to K5 -- the easy way!). Please do not post code, algorithm descriptions,  or any direct hints at solving the problem as root-level comments here — it could spoil the fun for people who haven't had a chance to take a stab at the problem yet.

Please have your entries in by midnight PST on Sunday July 18th.

< Grab your elephant method | BBC White season: 'Rivers of Blood' >
Programming Fun Challenge: spam checking | 49 comments (49 topical, 0 hidden) | Trackback
Two m.U CH l1k3 by yicky yacky (5.33 / 3) #1 Sun Jun 27, 2004 at 02:10:03 AM EST

H@rD w0r,k 4 a P F C


----
Vacuity abhors a vacuum.
Nah by codemonkey uk (5.00 / 1) #2 Sun Jun 27, 2004 at 02:22:11 AM EST
The great thing about this is that it's as little or as much work as you like, and a good entry is actually a useful tool.

Consider this: Just outputing your input is a value (non disqualified) entry. Add a per-word call back, so you can plug and strip do a simple replacement of "invalid" words containing numerals, ie: V1agra -> fails dictionary test -> per char test: "1" maps to "i" -> dictionary pass -> output "viagra". Bingo. Expand on that.

--- Thad ---
Almost as Smart As you.

[ Parent ]
That's kinda my point, though. by yicky yacky (6.00 / 2) #4 Sun Jun 27, 2004 at 02:45:44 AM EST

You can see numerous ways to attack it straight away, but to do it well, in a language such as C(++), would take days, and be the basis of an interface to a full-blown product. Leave it to the perl-monkeys and their we-love-kneeding-text perversions. ;)


----
Vacuity abhors a vacuum.
[ Parent ]
Hmmm. I see your point. by codemonkey uk (6.00 / 2) #5 Sun Jun 27, 2004 at 02:56:36 AM EST
But that's not going to stop me giving it a go.

--- Thad ---
Almost as Smart As you.
[ Parent ]
I don't understand by psctsh (6.00 / 1) #13 Sun Jun 27, 2004 at 08:56:11 AM EST
Why go to all the trouble to strip words and recombine them and whatnot--why not simply throw out any emails containing more than X% misspelled words? 

I mean, realistically, if anyone sends a legitimate message without being able to understand the difference between words, numbers and spaces--fuck 'em.  I'm not going to read it.

[ Parent ]
I was just thinking the same thing by Xenocide (3.00 / 0) #16 Sun Jun 27, 2004 at 01:01:09 PM EST
I've been getting a lot of russian spam lately, but I can't read russian, so the joke's on them. On most browsers it looks like a ton of broken ascii text. At first I thought about a keyword trigger list, but when the "m or tg agé" example showed up a spell check failure came to mind. So that would stop funny spellings and other techniques, in theory. I'd expect if it became popular then 1) AOL would have a massive spammer problem 2) real spam would begin to include more and more pointless text at the bottom in an effort to bring the percentage down. Of course, doubling the bandwidth price of spam halves the amount of spam out there, although the impact on inbox sizes is negligable.

[ Parent ]
NOTE by gazbo (5.50 / 2) #3 Sun Jun 27, 2004 at 02:29:24 AM EST
There's a much easier way to post code to HuSi - wrap it in <ecode></ecode> tags.  It'll escape special chars automatically, and will work for both HTML and Autoformat post modes.

I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

Heh by hulver (3.00 / 0) #28 Mon Jun 28, 2004 at 01:46:05 AM EST
I've just broken it.

I had to resort to the perl formatting script and HTML format to post it, as ecode changed my é into &#233;
--
Cheese is not a hat. - clock

[ Parent ]
Bah by gazbo (3.00 / 0) #39 Mon Jun 28, 2004 at 04:01:54 AM EST
Well, it works in HTML formatted at least:

ŠOeŽšoežŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

I'll try fixing it for AF tonight.


I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

[ Parent ]
Patch by gazbo (3.00 / 0) #42 Mon Jun 28, 2004 at 08:04:02 AM EST
Here

I built the patch (Utility.pm) against the latest CVS, as I'm not sure what state yours is in.  If that's a problem then I'm sure I can get you a different version.

I've just noticed that it also happens to contain a tweak I wrote to prevent autoformat from munging "I == teh rock and yuo == teh suck", as I've seen a few of those getting turned into <tt>

Figuring out how the patch works is left as an exercise for the reader :-)  You'll probably want to try and break it before putting it live, as it's a decidedly non-trivial (if small) patch.


I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

[ Parent ]
Unusual request. by Evil Cloaked User (3.00 / 0) #6 Sun Jun 27, 2004 at 03:27:28 AM EST
I don't actually get much spam. Any chance you could send me some so that I have some sample files to work on? (r@nk13 @t e1r(0m d0t n3t (after passing it through a functioning solution that is) should get to me. Thanks.

Heh by gazbo (6.00 / 1) #7 Sun Jun 27, 2004 at 03:37:02 AM EST
I like the way you've spam-proofed your address.  After all, you wouldn't want your request for spam to be met by spam, now, would you :-P

I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

[ Parent ]
Spam corpus by sam (3.00 / 0) #11 Sun Jun 27, 2004 at 04:10:03 AM EST
You can download bundles of gzipped spam at spamarchive.org. This "x. an@x"-type spelling crap is fairly new though, so there mightn't be too much of it there.

[ Parent ]
Damnation. by Evil Cloaked User (6.00 / 2) #12 Sun Jun 27, 2004 at 05:35:50 AM EST
I've just gone and crashed the office gateway machine by running a "suboptimal" PFC attempt on it. Fuck. Looks like an early start tomorrow then.

[ Parent ]
Examples by edwin (3.00 / 0) #8 Sun Jun 27, 2004 at 03:38:40 AM EST
Here are some chunks o'spam you may find useful as test data:
QUICKLY succeed by getting a   Ba c helor,   M a ster or   D o ctorate   U n iversity Degr e e

Habe eben im Fernsehen einen Bericht gesehen, in dem klar hervorging, dass libenesiche und kurdische Moslems in Berlin die Drogenszene und teilweise sogar das Rotlicht-Milieu beherrschen.

C1ALI-S & LEV1T-RA a1l0ws men to achieve an ERECTION up to 36 h0urs after 1NGESTION.

80+% savings O.rdering !
No p|rescription required .
D0CTOR & F.D.A a~ppr0ved !
Overnight s`hipping !

N0w V1SIT Our W^EBS1TE : C~1~I~C~K H~E~R~E

The _cable_filter will all0w you to receive

all the channels that you 0rder with your remote contr0l,}

want pay_perviews,XXX-movies,sp0rt events,special-events,^

http://b9002hosting.com/cable/

curium ,phase
,montpelier ,bonanza .

downdraft ,seepage
,i'd ,penates .Cheap softawre for you.

Special Offer #1
Windows XP Professional + Office XP Professional - $80
Special Offer #2
Adobe - Photoshop 7, Premiere 7, Illustrator 10 - $120
Special Offer #3
Macromedia Dreamwaver MX 2004 + Flash MX 2004 - $100

Click here for more cheap soft

Wonder why our priices are unbelievably L0W?
We are currently clearing our goods at incredibily cheeap sa1e-priice in connection with the shutdown of our shop and the closure of the stockhouse. Don't missss your lucky chance to get the best priicce on discoouunt software!


Sure, I'll write you a spam filter by Rogerborg (6.00 / 1) #9 Sun Jun 27, 2004 at 03:42:28 AM EST
Would you like me to get you a beer while I'm up?  Because, you know, that'd be no trouble at all!

-
Metus amatores matrum compescit, non clementia.
Yeah by hulver (3.00 / 0) #10 Sun Jun 27, 2004 at 03:58:18 AM EST
Couple of Packets of Crisps as well, thanks.
--
Cheese is not a hat. - clock
[ Parent ]
Jellied eels? by Rogerborg (3.00 / 0) #23 Sun Jun 27, 2004 at 08:59:19 PM EST
Or are you just walking a bit funny?

-
Metus amatores matrum compescit, non clementia.
[ Parent ]
No thanks by hulver (3.00 / 0) #24 Sun Jun 27, 2004 at 09:07:44 PM EST
I'm a northerner. Pickled egg would be nice though.
--
Cheese is not a hat. - clock
[ Parent ]
CAPTCHA by dr k (6.00 / 1) #14 Sun Jun 27, 2004 at 12:54:07 PM EST
I think this qualifies as a CAPTCHA, i.e. a Turing test to determine if you are human.

:| :| :| :| :|

Yep by Siddhi (6.00 / 1) #22 Sun Jun 27, 2004 at 06:40:54 PM EST
This problem is by no means easy. It looks easy, but it has a lot of hidden difficulties.

[ Parent ]
fun, but no cigar by martingale (6.00 / 1) #15 Sun Jun 27, 2004 at 12:57:10 PM EST
Just a quick note for those who think this challenge is helpful: deobfuscating words is a bad idea for spam filtering. It is much better (accurate) to keep the words obfuscated. For example, think of the obfuscation as somebody talking with a funny accent. When you deobfuscate, you remove the accent. But then all you hear is somebody talking about something, while before you could immediately spot something was fishy because of the accent. Bayesian filters work the same way.

As a challenge with no practical applications however, I'm sure this is fun.
--
$E(X_t|F_s) = X_s,\quad t > s$

Not necessarily by ucblockhead (3.00 / 0) #17 Sun Jun 27, 2004 at 01:04:17 PM EST
The goal of the spammer is to sell his product. As such, he's got to communicate what that product is to the recipient. The reason for all of the obfuscation is that the product is the easiest thing for the bayesian filters to latch onto. Taking advantage of this problem the spammers have should give you an advantage over a filter that doesn't.
---
[ucblockhead is] useless and subhuman
[ Parent ]
nope by martingale (3.00 / 0) #18 Sun Jun 27, 2004 at 01:54:52 PM EST
That doesn't matter in the face of obfuscation. You're right that some spam is about advertising a product (but not all, much spam is about verifying the existence of mailboxes - if it doesn't bounce, it's a valid mail address which can be sold).

Now imagine talking to a child (= bayesian filter). You can talk about his friends, or you can talk about some political topic say. He can easily make out the difference. Now talk about politics in French. Does that make it any more likely he thinks it's about his friends? No, he'll place it in the political category, because both are incomprehensible (= not ham) to him.

Similarly, obfuscation serves to confuse spam within the collection of all spam messages, but doesn't make the message more like ham. On the contrary, ham messages don't obfuscate words as a general rule, so simply spotting some obfuscation is enough to strongly indicate spam, whatever spam-subtopic it turns out to be.

Of course, obfuscation may make it hard to spot the exact spam-subtopic, but who cares? Most people want to split ham vs spam, not spam/pr0n vs spam/mortgage.

But if you reverse the obfuscation, you help distinguish spam/pr0n vs spam/mortgage, but with a reduced vocabulary, you also make it harder to distinguish ham vs spam.

Does that make sense?
--
$E(X_t|F_s) = X_s,\quad t > s$

[ Parent ]
This just goes to show by Siddhi (3.00 / 0) #19 Sun Jun 27, 2004 at 02:17:38 PM EST
how amazing the human brain is in pattern recognition. We have no problem reading most of these obfuscated words at nearly full speed, even accross word boundaries.

Clarification by Siddhi (3.00 / 0) #20 Sun Jun 27, 2004 at 02:27:21 PM EST
Just a clarification, is this valid input - (mortgage as morgage, morkgage, or morttgage)

Any limits on the types of input we can get, or anything goes ?

and also by Siddhi (3.00 / 0) #21 Sun Jun 27, 2004 at 05:44:40 PM EST
will the dictionary contain grammatical words (a, an, and, the...) or do we need to check for them seperately in our code ?

[ Parent ]
Yes by edwin (3.00 / 0) #47 Mon Jun 28, 2004 at 02:40:19 PM EST
The dictionary will have pretty much every English word in it, including some proper names.

[ Parent ]
All of the above are legal by edwin (3.00 / 0) #46 Mon Jun 28, 2004 at 02:39:20 PM EST
I'm not sure what the question is, exactly. Sorry.

Basically any string of characters is legal input. However, if a human couldn't parse it as English, your program won't be penalized for not doing so either.

[ Parent ]
Thanks by Siddhi (3.00 / 0) #51 Wed Jun 30, 2004 at 01:01:02 AM EST
Yeah, basically what I wanted to know was if only the 3 obfuscations mentioned in the example were permitted, or any obfuscation is allowed.

[ Parent ]
Examples by hulver (3.00 / 0) #25 Mon Jun 28, 2004 at 01:26:50 AM EST
Has anybody got a script that will just strip out the body text of an email, and remove html tags as well?

I could knock one up, but it somebody already has one, that would be good.
--
Cheese is not a hat. - clock

Initial go by hulver (3.00 / 0) #26 Mon Jun 28, 2004 at 01:32:53 AM EST
This is just something I threw together. It's quite simple.

On an Athlon 2000xp with a 400k dictionary file and a 2.5mb spam file it runs in about 8s.
--
Cheese is not a hat. - clock

The code v 1. by hulver (3.00 / 0) #27 Mon Jun 28, 2004 at 01:44:56 AM EST
#!/usr/bin/perl

use strict;

my %words;
my $dict_name = shift;

open(WORDS, "<$dict_name") or die "$dict_name not found\n";

while (my $word = <WORDS>) {
       chomp $word;
       $words{lc($word)} = 1;
}
close WORDS;

my $text;
while (my $temp = <STDIN>) {
       $text .= "$temp ";
}

$text =~ s/[
--
Cheese is not a hat. - clock
[ Parent ]
Explination. by hulver (3.00 / 0) #29 Mon Jun 28, 2004 at 01:57:09 AM EST
This script does the simplest text substitution it can. It replaces common replacements é -> e etc, and attempts to remove obscufaction punctuation as best it can.

It then attempts to merge words together to find words in the dictionary.

get ho n ors deg ree

Would check

get

getho

gethon

gethonors

As that wouldn't be found, it would print the get and then move on

ho

hon

honors

It would then find that in the dictionary and continue

deg

degree

etc etc.

This does slow it down some, but it's still fairly nippy on my machine. This does fall down if you have a word that is in the dictionary, but part of another word that is in the dictionary.
A smaller, more refined dictionary of your problem words works best for this program, but it will perform OK with larger dictionaries, it will just miss some words if they are split on word boundries.

So it will fail if the input text is

V I A G R A

and

via is in the dicitonary.

I'm considering changing it to match the longest word, but I'm not sure how efficent at finding matches that will be.

I'll test it on some proper spam, and see how it goes.
--
Cheese is not a hat. - clock

[ Parent ]
V I A G R A by codemonkey uk (3.00 / 0) #30 Mon Jun 28, 2004 at 02:20:41 AM EST
But if GRA isn't in the dictionary, it could try merging that fragment left or right to make a valid word.

--- Thad ---
Almost as Smart As you.
[ Parent ]
Mmmm. by hulver (3.00 / 0) #31 Mon Jun 28, 2004 at 02:24:17 AM EST
I'm only doing left to right matching at the moment.

Right to left could get very complicated. Depends how far you try to merge as well.

I'm tempted to do "longest matching word" matching. I'll do that next. I think right to left would be too complicated,

Nice idea though.
--
Cheese is not a hat. - clock

[ Parent ]
I was considering that. by Evil Cloaked User (3.00 / 0) #32 Mon Jun 28, 2004 at 02:40:32 AM EST
There's an upper bound on the length of actual valid English words, so the plan is to put a variable somewhere stating an upper bound on words to look for and tweak it for the best speed/accuracy trade-off.

One piece of nastiness occurred to me though. The examples given use whitespace where they shouldn't so you can't rely on it as an actual word boundary. What's to stop them containing a lack of white-space where there should be white-sapce? Now that would be a fuck up.

[ Parent ]
Yeah by hulver (3.00 / 0) #34 Mon Jun 28, 2004 at 02:46:16 AM EST
I don't handle that at all. If it's that fucked up, I'm not touching it. Mine handles solutions with too much whitespace, but not with too little.

I'm thinking some deep voodoo would be required to handle that.
--
Cheese is not a hat. - clock

[ Parent ]
Don't Say That by codemonkey uk (3.00 / 0) #40 Mon Jun 28, 2004 at 04:05:00 AM EST
I hate it when people imply some problem can't be done, or would be really hard to do. It just makes me want to do it.

--- Thad ---
Almost as Smart As you.
[ Parent ]
Yep by hulver (6.00 / 2) #41 Mon Jun 28, 2004 at 04:13:44 AM EST
Next to fucking impossible. Only a genius could do it ;)
--
Cheese is not a hat. - clock
[ Parent ]
Thinking some more by hulver (3.00 / 0) #33 Mon Jun 28, 2004 at 02:44:19 AM EST
A "longest match first" solution would do the same thing as a right - left search, plus it's much simpler.
--
Cheese is not a hat. - clock
[ Parent ]
Damn you. by Evil Cloaked User (3.00 / 0) #36 Mon Jun 28, 2004 at 02:58:00 AM EST
Gimme back my algorithm!

[ Parent ]
Another thing by Evil Cloaked User (3.00 / 0) #37 Mon Jun 28, 2004 at 03:02:44 AM EST
Transposing 1 to i will give you trouble hu1ver.

[ Parent ]
Yes, I know. by hulver (3.00 / 0) #38 Mon Jun 28, 2004 at 03:20:44 AM EST
I might have to leave the 1 transposition until dictionary search time, and check them both. It seems to be quite interchangeable.
--
Cheese is not a hat. - clock
[ Parent ]
Also by Siddhi (3.00 / 0) #45 Mon Jun 28, 2004 at 02:05:08 PM EST
 I like the algorithm. Its simple and fast should work well for most practical cases.

Distinguishing between obfuscating punctuation and legal punctuation seems to be a particularly intractable problem.

The same goes with double letters...distinguishing between legal and illegal double letters. For example, in your code, the phrase "best pricee sofftware" would be left out, because the double e and double f are marked as valid.

I suspect anything more complex than this would be going into the realm of AI. This code could well win because other algorithms are too complex to work on the simple cases. At least that has been the result of my experiments so far.

[ Parent ]
Useful info by gazbo (3.00 / 0) #35 Mon Jun 28, 2004 at 02:53:01 AM EST
Just something I stumbled across in the PHP manual's user notes that may be useful and take some donkey work out:

return strtr($string, "ŠOeŽšoežŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ", "SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy");

A fairly useful repository of accented letters and their unaccented equivalents.


I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

Is it just me... by edwin (5.00 / 1) #48 Mon Jun 28, 2004 at 03:26:28 PM EST
...or is there some rave tune secret message embedded in that string?
AAAAAACEEEEIIIID!!!

OK, it probably is just me.

[ Parent ]
Dictionary Input File by codemonkey uk (6.00 / 1) #49 Mon Jun 28, 2004 at 07:20:47 PM EST
You know, it might actually be better to have a large body of valid text than a straight dictionary file in order to use word use frequency as a way of selecting where there is abiguity.

--- Thad ---
Almost as Smart As you.
I was thinking about this myself by gazbo (3.00 / 0) #50 Mon Jun 28, 2004 at 09:15:48 PM EST
The plus side is obvious, but the minus side is that it will be harder for edwin to test, and also it adds an (admittedly minor) extra step to everyone's code.

If we were going to go that way, I'd suggest a dictionary file that had a ranking with each word.  Same net result, but the extra coding and pre-processing is done once, by edwin.

Same minus applies about it being harder work for him, of course.  And for this application I guess it's no real inconvenience to just choose a random word / longest word / whatever.  Or backing off if your code is really fast.


I recommend always assuming 7th normal form where items in a text column are not allowed to rhyme.

[ Parent ]