- Xanax as X.a.n.a.x
- Viagra as V.14.grA
- mortgage as m or tg agé
(The last one is interesting - it means you can't rely on whitespace delimiting words.)
Your task, should you choose to accept it, is to attempt to reverse the process: write a program which reveals the actual words. This could be (and doubtless is) used as a component in a spam-killing program. Obviously it is not possible to do this perfectly, so your program will need to take a heuristic approach.
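To make "heuristic approach" concrete, here is a deliberately naive sketch, not a reference solution. It assumes one possible strategy: fold look-alike characters to letters, throw away everything else (including whitespace, since spacing cannot be trusted), and then split the remaining run of letters into the fewest dictionary words. The `LOOKALIKES` table and the names `squash` and `segment` are made up for illustration.

```python
import unicodedata

# Hypothetical look-alike table; a serious entry would need a much larger one.
LOOKALIKES = {"1": "i", "4": "a", "0": "o", "3": "e", "5": "s", "@": "a", "$": "s"}


def squash(text):
    """Reduce text to a bare run of lowercase ASCII letters."""
    out = []
    # NFKD splits an accented letter into its base letter plus a combining mark.
    for ch in unicodedata.normalize("NFKD", text.lower()):
        ch = LOOKALIKES.get(ch, ch)
        if "a" <= ch <= "z":
            out.append(ch)
    return "".join(out)


def segment(letters, words):
    """Split letters into the fewest dictionary words, or return None."""
    longest = max(map(len, words), default=0)
    best = [None] * (len(letters) + 1)  # best[i] is the best split of letters[:i]
    best[0] = []
    for end in range(1, len(letters) + 1):
        for start in range(max(0, end - longest), end):
            piece = letters[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(letters)]
```

With a toy dictionary containing "viagra" and "mortgage", `segment(squash("V.14.grA m or tg agé"), words)` recovers both words. Squashing a whole line this way also erases the boundaries between honest words, and "fewest words" is only one of many possible scoring rules, so treat it as a starting point at best.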
Details: Your program will be passed one command-line argument: a file name containing a dictionary of English words, one-per-line. This will be in the same format as the hoary Unix /usr/dict/words file, and will in fact probably be that file with some extra spam-related words added.
Your program should read lines containing spam from standard input and write your version of the correct output to standard output. Each line of input will contain 0 or more words and bits of punctuation. It won't contain HTML tags or other markup. It'll be in some ill-defined 8-bit encoding. When you reach end-of-file, quit.
You should assume that the output will be used by another program, not a human, so you don't need to worry about preserving capitalization or punctuation.
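The plumbing described above might look something like the following sketch. It assumes Latin-1 as a stand-in for the "ill-defined 8-bit encoding", because Latin-1 maps every byte to a character and so decoding can never fail. The names `load_words`, `run`, and `main` are hypothetical, and the cleaning step is a placeholder that only lowercases each line.

```python
import sys


def load_words(path):
    """Read a one-word-per-line dictionary into a set of lowercase words."""
    with open(path, encoding="latin-1") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def run(words, source, sink, clean_line):
    """Pass each line of the binary stream source through clean_line."""
    for raw in source:
        line = raw.decode("latin-1").rstrip("\r\n")
        result = clean_line(line, words) + "\n"
        # "replace" keeps a stray non-Latin-1 character from crashing the run.
        sink.write(result.encode("latin-1", "replace"))


def main(argv):
    """Entry point: argv[1] is the dictionary file; spam arrives on stdin."""
    if len(argv) != 2:
        sys.stderr.write("usage: unspam DICTIONARY < spam.txt\n")
        return 2
    placeholder = lambda line, words: line.lower()  # the real heuristic goes here
    run(load_words(argv[1]), sys.stdin.buffer, sys.stdout.buffer, placeholder)
    return 0
```

A script built on this would finish with `sys.exit(main(sys.argv))`. Reading and writing the binary streams directly sidesteps whatever locale the judge's machine happens to use, and the loop ends by itself at end-of-file.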
Judging: I will be looking for safety, accuracy, elegance, and speed, in roughly that order.
- Safety: if your program crashes or I can see a buffer overflow, it's disqualified.
- Accuracy: the more your output corresponds to my notion of the words that are present, the better. If I get time, I may also actually hook your program up to a spam detector and look at how much you improve or degrade the false-accept and false-reject rates.
- Elegance: clean, comprehensible, "sweet" programs are preferred over brute-force spaghetti code.
- Speed: I should be able to run this on 1MB of input without having to wait more than 30 seconds.
Entering: Post a top-level comment announcing your code (don't describe it there, though!) and then a reply to it that contains the code itself, along with a description of the algorithm used and instructions for making the program run (remember to format your code for posting using tmoertel's code formatting program as described in How to post code to K5 -- the easy way!). Please do not post code, algorithm descriptions, or any direct hints at solving the problem as root-level comments here — it could spoil the fun for people who haven't had a chance to take a stab at the problem yet.
Please have your entries in by midnight PST on Sunday July 18th.