Clever Spam
By whazat (Sun Jun 13, 2004 at 09:56:07 AM EST) (all tags)
I just got some interesting spam the text of which is enclosed in the body.

Poll: Fight against spam

Hello, I have a special_offer for you...
WANT TO LOSE WEIGHT?
The most powerful weightloss is now available
100% Money Back Guarantée!
• Lose up to 19% Total Body Weight.
• Up to 300% more Weight Loss while dieting.
• Loss of 20-35% abdominal Fat.
• Reduction of 40-70% overall Fat under skin.
• Increase metabolic rate by 76.9% without Exercise.
• Boost your Confidence level and Self Esteem.
• Burns calorized fat.
• Suppresses appetite for sugar.

---- system information ----
graphics believes mail months cultural its Australia obtains
takes invite back limited as reflection Please those
means allows while information to Use internal patterns
like items world POSIX However alone doesn't One
believes Simplified environments throughly taking deduced When
populate

So if I remember Bayesian filters correctly: if you mark this as spam, along with lots of other messages containing "POSIX" and "environments", then slowly but surely your Bayesian filter will get more and more fucked up and register too many false positives.

If so, expect lots of spam with Quake, Osama bin Laden, NVIDIA and other common phrases people use in purposeful messages tacked onto the end, in order to spoil the effectiveness of filters.
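The worry can be sketched with a toy naive Bayes word scorer. This is only a sketch under assumptions: real filters weight and combine many words, and the "corpora" below (the word lists and the count of 50 messages) are invented for illustration.

```python
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
n_spam = n_ham = 0

def train(words, is_spam):
    global n_spam, n_ham
    if is_spam:
        spam_counts.update(set(words))
        n_spam += 1
    else:
        ham_counts.update(set(words))
        n_ham += 1

def p_spam_given(word):
    # Laplace-smoothed estimate of how "spammy" a single word looks
    ps = (spam_counts[word] + 1) / (n_spam + 2)
    ph = (ham_counts[word] + 1) / (n_ham + 2)
    return ps / (ps + ph)

# Your normal mail mentions "posix" constantly...
for _ in range(50):
    train(["posix", "environments", "patch"], is_spam=False)
before = p_spam_given("posix")   # well below 0.5: a "good" word

# ...then you mark 50 poisoned spams containing "posix" as spam.
for _ in range(50):
    train(["weightloss", "guarantee", "posix"], is_spam=True)
after = p_spam_given("posix")    # drifts back toward neutral

print(before, after)
```

The planted word drifts from hammy toward neutral, which is exactly the effect described above: the question (taken up in the replies) is whether that matters once all the other words in a message are counted too.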

Winning vs Spam by codemonkey uk (6.00 / 2) #1 Sun Jun 13, 2004 at 10:15:45 AM EST
Once you are on their lists, you've already lost. Winning is staying off the mailing lists.

Almost as Smart As you.
You should be okay. by squigs (3.00 / 0) #2 Sun Jun 13, 2004 at 11:05:57 AM EST
If you talk a lot about POSIX environments, then that will suggest a non-spam value for this specific email.  On the other hand, it uses "Weightloss" and "Prescription".  I'll have to check, but I suspect none of my non-spam emails have either of these terms.  If they keep trying the same tricks, then "Guarantée!" will be added as well.

Dieticians may well find a number of these get through, but then their mail may tend not to have the words "WANT" and "LOSE", assuming the filter is case sensitive.  It would be foolish not to be, since you can get a fairly accurate filter just by counting the number of capitals.

Missing poll option: by Vladinator (3.00 / 0) #3 Sun Jun 13, 2004 at 12:16:52 PM EST
"With high explosives"
--

LRSE Hosting. We do weblog hosting.

Two things. by ambrosen (6.00 / 1) #4 Sun Jun 13, 2004 at 12:17:59 PM EST
I'm surprised no-one uses a bigram model to stop the random word lists. I did see a paper about it a while back, but I didn't read it in detail. IIRC it was at CLUK 2004 in Birmingham, for the interested.

Also, I got an email from Citi Identity Theft Solutions today, which amused me. "Oh, Thelma, shall I do what these nice Identity Thieves are suggesting?". Made me laugh, anyway.

bigrams by martingale (3.00 / 0) #5 Sun Jun 13, 2004 at 12:24:37 PM EST
Various people have tried (and do use) bigram models, but there are downsides: much bigger storage requirements (hint: it's not just twice as many words, it's more like ten times as many), much bigger training samples (more tokens = more data needed to estimate each token's probability), and a negligible accuracy gain.
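The blow-up in unique features is easy to see on a synthetic corpus. The vocabulary size (100 words) and corpus length (10,000 tokens) here are made-up numbers; real text is less random, but the direction of the effect is the same.

```python
import random

# Draw a synthetic "corpus" from a small vocabulary and compare the
# number of unique unigrams to the number of unique bigrams.
random.seed(0)
vocab = [f"w{i}" for i in range(100)]
tokens = [random.choice(vocab) for _ in range(10_000)]

unigrams = set(tokens)
bigrams = set(zip(tokens, tokens[1:]))

print(len(unigrams), len(bigrams))
```

The unigram count saturates at the vocabulary size, while the bigram count keeps growing toward the square of it, which is where the extra storage and training data go.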
--
$E(X_t|F_s) = X_s,\quad t > s$
Well, by ambrosen (3.00 / 0) #7 Sun Jun 13, 2004 at 12:27:39 PM EST
a good bigram model of English isn't a big thing, and at least that should do for English text, at least as a warning. I mean, you could tag the text and use bigrams of tags, as well.

nah by martingale (3.00 / 0) #9 Sun Jun 13, 2004 at 12:37:33 PM EST
1) Spammers don't speak English

2) You still need lots of data to learn the English bigram model. Just like you need lots of data to learn a spam bigram model. Here "lots" means "a lot more than with a unigram model".
--
$E(X_t|F_s) = X_s,\quad t > s$

Fine. by ambrosen (3.00 / 0) #11 Sun Jun 13, 2004 at 09:49:00 PM EST
I've got a partition here with 100 million words of English as she is spoke on it. That's enough to do a reasonable bigram model.

ok by martingale (3.00 / 0) #12 Sun Jun 13, 2004 at 11:17:18 PM EST
That's plenty. How much storage does it take? (count the unique words + count the unique bigrams) * record size.
--
$E(X_t|F_s) = X_s,\quad t > s$
Not sure. by ambrosen (3.00 / 0) #13 Sun Jun 13, 2004 at 11:23:16 PM EST
But I'd be surprised if it was more than 10MB.

be surprised by martingale (3.00 / 0) #15 Mon Jun 14, 2004 at 12:03:49 AM EST
The Project Gutenberg collected works of Mark Twain is an ASCII file with 50096 unigrams (converted to lower case), or 62936 unigrams (original case). The same file has 919357 bigrams (lower case), or 1018764 bigrams (original case). Typical spam filters use about 10 bytes or more for each feature, so keeping original case and both bigrams and unigrams would require about 20 MB. Moreover, most of the features occur only once in that corpus, which isn't much use if you want to estimate occurrence probabilities for rarish words. With your 100 million word English corpus, you would be looking at a fairly large storage requirement, methinks. Of course, if you use that much data, you'd better end up more than only a few percent more accurate.
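The back-of-envelope arithmetic, using the Twain counts quoted above. The per-feature record size is an assumption: roughly 10 bytes of counts and probabilities plus the average token text itself.

```python
# Storage estimate for original-case unigrams + bigrams.
unigrams = 62_936         # unique unigrams, original case
bigrams = 1_018_764       # unique bigrams, original case
bytes_per_feature = 18    # assumption: counts + average token text

total_bytes = (unigrams + bigrams) * bytes_per_feature
print(total_bytes / 1e6)  # roughly 19.5 MB
```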
--
$E(X_t|F_s) = X_s,\quad t > s$
Bizarrely enough. by ambrosen (3.00 / 0) #6 Sun Jun 13, 2004 at 12:26:02 PM EST
I went to the address given in the email, and it redirects to a citibank.com address that does ask for your PIN and ID number. Obviously I didn't enter them, being a Lloyds TSB customer, but if they redirect from a faked site (and it was a numeric address given: http://61.128.198.62/verify/), you'd think they'd at least tell you not to reply to spams.

you remember wrong by martingale (3.00 / 0) #8 Sun Jun 13, 2004 at 12:33:08 PM EST
A good bayesian filter factors in many, many words. A typical spam has maybe 300 words, and you're worried about POSIX? Let's say the spammer picks 10 of your most frequent good words, that's 1/30 of the message. That's 29/30 of the message containing "bad" or unknown words.

Now say the spam got through, because those 10 good words were so good they swamped the other 290 word probabilities taken together. Either it's a fluke, or, if the spammer reuses those 10 words again and again, those examples will cancel out their likelihood of being good. Your really "good" words are then a different bunch, and the spammer has painted himself into a corner.
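The 10-against-290 arithmetic, with hypothetical per-word probabilities (the 0.1 and 0.7 figures are invented for illustration): a naive Bayes filter effectively sums log-odds over the tokens, so the 29/30 majority dominates easily.

```python
import math

# 10 planted "good" words at P(spam|w) = 0.1, and 290 spammy or
# neutral words averaging P(spam|w) = 0.7.
good_words = 10 * math.log(0.1 / 0.9)    # negative: pulls toward ham
spam_words = 290 * math.log(0.7 / 0.3)   # positive: pulls toward spam

score = good_words + spam_words          # > 0 means "spam"
print(score)
```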
--
$E(X_t|F_s) = X_s,\quad t > s$

My worry was that by whazat (3.00 / 0) #10 Sun Jun 13, 2004 at 08:24:45 PM EST
He has turned words that you commonly use into tags for spam, so your filter flags too many of your normal emails, which makes you less likely to keep using the spam filter at all (as you don't want to miss useful emails).

--
The revolution will not be realised
clarification by martingale (3.00 / 0) #14 Sun Jun 13, 2004 at 11:28:30 PM EST
Remember that what matters is not the individual words, but the overall statistical frequencies. Think of it this way: if your name commonly appears as Whazat Comagin, then the frequencies for Whazat will be identical to the frequencies for Comagin, so you can think of the pair (Whazat, Comagin) as an induced "token". Now say the spammer tries to poison the word whazat by using it in spam: he's attacking the "whazat" token, not the (Whazat, Comagin) token. So there's a discrepancy which a proper statistical filter ought to detect automatically. What matters isn't just the individual words, but all the words taken together as a whole, with their relative frequencies.

The other thing is that even if the spammer uses whazat a whole lot, once he grows tired of using it, the fact that your good email uses it constantly will automatically make it an important good indicator again. The important words in your good email simply shift to whichever ones happen to be most informative.
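A toy version of the (Whazat, Comagin) argument; the two little corpora are made up. In the good mail the two names always co-occur, so their counts agree; in the poisoned spam only "whazat" appears, and the mismatch is exactly the detectable discrepancy.

```python
from collections import Counter

ham = [["whazat", "comagin", "meeting", "notes"]] * 40
spam = [["whazat", "weightloss", "guarantee"]] * 40

def word_counts(messages):
    c = Counter()
    for msg in messages:
        c.update(msg)
    return c

h, s = word_counts(ham), word_counts(spam)
print(h["whazat"] == h["comagin"])  # the pair moves together in ham
print(s["whazat"] == s["comagin"])  # the correlation breaks in spam
```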
--
$E(X_t|F_s) = X_s,\quad t > s$
