PEG Judge - CCC02 Stage 2

2002 Canadian Computing Competition, Stage 2

Day 1, Problem 1: Spam

Unsolicited email (spam) is annoying and clutters your mailbox. You are to write a spam filter - a program that reads email messages of regular ASCII characters and tries to determine whether or not each message is spam.

How can we determine whether or not a message is spam? Spam contains words and phrases that are not common in genuine email messages. For example, the phrase

MAKE MONEY FAST, HONEY!!

is in all-uppercase, contains the word "money" and ends with a double exclamation mark.

One way to create a spam filter is to read through many spam and non-spam messages and to come up with a set of rules that will classify any particular message as spam or not. This process can be tedious and error prone to do manually. Instead you will write a program to automate the process.

A useful step in automatic classification is to split the text up into set of trigrams. A trigram is a sequence of three adjacent characters that appear in the message. A trigram is case sensitive. The example above is composed of the trigrams:

MAK
 
AKE
 
KE
 
E M
 
 MO
 
MON
 
ONE
 
NEY
 
EY 
 
Y F
 
 FA
 
FAS
 
AST
 
ST,
 
T,
 
, H
 
 HO
 
HON
 
ONE
 
NEY
 
EY!
 
Y!!

If we examine a sample of spam and non-spam messages we find that some trigrams are more common in spam; whereas others are more common in non-spam. This observation leads to a classification method:

Examine a sample consisting of a large number of spam messages. Count the number of times that each trigram occurs. In the example above, there are 20 distinct trigrams; the trigrams ONE and NEY occur twice each and the remaining 18 trigrams occur once each. (Trigrams that do not occur are considered to occur 0 times.) More formally, for each trigram t we compute the frequency f_spam(t) with which it occurs in the sample of spam.
Examine a sample consisting of a large number of non-spam messages. Compute f_non-spam(t), the frequency with which each trigram t appears in the sample of non-spam.
For a each message to be filtered, compute f_message(t) for each trigram t.
If f_message resembles f_spam more closely than it resembles f_non-spam it is determined to be spam; otherwise it is determined to be non-spam.
A similarity measure determines how closely f₁ and f₂ resemble one another. One of the simplest measures is the cosine measure:

$\displaystyle\textrm{similarity}(f_1,f_2) = \frac{\sum_t^{} f_1(t) \cdot f_2(t)} {\sqrt{\sum_t^{}\left[f_1(t)\right]^2} \cdot \sqrt{\sum_t^{}\left[f_2(t)\right]^2}}$

Then we say that a message is spam if

similarity(f_message, f_spam) > similarity(f_message, f_non-spam)

Input

The first line of input contains three integers: s the number of sample spam messages to follow; n the number of sample non-spam messages to follow; c the number of messages to be classified as spam or non-spam, based on trigram the trigram frequencies of the sample messages. Each message consists of several lines of text and is terminated by a line containing "ENDMESSAGE". This line will not appear elsewhere in the input, and is not considered part of the message.

Output

For each of the c messages, your program will output two lines. On the first line, output similarity(f_message, f_spam) and similarity(f_message, f_non-spam). On the second line print the classification of the message ("spam" or "non-spam"). Round the numbers to five decimal digits.

When forming trigrams, we never include a newline character. We don't include trigrams that span multiple lines, either. So in the first spam message of Sample Input 1, the only trigrams are:

"AAA", "BBB", "BB ", "B ", " C", " CC", and "CCC".

Sample Input 1

2 1 1
AAAA
BBBB  CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE

Sample Output 1

0.21822 0.73030
 
non-spam

Sample Input 2

Found in this file

Sample Output 2

0.28761 0.20595
 
spam
 
0.44314 0.49243
non-spam

All Submissions
Best Solutions

Point Value: 10
Time Limit: 3.00s
Memory Limit: 16M
Added: Apr 17, 2009

Languages Allowed:
C++03, PAS, C, HASK, ASM, RUBY, PYTH2, JAVA, PHP, SCM, CAML, PERL, C#, C++11, PYTH3

Comments (Search)

EasierPlanet3 on May 31, 2015 - 11:55:04 pm UTC Off by one 100000th

(0/0)

On the "spam" component on the last test case only. Can anyone suggest a reason why?  Thanks,

On the "spam" component on the last test case only. Can anyone suggest a reason why? Thanks,

Alex on Sep 28, 2014 - 8:31:49 am UTC Clarification on bounds

(0/0)

The lengths of messages will not exceed ~5000 characters. In addition, 1 <= [i]s[/i], [i]n[/i], [i]c[/i] <= 5.

The lengths of messages will not exceed ~5000 characters. In addition, 1 <= s, n, c <= 5.

jargon on Apr 18, 2009 - 5:56:19 pm UTC Could someone explain...

(0/2)

... how the cosine measure works?

... how the cosine measure works?

bbi5291 on Apr 18, 2009 - 6:10:59 pm UTC Re: Could someone explain...

(3/0)

Let (x0, x1, x2, x3, ...) be the trigraph frequencies for X, and (y0, y1, y2, y3, ...) be the trigraph frequencies for Y. Then, the similarity between X and Y is given by
(x0*y0 + x1*y1 + x2*y2 + x3*y3 + ...) / (sqrt(x0^2 + x1^2 + x2^2 + x3^2 + ...) sqrt(y0^2 + y1^2 + y2^2 + y3^2 + ...))

Essentially, if X and Y are vectors to the points (x0, x1, x2, x3, ...) and (y0, y1, y2, y3, ...) in N-dimensional space, then the cosine measure gives the cosine of the angle between these two vectors. Since the cosine is monotone decreasing on [0,pi], the higher the cosine measure, the smaller the angle (that is, the closer are the vectors X and Y). It also follows that the cosine measure has an absolute value less than or equal to one.

Let (x0, x1, x2, x3, ...) be the trigraph frequencies for X, and (y0, y1, y2, y3, ...) be the trigraph frequencies for Y. Then, the similarity between X and Y is given by
(x0*y0 + x1*y1 + x2*y2 + x3*y3 + ...) / (sqrt(x0^2 + x1^2 + x2^2 + x3^2 + ...) sqrt(y0^2 + y1^2 + y2^2 + y3^2 + ...))

Essentially, if X and Y are vectors to the points (x0, x1, x2, x3, ...) and (y0, y1, y2, y3, ...) in N-dimensional space, then the cosine measure gives the cosine of the angle between these two vectors. Since the cosine is monotone decreasing on [0,pi], the higher the cosine measure, the smaller the angle (that is, the closer are the vectors X and Y). It also follows that the cosine measure has an absolute value less than or equal to one.