2002 Canadian Computing Competition, Stage 2

Day 1, Problem 1: Spam

Unsolicited email (spam) is annoying and clutters your mailbox. You are to write a spam filter - a program that reads email messages of regular ASCII characters and tries to determine whether or not each message is spam.

How can we determine whether or not a message is spam? Spam contains words and phrases that are not common in genuine email messages. For example, the phrase

MAKE MONEY FAST, HONEY!!

is in all-uppercase, contains the word "money" and ends with a double exclamation mark.

One way to create a spam filter is to read through many spam and non-spam messages and to come up with a set of rules that will classify any particular message as spam or not. This process can be tedious and error prone to do manually. Instead you will write a program to automate the process.

A useful step in automatic classification is to split the text up into set of trigrams. A trigram is a sequence of three adjacent characters that appear in the message. A trigram is case sensitive. The example above is composed of the trigrams:

MAK
AKE
KE
E M
MO
MON
ONE
NEY
EY
Y F
FA
FAS
AST
ST,
T,
, H
HO
HON
ONE
NEY
EY!
Y!!

If we examine a sample of spam and non-spam messages we find that some trigrams are more common in spam; whereas others are more common in non-spam. This observation leads to a classification method:

  • Examine a sample consisting of a large number of spam messages. Count the number of times that each trigram occurs. In the example above, there are 20 distinct trigrams; the trigrams ONE and NEY occur twice each and the remaining 18 trigrams occur once each. (Trigrams that do not occur are considered to occur 0 times.) More formally, for each trigram t we compute the frequency fspam(t) with which it occurs in the sample of spam.
  • Examine a sample consisting of a large number of non-spam messages. Compute fnon-spam(t), the frequency with which each trigram t appears in the sample of non-spam.
  • For a each message to be filtered, compute fmessage(t) for each trigram t.
  • If fmessage resembles fspam more closely than it resembles fnon-spam it is determined to be spam; otherwise it is determined to be non-spam.
  • A similarity measure determines how closely f1 and f2 resemble one another. One of the simplest measures is the cosine measure:


\displaystyle\textrm{similarity}(f_1,f_2) = \frac{\sum_t^{} f_1(t) \cdot f_2(t)}
{\sqrt{\sum_t^{}\left[f_1(t)\right]^2} \cdot \sqrt{\sum_t^{}\left[f_2(t)\right]^2}}

Then we say that a message is spam if

similarity(fmessage, fspam) > similarity(fmessage, fnon-spam)

Input

The first line of input contains three integers: s the number of sample spam messages to follow; n the number of sample non-spam messages to follow; c the number of messages to be classified as spam or non-spam, based on trigram the trigram frequencies of the sample messages. Each message consists of several lines of text and is terminated by a line containing "ENDMESSAGE". This line will not appear elsewhere in the input, and is not considered part of the message.

Output

For each of the c messages, your program will output two lines. On the first line, output similarity(fmessage, fspam) and similarity(fmessage, fnon-spam). On the second line print the classification of the message ("spam" or "non-spam"). Round the numbers to five decimal digits.

When forming trigrams, we never include a newline character. We don't include trigrams that span multiple lines, either. So in the first spam message of Sample Input 1, the only trigrams are:

"AAA", "BBB", "BB ", "B ", " C", " CC", and "CCC".

Sample Input 1

2 1 1
AAAA
BBBB  CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE

Sample Output 1

0.21822 0.73030
non-spam

Sample Input 2

Found in this file

Sample Output 2

0.28761 0.20595
spam
0.44314 0.49243
non-spam

All Submissions
Best Solutions


Point Value: 10
Time Limit: 3.00s
Memory Limit: 16M
Added: Apr 17, 2009

Languages Allowed:
C++03, PAS, C, HASK, ASM, RUBY, PYTH2, JAVA, PHP, SCM, CAML, PERL, C#, C++11, PYTH3

Comments (Search)

On the "spam" component on the last test case only. Can anyone suggest a reason why? Thanks,

The lengths of messages will not exceed ~5000 characters. In addition, 1 <= s, n, c <= 5.

... how the cosine measure works?

Let (x0, x1, x2, x3, ...) be the trigraph frequencies for X, and (y0, y1, y2, y3, ...) be the trigraph frequencies for Y. Then, the similarity between X and Y is given by
(x0*y0 + x1*y1 + x2*y2 + x3*y3 + ...) / (sqrt(x0^2 + x1^2 + x2^2 + x3^2 + ...) sqrt(y0^2 + y1^2 + y2^2 + y3^2 + ...))

Essentially, if X and Y are vectors to the points (x0, x1, x2, x3, ...) and (y0, y1, y2, y3, ...) in N-dimensional space, then the cosine measure gives the cosine of the angle between these two vectors. Since the cosine is monotone decreasing on [0,pi], the higher the cosine measure, the smaller the angle (that is, the closer are the vectors X and Y). It also follows that the cosine measure has an absolute value less than or equal to one.