Classify with k-nearest neighbors

What is the challenge for this part?

In this part of the challenge, thousands of comments stored in files have been labeled as deceptive or truthful and serve as train data. My mission is to classify the test data in the same way with an algorithm.

Use tfidf to identify keywords from the corpus

First, I must find a feature that helps me tell deceptive comments from truthful ones. For a comment, the feature might be the length of words, or the occurrence of certain keywords. After giving up on the length of words, I turn to the occurrence of keywords and write a Python script to find the most frequent words in the deceptive train data and in the truthful train data. But I run into a big problem as soon as the result comes out: many words appear in both the deceptive and the truthful lists, and some of them can hardly count as keywords at all, such as I, am, is, and so on. Words of this kind are called stopwords. Therefore, I need to do some preprocessing before I look for keywords.
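The frequency count described above can be sketched as follows. This is a minimal illustration, not the original script: the four toy comments and the helper name top_words are made up for the example.

```python
from collections import Counter

# Hypothetical toy comments standing in for the real train data.
deceptive = ["I am sure the hotel is great", "I am happy the room is great"]
truthful = ["I am told the hotel is old", "I think the room is small"]

def top_words(comments, n):
    """Return the n most frequent lowercase words across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(comment.lower().split())
    return [word for word, _ in counts.most_common(n)]

top_deceptive = top_words(deceptive, n=4)
top_truthful = top_words(truthful, n=4)

# Words that show up in both top lists cannot separate the two classes.
shared = set(top_deceptive) & set(top_truthful)
```

Even on this tiny sample, the shared set is filled with words like i, the and is, which is exactly the overlap problem.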

I would recommend some Python tools that make this processing convenient. To remove special characters, I use string.punctuation, a string constant that already contains characters such as !, ' and the comma. Then, to remove the stopwords, there is the stopwords corpus in nltk.corpus.

from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))

Now I have the set of stopwords, and it is easy to remove them all from my data. At this point, the preprocessing is complete.
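Put together, the two preprocessing steps might look like the sketch below. To keep it runnable without downloading the NLTK corpus, it uses a tiny hand-written stop word set as a stand-in; the helper name preprocess is also made up for this example.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"i", "am", "is", "the", "a", "and"}

def preprocess(comment, stop_words):
    """Lowercase a comment, strip punctuation, and drop stop words."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    words = comment.lower().translate(table).split()
    return [w for w in words if w not in stop_words]

tokens = preprocess("I am sure, the hotel is GREAT!", stop_words)
# tokens == ['sure', 'hotel', 'great']
```

Lowercasing first matters because the stop word list is lowercase, so "I" would otherwise slip through.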

However, is frequency alone enough to find the keywords? Take school comments as an example: the word school should have high frequency in both deceptive and truthful comments because it is the topic, and stopword removal cannot get rid of it either. Therefore, I switch to tfidf to find the keywords! I won't explain the definition of tfidf here, but I give a recommended reading source in the reference. Conveniently, a Python library can compute tfidf as well.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

corpus = ['I am apple', 'I am orange']  # ... and the rest of the comments

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
# term counts: one row per document, one column per word
x = vectorizer.fit_transform(corpus)
# use get_feature_names() on scikit-learn older than 1.0
keyword = vectorizer.get_feature_names_out()
# reweight the counts into tfidf values
tfidf = transformer.fit_transform(x)
tfidf = tfidf.toarray()

The result of tfidf is a two-dimensional array whose rows correspond to the documents in the corpus and whose columns correspond to the words in keyword. Each value in the array is the tfidf weight of that word in that document, not its raw frequency. Find the top value in a row, and the word in that column could be the best keyword for the corresponding document!
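To see why a topic word such as school drops out, here is a small plain-Python sketch of the idea. It uses the textbook weight tf * log(N / df), which is an assumption made for illustration: scikit-learn applies a smoothed idf and normalizes each row, so its numbers differ. The corpus and the helper name tfidf_rows are made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append({w: (c / len(tokens)) * math.log(n_docs / df[w])
                     for w, c in tf.items()})
    return rows

# Hypothetical corpus in which every document mentions the topic word.
corpus = ["school food is great", "school teachers are kind", "school bus is late"]
rows = tfidf_rows(corpus)

# The best keyword of each document is the word with the top weight.
top_keywords = [max(row, key=row.get) for row in rows]
```

Because school appears in every document, its weight is log(3 / 3) = 0 everywhere, so it can never be picked as a top keyword. With the array from the library, the same lookup per row would be along the lines of keyword[row.argmax()].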

Takeaway from this part:

  • Using tfidf helps find keywords in a corpus efficiently.

Use k-nearest neighbors to help classification

In the challenge, I am required to find the k nearest data points to classify my test data. There are two more problems I need to solve in this part:

  1. define the distance for nearest
  2. find the best k

For the first problem, I use Euclidean distance directly at first. But it turns out not to be a good fit here: in general, Euclidean distance is sensitive to the magnitude of the vectors and not only to their direction. So I change to cosine similarity to compare the tfidf vectors. With cosine similarity, the more closely two vectors are related, the closer the angle between them is to 0 and the closer the cosine is to 1. It is also easy to get the value with 1 - spatial.distance.cosine(a, b), where a and b are the two vectors and spatial comes from scipy!
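For readers who want to see what that value is, here is a plain-Python sketch of cosine similarity. It is not the scipy implementation; returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared non-zero component: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The first pair shows the point of the measure: doubling a vector changes its Euclidean distance to other points but leaves the cosine untouched.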

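For the second problem, I use cross validation to find a suitable k: hold out part of the train data, classify it with several candidate values of k, and keep the one with the best accuracy.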
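One possible way to wire this together is sketched below: a k-nearest-neighbor vote based on cosine similarity, scored with a simple interleaved fold split. The toy 2-D vectors, the fold scheme and all function names are assumptions made for illustration, not the code used in the challenge.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(train_x, train_y, query, k):
    """Majority label among the k train vectors most similar to query."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: cosine_similarity(train_x[i], query),
                    reverse=True)
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(xs, ys, k, n_folds):
    """Accuracy of k-NN when every fold is held out exactly once."""
    correct = 0
    for fold in range(n_folds):
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        kept = [i for i in range(len(xs)) if i % n_folds != fold]
        train_x = [xs[i] for i in kept]
        train_y = [ys[i] for i in kept]
        for i in held_out:
            if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
                correct += 1
    return correct / len(xs)

# Hypothetical 2-D stand-ins for tfidf vectors, with alternating labels.
xs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9],
      [1.0, 0.3], [0.3, 1.0], [0.8, 0.1], [0.1, 0.8]]
ys = ["deceptive", "truthful"] * 4

# Score each candidate k and keep the best one.
scores = {k: cross_val_accuracy(xs, ys, k, n_folds=4) for k in (1, 3, 5)}
chosen_k = max(scores, key=scores.get)
```

On this toy data k = 5 fails only because each training split holds just two points of the held-out class, so three of the five neighbors always come from the other class. With real data the scores change more gradually, and an odd k avoids tied votes between two classes.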

Takeaway from this part:

  • Distance is not the only way to measure the similarity between two data points, and Euclidean distance is not the only way to calculate a distance. The right choice depends on the features of the dataset.


I end up with an accuracy of 0.8 and learn something more from a person who got an accuracy of 0.9. In preprocessing, he removes stop words not only with the default list from the library, but also removes words that have high frequency across the comments (in other words, he maintains his own stop word list). I think I need to put more work into preprocessing.


The recommended article to read about tfidf and cosine similarity