Dr. Yiu-Kai Dennis Ng, Department of Computer Science
Evaluation of the Academic Objectives
In the 2007 MEG proposal, I specified the following Internet problems to be solved using a Fuzzy set information retrieval (IR) model: (i) detecting plagiarism, which is the act of using another’s words or ideas as one’s own, (ii) filtering junk emails, which waste valuable resources and time, often include offensive content, and impose a monetary cost of billions of dollars per year that is ultimately paid by public users, and (iii) identifying spam Web pages, whose contents are useless to Web users. We have developed a solution to each of the proposed problems, and this claim is supported by the articles published in academic journals and conference proceedings (see Section 4 for the list of published work). In addition, Maria Soledad (Sole) Pera, one of my former M.S. students and my current Ph.D. student, has been actively involved in this MEG project and has assisted me throughout the past 2 1⁄2 years in this research work and its publications. Sole has gained valuable experience working on the various research problems involved in this mentoring project, successfully solved the problems, and published a number of articles that are the results of the research work conducted in this funded project.
Evaluation of the Mentoring Environment
Starting from May 1, 2007, Sole Pera and I have been meeting on a weekly basis to (i) discuss the progress on each individual project, (ii) review potential solutions to various technical problems encountered, and (iii) set different milestones and time frames for individual projects. In addition, I have reviewed and edited technical reports prepared by Sole for publication.
Participating Students and Academic Outcomes
Since May 1, 2007, Sole has been working with me on the MEG project. During the past 2 1⁄2 years, Sole was involved in designing and implementing the solutions to various problems in detecting plagiarism and spam emails/documents. In addition, she has co-authored a number of peer-reviewed journal and conference articles with me, which are by-products of the funded project. During the process, Sole has become a more mature researcher and a better technical writer. She continues to improve her problem-solving, analytical, and writing skills, which were developed as a result of this mentoring project and should be treated as a significant outcome of the project.
The following journal and conference articles were written and published based on the research results of the funded research project:
- Maria Soledad Pera and Yiu-Kai Ng, A Structure, Content-Similarity Measure for Detecting Spam Documents on the Web. To appear in the International Journal of Web Information Systems (IJWIS), Volume 5, Issue 4, December 2009, Emerald Group Publishing Ltd.
- Maria Soledad Pera and Yiu-Kai Ng, SpamED: A Spam Email Detection Approach Based on Phrase Similarity. Journal of the American Society for Information Science and Technology (JASIST), Volume 60, Issue 2, pp. 393-409, February 2009, Wiley.
- Nathaniel Gustafson, Maria Soledad Pera, and Yiu-Kai Ng, Nowhere to Hide: Finding Plagiarized Documents Based on Sentence Similarity. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI’08), pp. 690-696, December 9-12, 2008, Sydney, Australia.
- Maria Soledad Pera and Yiu-Kai Ng, Identifying Spam Web Pages Based on Content Similarity. In Proceedings of the 2008 International Conference on Computational Science and Its Applications (ICCSA 2008), pp. 204-219, LNCS 5073, Springer, June 30-July 3, 2008, Perugia, Italy.
- Maria Soledad Pera and Yiu-Kai Ng, Using Word Similarity to Eradicate Junk Emails. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007), pp. 943-946, November 6-8, 2007, Lisboa, Portugal.
Sole really enjoys problem solving and the challenges of finding solutions to various research problems at different levels of complexity. The work experience obtained as a result of the mentored research project has further motivated her to pursue more challenging research problems in her area of study and to reach another milestone, i.e., a Ph.D. degree in Computer Science. Upon graduating with an M.S. degree in Computer Science in April 2009, Sole was accepted into our Ph.D. degree program, starting in Fall 2009. I am glad to have another opportunity to mentor Sole for the next 4 years and plan to work with her on another ORAC MEG project.
Results and Findings
Sole Pera and I have been working closely together during the past 2 1⁄2 years in solving the problems addressed in the MEG proposal. The results and findings of our research project are summarized below.
- Plagiarism Detection. Sole Pera, Nathaniel Gustafson (an undergraduate student who was involved in the plagiarism detection project and was funded by my department as a research assistant on the joint research project), and I applied the word-correlation factors in the Fuzzy set information retrieval model to detect similar documents, which include documents that are plagiarized. Our plagiarism-detection approach (i) establishes the degree of resemblance between any two documents D1 and D2 based on their sentence-to-sentence similarity, computed by using pre-defined word-correlation factors, and (ii) generates a graphical view of sentences that are similar (or the same) in D1 and D2. Experimental results have verified that our plagiarism-detection approach is highly accurate in detecting (non-)plagiarized documents and outperforms existing plagiarism-detection approaches.
- Filtering Junk Emails. We rely on the content similarity of emails to eradicate junk emails. Our junk-email filtering approach compares each incoming email to a core of emails marked as junk by each individual user, using the similarity of phrases in emails to identify unwanted emails while reducing the number of misclassified legitimate emails. The experiments we conducted have verified that our filtering approach, which uses trigrams in emails, not only minimizes errors in junk-email detection but also, with a 96% accuracy rate, performs better than a number of existing email filtering approaches.
- Discovering Spam Web Pages. We have developed a novel method for identifying spam Web pages that include mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, our spam detection tool is (i) reliable, since it accurately detects 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for document content analysis are pre-computed. We have verified that our spam-detection method outperforms existing anti-spam methods by an average of 10% in terms of minimizing the errors in spam Web document detection.
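To make the sentence-to-sentence comparison in the plagiarism-detection work more concrete, the following Python fragment is a minimal sketch of one common fuzzy-set formulation. This report does not spell out the formulas, so the sketch assumes that the degree to which a word belongs to a sentence is 1 minus the product of (1 - correlation) over the words of that sentence, that sentence similarity is the average of those degrees, and that two sentences match when the similarity reaches a threshold. The correlation table, the function names, and the 0.7 threshold are hypothetical and chosen only for illustration; the actual word-correlation factors are pre-computed.

```python
from math import prod

# Hypothetical, hand-picked word-correlation factors in [0, 1].
# The real factors are pre-computed; these values are illustrative only.
CORRELATION = {
    ("car", "automobile"): 0.9,
    ("fast", "quick"): 0.8,
}


def correlation(w1, w2):
    """Return the (symmetric) correlation factor of two words."""
    if w1 == w2:
        return 1.0
    return CORRELATION.get((w1, w2), CORRELATION.get((w2, w1), 0.0))


def word_sentence_degree(word, sentence):
    """Assumed fuzzy degree to which `word` belongs to `sentence`."""
    return 1.0 - prod(1.0 - correlation(word, other) for other in sentence)


def sentence_similarity(s1, s2):
    """Average degree to which the words of s1 belong to s2."""
    if not s1:
        return 0.0
    return sum(word_sentence_degree(w, s2) for w in s1) / len(s1)


def document_resemblance(d1, d2, threshold=0.7):
    """Fraction of sentences in d1 that have a similar sentence in d2."""
    if not d1:
        return 0.0
    matched = sum(
        1 for s in d1
        if any(sentence_similarity(s, t) >= threshold for t in d2)
    )
    return matched / len(d1)


if __name__ == "__main__":
    d1 = [["fast", "car"], ["blue", "sky"]]
    d2 = [["quick", "automobile"]]
    print(sentence_similarity(d1[0], d2[0]))
    print(document_resemblance(d1, d2))
```

Note that the similarity in this sketch is not symmetric: it measures how well the first sentence is covered by the second. Because the correlation factors are looked up rather than computed at comparison time, the per-pair cost stays low, which is consistent with the remark above that pre-computed factors keep the analysis inexpensive.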
Summary of Expenses
Given below is a summary of the expenses (through September 30, 2009), which shows how the research funds were used:
Total Amount Funded: $15,001
Student Wages: $11,471.44
Student Travel: $1,747.50
Remaining Balance: $1,782.06
The student wages were used for paying Sole Pera’s salary, whereas the student travel expenses covered Sole Pera’s ACM CIKM conference trip to Lisboa, Portugal in November 2007. The remaining balance will be spent on Sole Pera’s student salary for the rest of the calendar year while we complete the last technical report on the funded project.