Date Approved
11-23-2023
Embargo Period
11-27-2023
Document Type
Thesis
Degree Name
Master of Science (M.S.)
Department
Computer Science
College
College of Science & Mathematics
Advisor
Anthony Breitzman, Ph.D.
Committee Member 1
Shen-Shyang Ho, Ph.D.
Committee Member 2
Bo Sun, Ph.D.
Keywords
Document similarity; Information Retrieval; Machine Learning; Natural Language Processing
Subject(s)
Artificial intelligence
Disciplines
Computer Sciences | Physical Sciences and Mathematics
Abstract
Document similarity, a core theme in Information Retrieval (IR), is a machine learning (ML) task associated with natural language processing (NLP). It is a measure of the distance between two documents given a set of rules. For the purpose of this thesis, two documents are similar if they are semantically alike and describe similar concepts. While document similarity can be applied to multiple tasks, we focus our work on the accuracy of models in detecting referenced papers as similar documents using their sub max similarity. Multiple approaches have been used to determine the similarity of documents for literature reviews. Some of these approaches use the number of similar citations, the similarity between the bodies of text, and the figures present in those documents. We hypothesized that documents with sections of high similarity (sub max) but a low global similarity are prone to being overlooked by existing models, because the global score of the documents is used to measure similarity. In this study, we aim to detect, measure, and show the similarity of documents based on the maximum similarity of their subsections. The sub max of any two given documents is the pair of subsections of those documents with the highest similarity. By comparing subsections of the documents in our corpus and using the sub max, we were able to improve the performance of some models by over 100%.
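The sub max idea described in the abstract can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes each document is already split into a list of subsection strings, and it uses a plain bag-of-words cosine similarity as a stand-in for whatever similarity model is actually evaluated. The function names (`bag_of_words`, `cosine`, `global_similarity`, `sub_max`) are hypothetical and chosen only for this illustration.

```python
import math
import re
from collections import Counter
from itertools import product


def bag_of_words(text):
    """Lowercased word counts; an assumed stand-in for a real similarity model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def global_similarity(doc_a, doc_b):
    """Similarity of the two documents, each taken as one whole text."""
    return cosine(bag_of_words(" ".join(doc_a)), bag_of_words(" ".join(doc_b)))


def sub_max(doc_a, doc_b):
    """Return (score, section_a, section_b) for the most similar subsection pair.

    Both documents must contain at least one subsection.
    """
    return max(
        (
            (cosine(bag_of_words(a), bag_of_words(b)), a, b)
            for a, b in product(doc_a, doc_b)
        ),
        key=lambda triple: triple[0],
    )


# Hypothetical documents: one closely related subsection, otherwise unrelated.
doc_a = [
    "Cats sleep most of the day.",
    "Gradient descent minimizes a loss function.",
]
doc_b = [
    "Gradient descent minimizes a loss function iteratively.",
    "The stock market fell sharply on Monday.",
    "Rainfall totals were above average this spring.",
]

score, section_a, section_b = sub_max(doc_a, doc_b)
print("global:", round(global_similarity(doc_a, doc_b), 3))
print("sub max:", round(score, 3), "|", section_a, "|", section_b)
```

In this toy case the two documents share a single closely related subsection, so the sub max score is well above the global score, which is the situation the abstract hypothesizes that global-score models tend to overlook.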
Recommended Citation
Igbiriki, Richard Imorobebh, "Enhancing Inter-Document Similarity Using Sub Max" (2023). Theses and Dissertations. 3170.
https://rdw.rowan.edu/etd/3170