Date Approved


Embargo Period


Document Type


Degree Name

Master of Science (M.S.)


Computer Science


College of Science & Mathematics


Anthony Breitzman, Ph.D.

Committee Member 1

Shen-Shyang Ho, Ph.D.

Committee Member 2

Bo Sun, Ph.D.


Document similarity; Information Retrieval; Machine Learning; Natural Language Processing


Artificial intelligence


Computer Sciences | Physical Sciences and Mathematics


Document similarity, a core theme in Information Retrieval (IR), is a machine learning (ML) task associated with natural language processing (NLP). It measures the distance between two documents under a given set of rules. For the purposes of this thesis, two documents are similar if they are semantically alike and describe similar concepts. While document similarity can be applied to many tasks, we focus on how accurately models detect referenced papers as similar documents using their sub max similarity. Several approaches have been used to determine the similarity of documents in the context of literature reviews. Some of these approaches use the number of similar citations, the similarity between the bodies of text, and the figures present in the documents. We hypothesized that documents with sections of high similarity (sub max) but low global similarity are prone to being overlooked by existing models, because those models use the global score of the documents to measure similarity. In this study, we aim to detect, measure, and show the similarity of documents based on the maximum similarity of their subsections. The sub max of any two given documents is the pair of subsections, one from each document, with the highest similarity. By comparing subsections of the documents in our corpus and using the sub max, we were able to improve the performance of some models by over 100%.
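To make the distinction between a global score and the sub max concrete, the following is a minimal sketch and not the method or models evaluated in the thesis. It assumes each document is already split into subsections (a list of strings) and uses cosine similarity over raw word counts as a stand-in for whatever representation a real model would produce. The function names `bag_of_words`, `cosine`, `global_similarity`, and `sub_max_similarity`, and the two toy documents, are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts (assumption: a simple stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def global_similarity(doc_a, doc_b):
    """Score the two documents as wholes by joining their subsections."""
    return cosine(bag_of_words(" ".join(doc_a)), bag_of_words(" ".join(doc_b)))


def sub_max_similarity(doc_a, doc_b):
    """Return (score, i, j) for the most similar pair of subsections.

    i indexes a subsection of doc_a and j a subsection of doc_b.
    """
    best = (0.0, None, None)
    for i, section_a in enumerate(doc_a):
        vec_a = bag_of_words(section_a)
        for j, section_b in enumerate(doc_b):
            score = cosine(vec_a, bag_of_words(section_b))
            if score > best[0]:
                best = (score, i, j)
    return best


# Toy documents (hypothetical): one shared subsection, otherwise unrelated.
doc_a = [
    "neural networks learn image features",
    "cosine similarity compares document vectors",
]
doc_b = [
    "cosine similarity compares document vectors",
    "river ecology and fish migration patterns",
]

global_score = global_similarity(doc_a, doc_b)
sub_max_score, idx_a, idx_b = sub_max_similarity(doc_a, doc_b)
print(f"global: {global_score:.3f}")
print(f"sub max: {sub_max_score:.3f} (doc_a[{idx_a}] vs doc_b[{idx_b}])")
```

In this toy case the shared subsection drives the sub max well above the global score, which is the situation the hypothesis describes: a pair of documents that a global measure would rank low even though one section of each is nearly identical. Comparing every pair of subsections costs time proportional to the product of the two section counts, so a real pipeline would likely need to cache section vectors.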