Date Approved

11-23-2023

Embargo Period

11-27-2023

Document Type

Thesis

Degree Name

Master of Science (M.S.)

Department

Computer Science

College

College of Science & Mathematics

Advisor

Anthony Breitzman, Ph.D.

Committee Member 1

Shen-Shyang Ho, Ph.D.

Committee Member 2

Bo Sun, Ph.D.

Keywords

Document similarity; Information Retrieval; Machine Learning; Natural Language Processing

Subject(s)

Artificial intelligence

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Abstract

Document similarity, a core theme in Information Retrieval (IR), is a machine learning (ML) task associated with natural language processing (NLP). It is a measure of the distance between two documents given a set of rules. For the purpose of this thesis, two documents are similar if they are semantically alike and describe similar concepts. While document similarity can be applied to multiple tasks, we focus our work on the accuracy of models in detecting referenced papers as similar documents using their sub max similarity. Multiple approaches have been used to determine the similarity of documents with regard to literature reviews. Some of these approaches use the number of similar citations, the similarity between the bodies of text, and the figures present in those documents. We hypothesized that documents with sections of high similarity (sub max) but a low global similarity are prone to being overlooked by existing models, because the global score of the documents is used to measure similarity. In this study, we aim to detect, measure, and show the similarity of documents based on the maximum similarity of their subsections. The sub max of any two given documents is the pair of subsections, one from each document, with the highest similarity. By comparing subsections of the documents in our corpus and using the sub max, we were able to improve the performance of some models by over 100%.
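The sub max idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: subsections are given as plain strings, similarity is cosine similarity over simple bag-of-words counts (the thesis evaluates its own set of models), and the names `bag_of_words`, `cosine`, `global_similarity`, and `sub_max_similarity` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import product


def bag_of_words(text):
    """Lowercased term-frequency vector (an assumed, deliberately simple representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def global_similarity(doc_a, doc_b):
    """Similarity of the two documents taken as a whole (all subsections joined)."""
    return cosine(bag_of_words(" ".join(doc_a)), bag_of_words(" ".join(doc_b)))


def sub_max_similarity(doc_a, doc_b):
    """Return (score, i, j): the highest similarity over all subsection pairs,
    with the indices of the winning subsections. Both documents must be non-empty."""
    return max(
        (cosine(bag_of_words(a), bag_of_words(b)), i, j)
        for (i, a), (j, b) in product(enumerate(doc_a), enumerate(doc_b))
    )


# Hypothetical documents, each given as a list of subsection strings.
doc_a = ["cats chase mice in the barn", "stock prices fell sharply today"]
doc_b = ["stock prices fell sharply today", "rain is expected over the weekend"]

score, i, j = sub_max_similarity(doc_a, doc_b)
print(f"global similarity: {global_similarity(doc_a, doc_b):.3f}")
print(f"sub max similarity: {score:.3f} (subsection {i} of A vs subsection {j} of B)")
```

In this toy case the two documents share one near-identical subsection, so the sub max score is high even though the global score is diluted by the unrelated text. That is the situation the abstract describes as prone to being overlooked when only a global score is used.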
