Date Approved

6-16-2026

Embargo Period

6-16-2026

Document Type

Thesis

Degree Name

M.S. Computer Science

Department

Computer Science

College

College of Science & Mathematics

Advisor

Shen-Shyang Ho, Ph.D.

Committee Member 1

Vasil Hnatyshin, Ph.D.

Committee Member 2

Charalampos (Babis) Papachristou, Ph.D.

Keywords

Imitation Learning; Inverse Reinforcement Learning; Multi-Agent Reinforcement Learning; Multi-Agent Systems; Policy Optimization; Reinforcement Learning

Abstract

Increasingly, automated agents operating in the physical world are expected to perform previously unseen tasks at levels comparable to human experts. Solutions to these challenges typically fall along a spectrum between model-based and model-free reinforcement learning (RL) approaches. Encouraging expert-like behavior in reinforcement learning is often difficult because reward functions can be challenging to specify accurately. Inverse reinforcement learning addresses this problem by inferring reward functions directly from expert demonstrations rather than relying on manually engineered objectives. This thesis examines the performance of both on-policy and off-policy model-free reinforcement learning algorithms, including Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG), within a cooperative environment. These approaches are compared against Multi-Agent Deep Deterministic Policy Gradient (MADDPG) and Multi-Agent Adversarial Inverse Reinforcement Learning (MA-AIRL). The work presents an empirical study based on a smart building and electric vehicle charging station operating under both cooperative and competitive conditions. Results demonstrate that MA-AIRL can recover reward functions that correlate strongly with expert behavior in high-dimensional cooperative environments, while also identifying conditions under which reward recovery remains tractable and circumstances in which it degrades. Furthermore, the findings suggest that when an agent applies a learned cooperative policy within an environment that does not support cooperative behavior, the resulting actions may be interpreted as a form of “coping mechanism”: constrained cooperation arising from environmental mismatch rather than a failure of optimization.
