🕵 ReMoDetect: Reward Models
Recognize Aligned LLM's Generations

Hyunseok Lee* 1,   Jihoon Tack* 1,   Jinwoo Shin1

1 Korea Advanced Institute of Science and Technology
* Equal Contribution


[Paper]            [Code]            [Twitter]

Abstract

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that since these aligned LLMs are trained to maximize human preferences, they generate texts with higher estimated preferences than even human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model: (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even more strongly, and (ii) reward modeling of Human/LLM mixed texts (human-written texts partially rephrased by aligned LLMs), which serve as a median-preference text corpus between LGTs and human-written texts and help the model learn a better decision boundary. We provide an extensive evaluation across six text domains and twelve aligned LLMs, where our method demonstrates state-of-the-art results.

🔍 Observation: Reward Models Recognize Aligned LLM's Generations

Observation: As aligned LLMs are optimized to maximize human preferences, they generate texts with higher predicted rewards than even human-written texts. We visualize (i) a t-SNE plot of the reward model's final features and (ii) a histogram of the predicted reward scores. Here, 'Machine' indicates text generated by GPT-3.5/GPT-4 Turbo, Llama3-70B Instruct, and Claude 3 Opus. Based on this, one can easily distinguish LLM-generated texts from human-written texts by simply using the reward model's predicted score as the detection criterion, e.g., an AUROC of 92.8% when detecting GPT-4-generated texts.
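In other words, the detection criterion is just the reward model's scalar output. Below is a minimal sketch of this observation, assuming an off-the-shelf scalar-output reward model from the Hugging Face Hub; the checkpoint name and toy texts are illustrative choices, not necessarily what the paper uses.

```python
# Sketch: score texts with a reward model and use the scalar reward as the
# detection criterion. Any sequence-classification-style reward model with a
# single scalar head would work; the checkpoint below is an assumed example.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import roc_auc_score

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def reward_score(text: str) -> float:
    """Return the scalar reward predicted for a single text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return model(**inputs).logits.squeeze().item()

# Toy example: human-written vs. LLM-generated texts (label 1 = machine).
human_texts = ["A short paragraph written by a person ..."]
machine_texts = ["A short paragraph sampled from an aligned LLM ..."]

scores = [reward_score(t) for t in human_texts + machine_texts]
labels = [0] * len(human_texts) + [1] * len(machine_texts)
print("AUROC:", roc_auc_score(labels, scores))
```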

🕵 ReMoDetect: Detecting LLM’s Generations using Reward Models

We propose ReMoDetect, a novel and effective aligned-LGT detection framework based on the reward model. In a nutshell, ReMoDetect comprises two training components that improve the reward model's detection ability. First, to further widen the gap in predicted reward between LGTs and human-written texts, we continually fine-tune the reward model to predict even higher reward scores for LGTs than for human-written texts, while preventing overfitting bias with a replay technique. Second, we generate an additional preference dataset for reward model fine-tuning, namely Human/LLM mixed texts: human-written texts partially rephrased by an LLM. These texts serve as a median-preference corpus between the human-written and LGT corpora, enabling the detector to learn a better decision boundary (a training-step sketch follows below).
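The following PyTorch sketch illustrates one training step. It assumes the reward model exposes a callable that returns one scalar reward per text, and uses a Bradley-Terry-style pairwise ranking loss to enforce the ordering LGT > mixed > human alongside replayed preference pairs; the exact loss terms and the replay coefficient are illustrative assumptions, not the paper's precise objective.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_higher: torch.Tensor, r_lower: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry-style loss encouraging r_higher > r_lower."""
    return -F.logsigmoid(r_higher - r_lower).mean()

def remodetect_step(reward_model, batch, replay_coef: float = 1.0) -> torch.Tensor:
    # Scalar rewards for the three detection-text types (assumed callable:
    # list of texts -> tensor of scalar rewards).
    r_lgt = reward_model(batch["lgt"])      # aligned-LLM generations
    r_mix = reward_model(batch["mixed"])    # human text partially rephrased by an LLM
    r_human = reward_model(batch["human"])  # human-written text

    # (i) Continual preference fine-tuning with (ii) mixed texts as a
    # median-preference corpus: enforce the ordering LGT > mixed > human.
    detect_loss = (ranking_loss(r_lgt, r_mix)
                   + ranking_loss(r_mix, r_human)
                   + ranking_loss(r_lgt, r_human))

    # Replay of original preference pairs (chosen vs. rejected responses) so the
    # model keeps its pretrained preference behavior instead of overfitting.
    r_chosen = reward_model(batch["replay_chosen"])
    r_rejected = reward_model(batch["replay_rejected"])
    replay_loss = ranking_loss(r_chosen, r_rejected)

    return detect_loss + replay_coef * replay_loss
```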

📊 Experimental Results

LGT detection performance: We present the LGT detection performance of ReMoDetect and other detection baselines. Overall, ReMoDetect outperforms prior detection methods by a large margin, achieving state-of-the-art average AUROC.


Comparison with a commercial detection method: We also compare ReMoDetect with a commercial LGT detection method, GPTZero, under the Fast-DetectGPT benchmark. As shown in the table above, ReMoDetect outperforms GPTZero in average AUROC on all but one of the considered aligned LLMs.


Robustness against rephrasing attacks: One challenging scenario is detecting texts that have been rephrased by another LLM (known as a rephrasing attack), i.e., texts first generated by a powerful LLM and later modified by another LLM. As shown in the table above, ReMoDetect significantly and consistently outperforms all baselines.
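For intuition, the attack can be simulated as below. The `paraphraser` callable and its prompt are hypothetical stand-ins for any rephrasing LLM and are not part of ReMoDetect itself; detection still relies on the (fine-tuned) reward score from the sketches above.

```python
from sklearn.metrics import roc_auc_score

def rephrase(text: str, paraphraser) -> str:
    """Simulate the attack: ask another LLM to rewrite a machine-generated text."""
    prompt = f"Paraphrase the following text while preserving its meaning:\n\n{text}"
    return paraphraser(prompt)  # any text-in, text-out LLM callable (hypothetical)

def evaluate_under_attack(detector_score, human_texts, machine_texts, paraphraser):
    """AUROC of the detector when machine texts are rephrased before detection."""
    attacked = [rephrase(t, paraphraser) for t in machine_texts]
    scores = [detector_score(t) for t in human_texts + attacked]
    labels = [0] * len(human_texts) + [1] * len(attacked)
    return roc_auc_score(labels, scores)
```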


Robustness to input response length: We also measure the robustness of ReMoDetect to the input response length (i.e., the number of words in the response y). Interestingly, our method can outperform the best baseline even with 71.4% fewer words, showing strong robustness on short input responses.

Citation

@article{lee2024remodetect,
     title={ReMoDetect: Reward Models Recognize Aligned LLM's Generations},
     author={Lee, Hyunseok and Tack, Jihoon and Shin, Jinwoo},
     journal={arXiv preprint arXiv:2405.17382},
     year={2024},
}