This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Training Generative Judge with Hard Negative Mining: A Metric Learning Perspective
Citations: 0
Authors: 1
Year: 2025
Abstract
Prior SFT-based LLM-as-a-Judge approaches suffer from a distributional mismatch between GPT-4 and smaller open-source language models (SLMs), often leading to overfitting. On-policy methods, such as Group Relative Policy Optimization (GRPO), attempt to address this limitation by training models on their own outputs. However, conventional GRPO typically relies on randomly sampled data, often selecting negative samples with low model-assigned probabilities; such samples provide weak gradient signals and thus limit learning efficacy. To overcome this, we first establish a theoretical connection by proving that GRPO is equivalent to the margin ranking loss commonly used in metric learning. Building on this, we introduce the first approach to optimize GRPO through hard negative mining, a well-established technique in metric learning. We define hard negatives within the context of LLM-as-a-Judge and propose a simple yet effective strategy for identifying them. We evaluate our method on three out-of-domain datasets and one in-domain dataset. Compared to SFT-based methods, our approach demonstrates superior scalability, which is becoming increasingly important for training on modern, large-scale datasets. Specifically, while SFT-based methods struggle to improve performance when scaling from 10% to 100% of the training data, our method exhibits consistent improvements across this range. This suggests that current SFT methods may not be well-suited for training generative judges. Furthermore, empirical results show that our approach consistently outperforms larger open-source baselines.
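To make the metric-learning connection in the abstract concrete, here is a minimal, illustrative sketch (not the paper's implementation) of a margin ranking loss combined with hard negative mining: among the sampled negative responses, only the one the model itself scores highest (the "hard" negative) contributes to the loss, since it lies closest to the positive and yields the strongest gradient signal. The function name, scalar scores, and margin value are assumptions for illustration.

```python
def margin_ranking_loss_hard_negative(pos_score, neg_scores, margin=1.0):
    """Margin ranking loss over one positive and several negatives,
    keeping only the hardest negative (illustrative sketch)."""
    # Hard negative: the negative the model assigns the highest score,
    # i.e. the one hardest to separate from the positive.
    hard_neg = max(neg_scores)
    # Hinge-style margin loss: zero once the positive beats the hard
    # negative by at least `margin`; positive (nonzero) otherwise.
    return max(0.0, margin - (pos_score - hard_neg))
```

An easy negative (score 0.1 vs. a positive at 2.0) already satisfies the margin and contributes nothing, whereas a hard negative at 1.5 still incurs a loss of 0.5, which is the gradient-signal asymmetry the abstract describes.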