This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Training Generative Judge with Hard Negative Mining: A Metric Learning Perspective
Citations: 0
Authors: 1
Year: 2025
Abstract
Prior SFT-based LLM-as-a-Judge approaches suffer from a distributional mismatch between GPT-4 and smaller open-source language models (SLMs), often leading to overfitting. On-policy methods, such as Group Relative Policy Optimization (GRPO), attempt to address this limitation by training models on their own outputs. However, conventional GRPO typically relies on randomly sampled data, often selecting negative samples with low model-assigned probabilities; such samples provide weak gradient signals and thus limit learning efficacy. To overcome this, we first establish a theoretical connection by proving that GRPO is equivalent to the margin ranking loss commonly used in metric learning. Building on this, we introduce the first approach to optimize GRPO through hard negative mining, a well-established technique in metric learning. We define hard negatives within the context of LLM-as-a-Judge and propose a simple yet effective strategy for identifying them. We evaluate our method on three out-of-domain datasets and one in-domain dataset. Compared to SFT-based methods, our approach demonstrates superior scalability, which is becoming increasingly important for training on modern, large-scale datasets. Specifically, while SFT-based methods struggle to improve performance when scaling from 10% to 100% of the training data, our method exhibits consistent improvements across this range. This suggests that current SFT methods may not be well-suited for training generative judges. Furthermore, empirical results show that our approach consistently outperforms larger open-source baselines.
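To make the metric-learning connection in the abstract concrete, here is a minimal, illustrative sketch (not the paper's implementation) of a margin ranking loss combined with hard negative mining: among the sampled negative responses, only the one the model itself scores highest (the "hard" negative) contributes to the loss, since it lies closest to the positive and yields the strongest gradient signal. The function name, scalar scores, and margin value are assumptions for illustration.

```python
def margin_ranking_loss_hard_negative(pos_score, neg_scores, margin=1.0):
    """Margin ranking loss over one positive and several negatives,
    keeping only the hardest negative (illustrative sketch)."""
    # Hard negative: the negative the model assigns the highest score,
    # i.e. the one hardest to separate from the positive.
    hard_neg = max(neg_scores)
    # Hinge-style margin loss: zero once the positive beats the hard
    # negative by at least `margin`; positive (nonzero) otherwise.
    return max(0.0, margin - (pos_score - hard_neg))
```

An easy negative (score 0.1 vs. a positive at 2.0) already satisfies the margin and contributes nothing, whereas a hard negative at 1.5 still incurs a loss of 0.5, which is the gradient-signal asymmetry the abstract describes.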