This is an overview page with metadata for this scientific article. The full article is available from the publisher.
An Experimental Assessment of AI-Based Legal Decision-Making Systems in Contract Analysis and Risk Detection
Citations: 3
Authors: 4
Year: 2025
Abstract
This experimental study evaluates the performance, reliability, and practical applicability of AI-based legal decision-making systems in contract analysis and risk detection. Using a corpus of 5,247 contracts with expert-validated annotations from 12 legal professionals, we benchmark four classes of AI systems: rule-based, supervised machine learning (XGBoost), fine-tuned transformer models (Legal-BERT), and large language models (GPT-4, Claude 3). Each is assessed across multiple dimensions critical to legal practice. Key findings:
a) Performance variability: fine-tuned Legal-BERT achieved the highest overall clause-classification F1-score (0.923, 95% CI [0.917, 0.929]) but degraded significantly in cross-jurisdictional applications (a 28.4% performance drop from US to UK contracts).
b) Risk-detection gaps: all systems showed decreasing recall as risk severity increased. GPT-4 missed 18.2% of high-severity risks (severity ≥ 4), while Legal-BERT missed 12.3% of total risk-severity weight (FNRP metric).
c) Decision inconsistency: LLMs were substantially inconsistent; GPT-4 achieved only 0.81 intra-model Jaccard similarity across identical inputs and varied its decision on 14.7% of identical clause phrasings.
d) Domain-specific performance: rule-based systems performed adequately on standardized agreements (NDA: F1 = 0.812) but failed catastrophically on complex contracts (M&A: F1 = 0.432).
e) Cost-effectiveness: local fine-tuned models delivered 92.3% of GPT-4's performance at 3.5% of the cost ($0.0087 vs. $0.2478 per document).
We introduce two novel legal-specific metrics, the False-Negative Risk Penalty (FNRP) and the Severity-Weighted F1 (SwF1), that better capture the asymmetric cost structure of legal errors. Based on our empirical findings, we propose a three-tier human-in-the-loop deployment framework that reduces attorney review time by 64% while maintaining 99.7% risk coverage.
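The cost-effectiveness figure in finding e) can be checked directly from the two per-document prices quoted in the abstract:

```python
# Per-document costs as quoted in the abstract (USD).
local_cost = 0.0087   # local fine-tuned model
gpt4_cost = 0.2478    # GPT-4

cost_ratio = local_cost / gpt4_cost
print(f"{cost_ratio:.1%}")  # → 3.5%
```

This confirms the stated 3.5% cost ratio.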
The study establishes evidence-based performance thresholds for safe deployment, recommending against autonomous use of any system with FNRP > 0.15 or cross-jurisdiction performance degradation > 25%. Our findings challenge optimistic claims of AI autonomy in legal decision-making and provide a rigorous, reproducible framework for evaluating legal AI systems in practice.
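The abstract characterizes FNRP only informally, as the share of total risk-severity weight carried by risks a system fails to flag (e.g., Legal-BERT missing 12.3% of total severity weight). A minimal sketch under that reading, with hypothetical risk IDs and severities; the exact formula is in the full paper:

```python
def fnrp(risks, flagged):
    """False-Negative Risk Penalty (informal reading of the abstract):
    fraction of total risk-severity weight belonging to risks that the
    system failed to flag."""
    total = sum(sev for _, sev in risks)
    missed = sum(sev for rid, sev in risks if rid not in flagged)
    return missed / total if total else 0.0

# Hypothetical example: three annotated risks with severities 5, 4, 1;
# the system flags only the first two.
risks = [("r1", 5), ("r2", 4), ("r3", 1)]
score = fnrp(risks, flagged={"r1", "r2"})
print(score)          # → 0.1
print(score > 0.15)   # → False: below the proposed autonomy threshold
```

Under the paper's proposed deployment rule, a system with FNRP above 0.15 would not be recommended for autonomous use.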