This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Tibetan-LLaMA 2: Large Language Model for Tibetan
Citations: 0
Authors: 4
Year: 2025
Abstract
Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability across a wide range of natural language tasks. However, current LLMs are concentrated mainly in resource-rich languages, such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to endow LLaMA with the ability to understand and generate Tibetan text and to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the limited availability of Tibetan data. Additionally, we expand LLaMA’s vocabulary by incorporating Tibetan tokens obtained through Unigram tokenization, thereby improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly enhances LLaMA’s proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
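The abstract notes that adding Tibetan tokens to LLaMA's vocabulary improves encoding efficiency: without Tibetan pieces, the tokenizer falls back to UTF-8 bytes, inflating sequence length. The following toy sketch illustrates this effect; it uses a simple greedy longest-match tokenizer with byte fallback for illustration only (the paper trains actual subword pieces with the Unigram algorithm, which is not reproduced here), and the example vocabulary entries are invented for the demonstration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown characters fall
    back to UTF-8 byte tokens, as LLaMA's tokenizer does."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:          # longest piece in the vocab wins
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no vocabulary piece matches: emit one token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Base vocabulary with no Tibetan pieces vs. one expanded with
# (hypothetical) Tibetan tokens, as in the paper's vocabulary expansion.
base_vocab = {" "}
expanded_vocab = base_vocab | {"བོད", "སྐད", "ཡིག"}

text = "བོད སྐད"  # Tibetan characters need 3 UTF-8 bytes each
print(len(tokenize(text, base_vocab)))      # many byte-fallback tokens
print(len(tokenize(text, expanded_vocab)))  # 3 tokens after expansion
```

Each Tibetan character (U+0F00–U+0FFF) occupies three UTF-8 bytes, so the unexpanded vocabulary produces roughly a 6x longer token sequence for this snippet, which is the encoding-efficiency gap the vocabulary expansion closes.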