This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Tibetan-LLaMA 2: Large Language Model for Tibetan
Citations: 0
Authors: 4
Year: 2025
Abstract
Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability across a wide range of natural language tasks. However, current LLMs are concentrated mainly in resource-rich languages, such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to endow LLaMA with the ability to understand and generate Tibetan text and to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the limited availability of Tibetan data. Additionally, we expand LLaMA’s vocabulary by incorporating Tibetan tokens obtained through Unigram tokenization, thereby improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly enhances LLaMA’s proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
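The abstract notes that adding Tibetan tokens to LLaMA's vocabulary improves encoding efficiency: without Tibetan pieces, the tokenizer falls back to UTF-8 bytes, inflating sequence length. The following toy sketch illustrates this effect; it uses a simple greedy longest-match tokenizer with byte fallback for illustration only (the paper trains actual subword pieces with the Unigram algorithm, which is not reproduced here), and the example vocabulary entries are invented for the demonstration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown characters fall
    back to UTF-8 byte tokens, as LLaMA's tokenizer does."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:          # longest piece in the vocab wins
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no vocabulary piece matches: emit one token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Base vocabulary with no Tibetan pieces vs. one expanded with
# (hypothetical) Tibetan tokens, as in the paper's vocabulary expansion.
base_vocab = {" "}
expanded_vocab = base_vocab | {"བོད", "སྐད", "ཡིག"}

text = "བོད སྐད"  # Tibetan characters need 3 UTF-8 bytes each
print(len(tokenize(text, base_vocab)))      # many byte-fallback tokens
print(len(tokenize(text, expanded_vocab)))  # 3 tokens after expansion
```

Each Tibetan character (U+0F00–U+0FFF) occupies three UTF-8 bytes, so the unexpanded vocabulary produces roughly a 6x longer token sequence for this snippet, which is the encoding-efficiency gap the vocabulary expansion closes.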