IndoXLNet: Pre-Trained Language Model for Bahasa Indonesia

International Journal of Engineering Trends and Technology (IJETT)
  
© 2022 by IJETT Journal
Volume-70 Issue-5
Year of Publication : 2022
Authors : Thiffany Pratama, Suharjito
DOI :  10.14445/22315381/IJETT-V70I5P240

Citation 

MLA Style: Thiffany Pratama, and Suharjito. "IndoXLNet: Pre-Trained Language Model for Bahasa Indonesia." International Journal of Engineering Trends and Technology, vol. 70, no. 5, May. 2022, pp. 367-381. Crossref, https://doi.org/10.14445/22315381/IJETT-V70I5P240

APA Style: Thiffany Pratama, & Suharjito. (2022). IndoXLNet: Pre-Trained Language Model for Bahasa Indonesia. International Journal of Engineering Trends and Technology, 70(5), 367-381. https://doi.org/10.14445/22315381/IJETT-V70I5P240

Abstract
BERT has been widely adopted to create pre-trained models in various languages, one of which is IndoBERT, a BERT-based pre-trained model for Bahasa Indonesia. However, BERT still has limitations: it neglects the masked tokens' positions and introduces a discrepancy between the pre-training and fine-tuning processes. XLNet has been shown to overcome these limitations by combining autoregressive language modeling with autoencoding methods. To date, however, no pre-trained XLNet model has been developed for Bahasa Indonesia. This research therefore aims to create a pre-trained XLNet model specifically for Bahasa Indonesia, called IndoXLNet. The model can be used to solve Natural Language Processing problems in Bahasa Indonesia, such as sentiment analysis and named-entity recognition. IndoXLNet is trained on Bahasa Indonesia corpus datasets so that it captures the context of word representations in Bahasa Indonesia better than IndoBERT. Evaluated on various Natural Language Processing tasks in the IndoNLU benchmark, IndoXLNet improves the average F1-score over IndoBERT by 3.06% with an equivalent architecture.
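
The abstract does not spell out how XLNet combines the autoregressive and autoencoding views. As a rough illustration only, the toy sketch below shows the general idea of permutation-based factorization described in the XLNet literature: each token is predicted from the tokens that precede it in a sampled factorization order, so context can come from both sides of the sentence without masking the input. The function name `permutation_contexts`, the example sentence, and the chosen order are hypothetical and are not taken from the IndoXLNet training code.

```python
def permutation_contexts(tokens, order):
    """Return, for each position, the tokens visible when predicting it.

    `order` is a permutation of positions (a factorization order).
    A position may only see positions that come earlier in `order`.
    Illustrative helper, not part of any library.
    """
    # rank[p] = index of position p inside the factorization order
    rank = {pos: i for i, pos in enumerate(order)}
    contexts = {}
    for pos in range(len(tokens)):
        visible = [p for p in range(len(tokens)) if rank[p] < rank[pos]]
        # Report visible tokens in their original sentence order
        contexts[pos] = [tokens[p] for p in sorted(visible)]
    return contexts


# Hypothetical Bahasa Indonesia sentence: "saya suka makan nasi"
tokens = ["saya", "suka", "makan", "nasi"]
# Assumed factorization order: predict position 2, then 0, then 3, then 1
order = [2, 0, 3, 1]

ctx = permutation_contexts(tokens, order)
for pos, visible in ctx.items():
    print(f"predict {tokens[pos]!r} given {visible}")
```

In this order, "suka" (position 1) is predicted last and therefore sees context on both its left ("saya") and its right ("makan", "nasi"), while still being predicted autoregressively. In practice, many orders are sampled during pre-training; this sketch fixes one for clarity.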

Keywords
Bahasa Indonesia, BERT, Natural Language Processing, Pre-trained Model, XLNet.
