International Journal of Engineering Trends and Technology

Research Article | Open Access
Volume 74 | Issue 4 | Year 2026 | Article Id. IJETT-V74I4P111 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I4P111

A Pre-processing Tool for Management of Aerospace OOV Words for Building a Bilingual Corpus of English to Assamese


Pratul Kalita, Saptarshi Paul, Bimal Kumar Kalita, Saurav Paul

Received: 03 Jun 2025 | Revised: 22 Jan 2026 | Accepted: 19 Feb 2026 | Published: 29 Apr 2026

Citation:

Pratul Kalita, Saptarshi Paul, Bimal Kumar Kalita, Saurav Paul, "A Pre-processing Tool for Management of Aerospace OOV Words for Building a Bilingual Corpus of English to Assamese," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 4, pp. 144-154, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I4P111

Abstract

Out-of-vocabulary (OOV) terminology and structured English phrases specific to the aerospace industry are central to communication in this field, and translating them is difficult. Because airlines now use social media such as Facebook and Twitter as a major channel for communication and advertising, these aerospace terms also appear in everyday reading material such as newspapers. To help Machine Translation (MT) systems translate these structured phrases and OOV terms, such as TFC and LEO, a tool is urgently needed that recognizes them and substitutes the appropriate meaningful English words. For accurate translation, MT systems must then be trained on aerospace parallel corpora (English paired with the target language). The lack of such a corpus is another gap that must be addressed. There is therefore strong demand for a tool that both prepares text for direct input to MT systems and assists in building a parallel corpus.
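The recognize-and-substitute step described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the glossary entries, the function names `expand_oov` and `corpus_row`, and the `translate` callable are all hypothetical, and the expansions given for TFC and LEO are common readings of those abbreviations rather than values taken from the paper.

```python
import re
from typing import Callable, Tuple

# Hypothetical glossary: these expansions are illustrative assumptions,
# not the lookup table used by the authors.
GLOSSARY = {
    "LEO": "low Earth orbit",
    "TFC": "traffic",
}

# Match whole, case-sensitive tokens only, so "LEO" inside "GALILEO"
# or the name "Leo" is left untouched.
_TERM = re.compile(r"\b(" + "|".join(map(re.escape, GLOSSARY)) + r")\b")


def expand_oov(sentence: str) -> str:
    """Replace each known OOV term with its plain-English expansion."""
    return _TERM.sub(lambda m: GLOSSARY[m.group(1)], sentence)


def corpus_row(sentence: str, translate: Callable[[str], str]) -> Tuple[str, str]:
    """Return one (English, target-language) pair for a parallel corpus.

    `translate` is supplied by the caller (for example a wrapper around
    an MT system); it is an assumption of this sketch.
    """
    expanded = expand_oov(sentence)
    return expanded, translate(expanded)


if __name__ == "__main__":
    print(expand_oov("The satellite entered LEO despite heavy TFC."))
```

The same expanded sentence serves both purposes named in the abstract: it can be fed directly to an MT system, and it can be stored alongside its translation as one row of a parallel corpus.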

Keywords

Aerospace, OOV Terms, Machine Translation, Corpus.
