Research Article | Open Access
Volume 74 | Issue 4 | Year 2026 | Article Id. IJETT-V74I4P111 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I4P111
A Pre-processing Tool for Management of Aerospace OOV Words for Building a Bilingual Corpus of English to Assamese
Pratul Kalita, Saptarshi Paul, Bimal Kumar Kalita, Saurav Paul
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 03 Jun 2025 | 22 Jan 2026 | 19 Feb 2026 | 29 Apr 2026 |
Citation:
Pratul Kalita, Saptarshi Paul, Bimal Kumar Kalita, Saurav Paul, "A Pre-processing Tool for Management of Aerospace OOV Words for Building a Bilingual Corpus of English to Assamese," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 4, pp. 144-154, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I4P111
Abstract
Out-of-vocabulary (OOV) terminology and structured English phrases that are exclusive to the aerospace industry are central to communication in this field, and translating them is difficult. Because airlines use social media such as Facebook and Twitter as a significant tool for communication and advertising, these aerospace-related terms now also appear in everyday reading material such as newspapers. To make it easier for Machine Translation (MT) systems to translate these structured phrases and OOV terms, such as TFC and LEO, a tool is needed that can recognize them and substitute the appropriate meaningful English words. For accurate translation, the MT systems must then be trained on aerospace parallel corpora (English paired with the target language). The lack of such a corpus is a further gap that must be addressed. A tool is therefore needed that can both feed pre-processed text directly to MT tools and assist in building a parallel corpus.
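The recognize-and-substitute step described above can be illustrated with a minimal Python sketch. It is not the authors' tool: the glossary `GLOSSARY`, the function names `build_pattern` and `expand_oov`, and the expansions given for TFC and LEO are all assumptions chosen for illustration. The sketch matches whole-word, case-sensitive abbreviations and replaces each with a plain English phrase before the text is passed to an MT system.

```python
import re

# Hypothetical glossary. The expansions are illustrative assumptions,
# not entries taken from the article's tool.
GLOSSARY = {
    "TFC": "traffic",
    "LEO": "low Earth orbit",
}


def build_pattern(glossary):
    """Compile one regex that matches any glossary term as a whole word."""
    # Longest terms first, so a longer entry wins over a shorter prefix.
    terms = sorted(glossary, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternation + r")\b")


def expand_oov(text, glossary=GLOSSARY):
    """Return (expanded_text, found_terms) for one English sentence."""
    pattern = build_pattern(glossary)
    found = []

    def replace(match):
        term = match.group(1)
        found.append(term)      # record which OOV terms were recognized
        return glossary[term]   # substitute the plain English phrase

    return pattern.sub(replace, text), found


if __name__ == "__main__":
    sentence = "Heavy TFC reported; satellite placed in LEO."
    expanded, found = expand_oov(sentence)
    print(expanded)
    print(found)
```

The list of recognized terms is returned alongside the expanded sentence so that the same call could serve both uses named in the abstract: the expanded sentence can go straight to an MT system, and the original and expanded sentences can be stored as the English side of a parallel corpus entry.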
Keywords
Aerospace, OOV Terms, Machine Translation, Corpus.