Methods, Languages, and Challenges in Speech Recognition Technology: A Systematic Literature Review
DOI:
https://doi.org/10.32877/bt.v7i2.1888
Keywords:
Deep learning, Machine learning, Speech recognition, PRISMA, SLR
Abstract
With advances in artificial intelligence (AI) and machine learning, speech recognition technology continues to evolve. This study conducts a systematic literature review (SLR) of speech recognition to examine the methods used, the languages tested, and the obstacles and challenges commonly encountered. A total of 2,400 articles published between 2020 and 2024 were collected from two electronic databases, Scopus and Semantic Scholar. After duplicates were removed and the inclusion and exclusion criteria and a quality-assessment stage were applied, 32 articles remained to answer three research questions formulated using the PICOC framework. The results identify 25 methods, of which CNN is the most frequently discussed, appearing in 6 articles. Of the 28 languages found, English is the most frequently tested, appearing in 7 articles. In addition, 23 kinds of challenges and obstacles were identified; the most common, reported in 17 articles, is the scarcity of language resources, because some languages are the sole official language of a single country and others are nearly extinct, so little data is publicly available. Noise also hampers this line of research, and achieving high accuracy in speech recognition requires large amounts of training data. This SLR identifies trends and methods that have proven effective and can be applied in IoT devices, smartphone applications, and cloud services.
References
J. Meng, J. Zhang, and H. Zhao, “Overview of the Speech Recognition Technology,” in 2012 Fourth International Conference on Computational and Information Sciences, Chongqing, China: IEEE, Aug. 2012, pp. 199-202. doi: 10.1109/ICCIS.2012.202.
B. H. Juang and L. R. Rabiner, “Automatic Speech Recognition-A Brief History of the Technology Development”.
A. Shenoy, S. Bodapati, and K. Kirchhoff, “ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling,” in Proceedings of The 4th Workshop on e-Commerce and NLP, Online: Association for Computational Linguistics, 2021, pp. 18-25. doi: 10.18653/v1/2021.ecnlp-1.3.
J. Noyes and C. Frankish, “Speech recognition technology for individuals with disabilities,” Augmentative and Alternative Communication, vol. 8, no. 4, pp. 297-303, Jan. 1992, doi: 10.1080/07434619212331276333.
Z. Leini and S. Xiaolei, “Study on Speech Recognition Method of Artificial Intelligence Deep Learning,” J. Phys.: Conf. Ser., vol. 1754, no. 1, p. 012183, Feb. 2021, doi: 10.1088/1742-6596/1754/1/012183.
N. K. Dennis, “Using AI-Powered Speech Recognition Technology to Improve English Pronunciation and Speaking Skills,” ije, vol. 12, no. 2, pp. 107-126, Aug. 2024, doi: 10.22492/ije.12.2.05.
D. Gough, J. Thomas, and S. Oliver, “An introduction to systematic reviews,” 2017.
A. Dhouib, A. Othman, O. El Ghoul, M. K. Khribi, and A. Al Sinani, “Arabic Automatic Speech Recognition: A Systematic Literature Review,” Applied Sciences, vol. 12, no. 17, p. 8898, Sep. 2022, doi: 10.3390/app12178898.
A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech Recognition Using Deep Neural Networks: A Systematic Review,” IEEE Access, vol. 7, pp. 19143-19165, 2019, doi: 10.1109/ACCESS.2019.2896880.
V. Bhardwaj et al., “Automatic Speech Recognition (ASR) Systems for Children: A Systematic Literature Review,” Applied Sciences, vol. 12, no. 9, p. 4419, Apr. 2022, doi: 10.3390/app12094419.
A. Booth, A. Sutton, and D. Papaioannou, Systematic approaches to a successful literature review, Second edition. Los Angeles: Sage, 2016.
M. Bruzza, A. Cabrera, and M. Tupia, “Survey of the state of art based on PICOC about the use of artificial intelligence tools and expert systems to manage and generate tourist packages,” in 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), Dubai: IEEE, Dec. 2017, pp. 290-296. doi: 10.1109/ICTUS.2017.8286021.
W. Mengist, T. Soromessa, and G. Legese, “Method for conducting systematic literature review and meta-analysis for environmental science research,” MethodsX, vol. 7, p. 100777, 2020, doi: 10.1016/j.mex.2019.100777.
M. J. Page et al., “The PRISMA 2020 statement: An updated guideline for reporting systematic reviews,” PLoS Med, vol. 18, no. 3, p. e1003583, Mar. 2021, doi: 10.1371/journal.pmed.1003583.
Y. Harie, B. P. Gautam, and K. Wasaki, “Computer Vision Techniques for Growth Prediction: A Prisma-Based Systematic Literature Review,” Applied Sciences, vol. 13, no. 9, p. 5335, Apr. 2023, doi: 10.3390/app13095335.
D. Jiang et al., “A GDPR-compliant Ecosystem for Speech Recognition with Transfer, Federated, and Evolutionary Learning,” ACM Trans. Intell. Syst. Technol., vol. 12, no. 3, pp. 1-19, Jun. 2021, doi: 10.1145/3447687.
I. Quintanilha, S. Netto, and L. Biscainho, “An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora,” JCIS, vol. 35, no. 1, pp. 230-242, 2020, doi: 10.14209/jcis.2020.25.
H. Karunathilaka, V. Welgama, T. Nadungodage, and R. Weerasinghe, “Low-resource Sinhala Speech Recognition using Deep Learning,” in 2020 20th International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka: IEEE, Nov. 2020, pp. 196-201. doi: 10.1109/ICTer51097.2020.9325468.
K. Choutri, M. Lagha, S. Meshoul, M. Batouche, Y. Kacel, and N. Mebarkia, “A Multi-Lingual Speech Recognition-Based Framework to Human-Drone Interaction,” Electronics, vol. 11, no. 12, p. 1829, Jun. 2022, doi: 10.3390/electronics11121829.
Q. H. Nguyen and T.-D. Cao, “A Novel Method for Recognizing Vietnamese Voice Commands on Smartphones with Support Vector Machine and Convolutional Neural Networks,” Wireless Communications and Mobile Computing, vol. 2020, pp. 1-9, Mar. 2020, doi: 10.1155/2020/2312908.
M. Dhakal, A. Chhetri, A. K. Gupta, P. Lamichhane, S. Pandey, and S. Shakya, “Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet,” in 2022 International Conference on Inventive Computation Technologies (ICICT), Nepal: IEEE, Jul. 2022, pp. 515-521. doi: 10.1109/ICICT54344.2022.9850832.
F. R. Jr. Arnel Fajardo, “Convolutional Neural Network for Automatic Speech Recognition of Filipino Language,” IJATCSE, vol. 9, no. 1.1 S I, pp. 34-40, Feb. 2020, doi: 10.30534/ijatcse/2020/0791.12020.
M. Dawodi, J. A. Baktash, T. Wada, N. Alam, and M. Z. Joya, “Dari Speech Classification Using Deep Convolutional Neural Network,” in 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), Vancouver, BC, Canada: IEEE, Sep. 2020, pp. 1-4. doi: 10.1109/IEMTRONICS51293.2020.9216370.
R. Jimerson, R. Ptucha, and E. Prud’hommeaux, “Fully Convolutional ASR for Less-Resourced Endangered Languages”.
R. Jain, A. Barcovschi, M. Y. Yiwere, D. Bigioi, P. Corcoran, and H. Cucu, “A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition,” IEEE Access, vol. 11, pp. 46938-46948, 2023, doi: 10.1109/ACCESS.2023.3275106.
L. R. S. Gris, E. Casanova, F. S. de Oliveira, A. da S. Soares, and A. C. Junior, “Brazilian Portuguese Speech Recognition Using Wav2vec 2.0,” Dec. 22, 2021, arXiv: arXiv:2107.11414. doi: 10.48550/arXiv.2107.11414.
W. Phatthiyaphaibun, C. Chaksangchaichot, P. Limkonchotiwat, E. Chuangsuwanich, and S. Nutanong, “Thai Wav2Vec2.0 with CommonVoice V8,” Aug. 09, 2022, arXiv: arXiv:2208.04799. doi: 10.48550/arXiv.2208.04799.
K. D. N, P. Wang, and B. Bozza, “Using Large Self-Supervised Models for Low-Resource Speech Recognition,” in Interspeech 2021, ISCA, Aug. 2021, pp. 2436-2440. doi: 10.21437/Interspeech.2021-631.
H. A. Alsayadi, A. A. Abdelhamid, I. Hegazy, and Z. T. Fayed, “Arabic speech recognition using end-to-end deep learning,” IET Signal Processing, vol. 15, no. 8, pp. 521-534, Oct. 2021, doi: 10.1049/sil2.12057.
H. Alsayadi, A. Abdelhamid, I. Hegazy, and Z. Taha, “Data Augmentation for Arabic Speech Recognition Based on End-to-End Deep Learning,” IJICIS, vol. 21, no. 2, pp. 50-64, Jul. 2021, doi: 10.21608/ijicis.2021.73581.1086.
K. D. N, “Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer,” Sep. 10, 2021, arXiv: arXiv:2109.03969. doi: 10.48550/arXiv.2109.03969.
H. Veisi and A. Haji Mani, “Persian speech recognition using deep learning,” Int J Speech Technol, vol. 23, no. 4, pp. 893-905, Dec. 2020, doi: 10.1007/s10772-020-09768-x.
A. Mukhamadiyev, I. Khujayarov, O. Djuraev, and J. Cho, “Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language,” Sensors, vol. 22, no. 10, p. 3683, May 2022, doi: 10.3390/s22103683.
P. Wang and H. Van Hamme, “Benefits of pre-trained mono- and cross-lingual speech representations for spoken language understanding of Dutch dysarthric speech,” J AUDIO SPEECH MUSIC PROC., vol. 2023, no. 1, p. 15, Apr. 2023, doi: 10.1186/s13636-023-00280-z.
P. Dubey and B. Shah, “Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents”.
S. Suyanto, A. Arifianto, A. Sirwan, and A. P. Rizaendra, “End-to-End Speech Recognition Models for a Low-Resourced Indonesian Language,” in 2020 8th International Conference on Information and Communication Technology (ICoICT), Yogyakarta, Indonesia: IEEE, Jun. 2020, pp. 1-6. doi: 10.1109/ICoICT49345.2020.9166346.
C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig, “Improving RNN Transducer Based ASR with Auxiliary Tasks,” Nov. 09, 2020, arXiv: arXiv:2011.03109. doi: 10.48550/arXiv.2011.03109.
Y. Gao, T. Parcollet, and N. Lane, “Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition,” Jul. 04, 2021, arXiv: arXiv:2005.09310. doi: 10.48550/arXiv.2005.09310.
L. Zhang et al., “End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture,” Sensors, vol. 20, no. 7, p. 1809, Mar. 2020, doi: 10.3390/s20071809.
O. Mamyrbayev, K. Alimhan, D. Oralbekova, A. Bekarystankyzy, and B. Zhumazhanov, “Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level,” EEJET, vol. 1, no. 9(115), pp. 84-92, Feb. 2022, doi: 10.15587/1729-4061.2022.252801.
C. Wang, J. Pino, and J. Gu, “Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation,” Oct. 09, 2020, arXiv: arXiv:2006.05474. doi: 10.48550/arXiv.2006.05474.
S. Guillaume et al., “Fine-tuning pre-trained models for Automatic Speech Recognition, experiments on a fieldwork corpus of Japhug (Trans-Himalayan family),” in Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 170-178. doi: 10.18653/v1/2022.computel-1.21.
Y. Tang, J. Pino, C. Wang, X. Ma, and D. Genzel, “A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada: IEEE, Jun. 2021, pp. 6209-6213. doi: 10.1109/ICASSP39728.2021.9415058.
Z. Song, “English speech recognition based on deep learning with multiple features,” Computing, vol. 102, no. 3, pp. 663-682, Mar. 2020, doi: 10.1007/s00607-019-00753-0.
T. Alam, A. Khan, and F. Alam, “Punctuation Restoration using Transformer Models for High-and Low-Resource Languages,” in Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Online: Association for Computational Linguistics, 2020, pp. 132-142. doi: 10.18653/v1/2020.wnut-1.18.
S. Wang, “Recognition of English speech-using a deep learning algorithm,” Journal of Intelligent Systems, vol. 32, no. 1, p. 20220236, Feb. 2023, doi: 10.1515/jisys-2022-0236.
J. Wang, “Speech Recognition of Oral English Teaching Based on Deep Belief Network,” Int. J. Emerg. Technol. Learn., vol. 15, no. 10, p. 100, Jun. 2020, doi: 10.3991/ijet.v15i10.14041.
License
Copyright (c) 2024 bit-Tech
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
I hereby assign and transfer to bit-Tech all exclusive copyright ownership rights to the above work. This includes, but is not limited to, the right to publish, republish, downgrade, distribute, transmit, sell, or use the work and other related materials worldwide, in whole, or in part, in all languages, in electronic, printed, or any other form of media, now known or hereafter developed and reserves the right to permit or license a third party to do any of the above. I understand that this exclusive right will belong to bit-Tech from the date the article is accepted for publication. I also understand that bit-Tech, as the copyright owner, has sole authority to license and permit reproduction of the article. I understand that, except for copyright, any other proprietary rights associated with the work (e.g. patents or other rights to any process or procedure) must be retained by the author. In addition, I understand that bit-Tech permits authors to use their papers in any way permitted by the applied Creative Commons license.