Designing a Phoneme-Level Emotion Conversion Framework for Continuous Hindi Speech


Archana Agarwal, Dr. Vipan Kumari

Abstract

Speech is not merely a carrier of linguistic information; it is a rich medium that conveys emotional, social, and psychological cues. Emotion plays a critical role in human communication, influencing interpretation, response, and interpersonal dynamics. In recent years, advances in speech processing and artificial intelligence have enabled machines to analyze, synthesize, and manipulate speech signals with increasing sophistication. However, while significant progress has been made in English and other high-resource languages, emotion conversion for continuous Hindi speech remains relatively underexplored. Most existing systems operate at the sentence or word level and often fail to capture the fine-grained phonetic and prosodic variations essential for natural emotional transformation. This study proposes a phoneme-level emotion conversion framework for continuous Hindi speech that aims to enhance naturalness, intelligibility, and emotional expressiveness.


Emotion conversion refers to the transformation of a speech signal from one emotional state to another while preserving the linguistic content and speaker identity. Traditional approaches rely on prosodic manipulation at the utterance or word level, including pitch scaling, duration modification, and energy transformation. However, emotional cues are often embedded at a finer granularity within phonemes and sub-phonemic acoustic units. Hindi, as an Indo-Aryan language with a rich phonemic inventory including aspirated and unaspirated consonants, retroflex sounds, nasalization, and vowel length contrasts, presents unique challenges for emotion modeling. Emotional variations may manifest differently across phoneme categories, especially in voiced consonants and long vowels, making phoneme-level modeling particularly relevant.
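The utterance-level prosodic manipulations mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the synthetic signal, and the gain/stretch factors are illustrative assumptions, and a real system would use PSOLA or a phase vocoder rather than plain resampling, which alters pitch along with duration.

```python
import numpy as np

def modify_energy(signal, gain):
    """Energy transformation: scale the waveform by a constant gain."""
    return signal * gain

def modify_duration(signal, stretch):
    """Duration modification by linear-interpolation resampling.
    NOTE: naive resampling also shifts pitch; production systems use
    PSOLA or a phase vocoder to change duration independently."""
    n_out = int(round(len(signal) * stretch))
    x_old = np.linspace(0.0, 1.0, num=len(signal))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, signal)

sr = 16000
t = np.arange(sr) / sr
neutral = 0.3 * np.sin(2 * np.pi * 220.0 * t)          # 1 s synthetic "neutral" tone
excited = modify_energy(modify_duration(neutral, 0.8), 1.5)  # shorter and louder
```

Operating at this coarse granularity is exactly what phoneme-level modeling refines: the same gain and stretch would instead be predicted per phoneme rather than applied uniformly across the utterance.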


The proposed framework introduces a multi-stage architecture consisting of speech preprocessing, phoneme segmentation, acoustic feature extraction, emotion embedding modeling, phoneme-level emotion mapping, and waveform reconstruction. The system utilizes forced alignment techniques for accurate phoneme boundary detection in continuous speech. Acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs), pitch contour, formant frequencies, spectral tilt, energy envelope, and temporal duration are extracted at the phoneme level. A deep neural network-based mapping model is trained to learn transformations between neutral and target emotional states, including happiness, sadness, anger, and fear. The framework integrates speaker-independent emotion embeddings to ensure emotional transformation without altering speaker identity.
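The per-phoneme feature extraction stage can be sketched as follows, assuming forced alignment has already produced phoneme boundaries in seconds and frame-level features (e.g. MFCC rows) are available at a fixed hop. The boundary tuple format, hop size, and function name are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def phoneme_features(frames, boundaries, hop_s=0.010):
    """Aggregate frame-level acoustic features into one mean vector
    plus a duration per phoneme, using forced-alignment boundaries.
    `boundaries` is assumed to be a list of (phone, start_s, end_s)."""
    feats = []
    for phone, start, end in boundaries:
        i0 = int(start / hop_s)
        i1 = max(int(end / hop_s), i0 + 1)   # at least one frame per phoneme
        seg = frames[i0:i1]
        feats.append((phone, seg.mean(axis=0), end - start))
    return feats

# toy frame matrix: 100 frames x 13 "MFCC" coefficients
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))
alignment = [("n", 0.00, 0.25), ("a", 0.25, 0.60), ("m", 0.60, 1.00)]
per_phone = phoneme_features(frames, alignment)
```

In the proposed framework, vectors like these (augmented with pitch, formants, spectral tilt, and energy) would form the input to the neutral-to-emotional mapping network, with the emotion embedding conditioning the transformation.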


A curated Hindi emotional speech corpus was developed for experimental validation, containing balanced emotional utterances recorded by male and female speakers across diverse age groups. Both objective and subjective evaluation methods were employed. Objective measures include Mel Cepstral Distortion (MCD), F0 Root Mean Square Error (RMSE), and Signal-to-Noise Ratio (SNR) comparisons. Subjective evaluation involved Mean Opinion Score (MOS) tests conducted with native Hindi listeners to assess naturalness, emotional accuracy, and intelligibility. Results indicate that phoneme-level transformation significantly improves emotional clarity and naturalness compared to conventional word-level approaches. The system demonstrated improved emotional recognition rates in listening tests, with statistically significant differences observed across emotional categories.
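The two objective measures named above have standard definitions that can be computed directly. The sketch below assumes time-aligned mel-cepstra (with the 0th coefficient already excluded) and F0 contours where unvoiced frames are marked as 0; it uses the conventional 10/ln(10) MCD scaling in dB.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Frame-averaged Mel Cepstral Distortion (dB) between two aligned
    mel-cepstral sequences of shape (frames, coefficients)."""
    diff = mc_ref - mc_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse(f0_ref, f0_conv):
    """F0 RMSE (Hz), computed only on frames voiced in both contours
    (voicing is assumed to be encoded as F0 > 0)."""
    voiced = (f0_ref > 0) & (f0_conv > 0)
    err = f0_ref[voiced] - f0_conv[voiced]
    return float(np.sqrt(np.mean(err ** 2)))
```

Lower values on both metrics indicate that the converted speech is spectrally and prosodically closer to the target emotional reference, complementing the MOS listening tests.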


This research contributes to the field of speech emotion processing by introducing a novel phoneme-centric approach tailored to Hindi continuous speech. The findings highlight the importance of fine-grained acoustic modeling in emotion conversion systems and open new possibilities for applications in expressive text-to-speech systems, voice assistants, speech therapy, dubbing, gaming, virtual reality, and human-computer interaction in multilingual contexts. The study also lays the groundwork for future research in cross-lingual emotion transfer and real-time phoneme-level emotion adaptation systems.

Article Details

How to Cite
Archana Agarwal, Dr. Vipan Kumari. (2024). Designing a Phoneme-Level Emotion Conversion Framework for Continuous Hindi Speech. International Journal of Advanced Research and Multidisciplinary Trends (IJARMT), 1(2), 714–724. Retrieved from https://ijarmt.com/index.php/j/article/view/763

