|Year : 2013 | Volume : 3 | Issue : 2 | Page : 46-51|
Quantitative analysis of pathological female human voice by processing complete sentences recordings
Andrea Ancillao1, Manuela Galli2, Michele Mignano1, Rossella Dellavalle1, Giorgio Albertini1
1 Department of Paediatric Rehabilitation, IRCCS San Raffaele Pisana, Rome, Italy
2 Department of Paediatric Rehabilitation, IRCCS San Raffaele Pisana, Rome; Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
|Date of Web Publication||7-May-2014|
IRCCS San Raffaele Pisana, Via della Pisana 235, 00163, Rome
Source of Support: None, Conflict of Interest: None
| Abstract|| |
Objectives: The aim of this work was to characterize female voices from dysphonic individuals by computing quantitative parameters over recordings of complete spoken sentences. Healthy subjects were also recruited to compute reference values. Materials and Methods: A total of 15 female subjects diagnosed with dysphonia were enrolled, together with age-matched female controls. Each subject was asked to read aloud a text composed of three sentences. All subjects were native Italian speakers and the sentences were written in Italian. Voice was digitally recorded and each sentence was processed with Praat software to compute frequency parameters (Pitch, Jitter, Shimmer) and the harmonic to noise ratio. Parameters were then compared among the sentences spoken by the same subject and between the pathological and control groups. Results: The parameters were similar among the three sentences read by the same subject, while there were statistically significant differences between the pathological group and the control group. Conclusion: The quantitative analysis of voice, run over complete sentences, is therefore able to characterize pathological subjects and provides useful information that may support the diagnosis of dysphonia. Processing voices from a control group also made it possible to develop reference data for the female human voice. The values found can be taken as representative of the normal female voice and may be used as reference data for other studies on pathological voices.
Keywords: Acoustic measurements, dysphonia, phonetics, rehabilitation, speech evaluation
|How to cite this article:|
Ancillao A, Galli M, Mignano M, Dellavalle R, Albertini G. Quantitative analysis of pathological female human voice by processing complete sentences recordings. J Laryngol Voice 2013;3:46-51
|How to cite this URL:|
Ancillao A, Galli M, Mignano M, Dellavalle R, Albertini G. Quantitative analysis of pathological female human voice by processing complete sentences recordings. J Laryngol Voice [serial online] 2013 [cited 2021 Dec 3];3:46-51. Available from: https://www.laryngologyandvoice.org/text.asp?2013/3/2/46/132045
| Introduction|| |
According to Fant and Flanagan, voice or, more generally, the verbal signal is the result of filtering a signal produced by a source. Filtering is obtained by a complex regulatory mechanism that varies continuously with the volume of air flowing through it. In human anatomy, the filter is represented by the pharynx and the oral and nasal cavities, and the source is the vocal folds, which produce the glottic signal.
The glottic signal is regular, harmonic and periodic for healthy vocal folds, while it becomes aperiodic and irregular for a pathological vocal apparatus.
When the vocal folds or the vocal apparatus are impaired, voice emission loses its harmonic features and some noise may appear superimposed on the harmonic component. In this situation, the voice is defined as dysphonic.
Otolaryngologists and phoneticians very often have to deal with pathological voices and dysphonia, for which they need to establish the severity of the pathology. A method to quantitatively analyze the human voice is therefore extremely important to support the clinician in making a diagnosis, deciding on therapy, communicating with other operators and assessing the outcome of specific therapies.
Nowadays, the clinical evaluation of voice is performed by the clinician, generally by qualitative methods that rely only on auditory skills.
Some protocols for qualitative evaluation and qualitative rating scales have been proposed. Examples are the protocol of Hammarberg et al., Laver's protocol and the Buffalo Voice Profile System. Another rating scale, widely used today, is the grade, roughness, breathiness, asthenia and strain (GRBAS) scale, proposed by the Japanese Society of Phoniatrists and Logopedists.
Even though the GRBAS scale has demonstrated low variability between analyses performed by different operators and the ability to classify patients by severity of pathology, it remains based only on subjective perception.
A quantitative approach to voice analysis is therefore necessary to classify a patient objectively. Quantitative voice analysis provides useful information to the audiologist, supports the final diagnosis with objective data and is also useful for outcome evaluation and follow-up.
The quantitative approach works as follows. Voice travels through air as a pressure wave, which can be captured and converted into an electrical signal by a microphone. The electrical signal produced by the microphone (an analog signal) can then be sampled, converted to a digital signal and stored on a PC as a numeric sequence.
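The sampling and conversion step can be sketched in a few lines of Python. This is only an illustration, not the processing performed by the recorder used in the study: the function name, the 220 Hz test tone and the 16-bit depth are assumptions made for the example, while the 50 kHz rate is the one reported in Materials and Methods.

```python
import math

def sample_and_quantize(signal, duration_s, rate_hz, bits=16):
    """Sample a continuous-time function and round each sample to a signed integer."""
    full_scale = 2 ** (bits - 1) - 1          # e.g. 32767 for 16 bits
    n_samples = round(duration_s * rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / rate_hz                        # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t))) # clip to the converter range
        samples.append(round(value * full_scale))
    return samples

def tone(t):
    """Hypothetical stand-in for the microphone signal: a 220 Hz sine."""
    return 0.5 * math.sin(2 * math.pi * 220.0 * t)

pcm = sample_and_quantize(tone, duration_s=0.01, rate_hz=50_000)
print(len(pcm), "samples stored as integers")
```

The resulting list of integers is the "numeric sequence" on which all later processing operates.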
Once the voice is recorded as a numerical signal, the waveform can be processed in the time domain and in the frequency domain. In this way, the harmonic components of the signal can be studied, formants can be recognized and some quantitative parameters can be computed.
Some very important parameters are Jitter, Shimmer and the harmonic to noise ratio (HNR). These make it possible to describe quantitatively the properties and characteristics of the recorded voice. Jitter is expressed as a percentage and indicates short-term variations in the fundamental frequency of the recorded signal. Shimmer is expressed as a percentage and indicates short-term variations in the amplitude of the signal. HNR is the ratio of the strength of the harmonic component of the signal to that of the non-harmonic (noise) component.
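As a small worked example of the last definition, HNR is usually reported in decibels. The sketch below assumes that "strength" is measured as energy, so that the conversion is 10·log10 of the energy ratio; the function name is hypothetical.

```python
import math

def hnr_db(harmonic_energy, noise_energy):
    """Harmonic to noise ratio in dB, assuming both inputs are energies."""
    return 10.0 * math.log10(harmonic_energy / noise_energy)

# A signal with 99% of its energy in the harmonic part and 1% in noise
print(round(hnr_db(99.0, 1.0), 2))   # about 20 dB
# Equal harmonic and noise energy
print(hnr_db(1.0, 1.0))              # 0 dB
```

Under this convention a higher value means a cleaner, more periodic voice, and 0 dB means equal energy in the harmonic and noise parts.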
Jitter, Shimmer and HNR have proved to be related to the qualitative roughness of the voice. These quantitative parameters are therefore suitable for objective analysis of voice quality.
An objective investigation of the human voice was carried out by Dehqan et al. in 2010. They computed fundamental frequency (F0) (Hz), Jitter (%), Shimmer (%) and HNR (dB) of voices produced by healthy speakers. Subjects were asked to pronounce the vowels /a/ and /i/ for about 3 s and frequency analysis was performed on the recorded trials. This study showed statistically significant differences (P < 0.05) in the computed parameters between the male and female adult groups, whereas within the same gender there was no statistically significant difference (P > 0.05) and the parameters were stable for adults between 20 and 50 years of age.
A similar work was carried out by Brockmann et al. in 2011. They studied frequency, Jitter and Shimmer measurements on healthy voices pronouncing the vowels /a/, /o/ and /i/. This work demonstrated that men had a lower fundamental frequency and lower Jitter and Shimmer values than women. At the same time, men had a louder voice than women. There were therefore significant differences between the gender groups, while values were similar within the same group.
Brockmann et al. also provided detailed statistics showing that Jitter and Shimmer measurements were significantly influenced by the sound pressure level. They therefore recommended using the same recording procedure and signal amplification settings for every patient examined.
The works described above focused on the analysis of voice through simple vocal emissions. The next step was the analysis of the voice emitted during the pronunciation of a whole word (or a non-word formed by vowels and consonants), which better represents the real condition of phonation. A study of this kind was carried out by Albertini et al. in 2009. They recruited three groups of healthy subjects: men, women (mean age of men and women 56 ± 9 years) and children (mean age 9.4 ± 1.4 years). The voice of each subject was recorded while the subject spoke some words and non-words from the Italian language, such as "gioco" (play), "torta" (cake), "rife" (non-word), "drappo" (drape) and "lovaba" (non-word). Each word was processed separately and Fundamental Frequency, Jitter, Shimmer and Loudness were computed. This study provided normative data useful for the analysis of single words (but limited to the Italian language). Statistical analysis showed significant differences between the groups. Jitter was significantly higher in children than in women and men, while Shimmer was significantly higher in men than in women and children. Differences within each group were very small and non-significant. This work demonstrated that these parameters are able to classify a voice quantitatively. Furthermore, these parameters could be used to analyze a complete spoken word, so their application is not limited to single vocal emissions.
In another work, Albertini et al. studied the spectral characteristics of the voice in subjects with Down syndrome (DS). Voice was investigated by computing mean frequency, coefficient of variation (CV) of pitch, energy, duration, Jitter and Shimmer for some words spoken by the subjects. Results were compared with a control group.
This study showed that the voice of DS adults was characterized by a significantly higher mean frequency, particularly in males, by a smaller variation and by a significantly lower level of energy. Furthermore, a shorter duration and a smaller Shimmer were observed in male adults. Instead, the difference between DS children and the age-matched controls was limited, reaching significance only for the CV of pitch.
Pitch is an important parameter used in voice analysis that describes the frequency content and the dominant frequencies in the waveform. 
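To make the idea of a dominant frequency concrete, the sketch below estimates the fundamental frequency of a synthetic tone by finding the lag at which the signal best matches a shifted copy of itself (autocorrelation). It is a much-simplified illustration and not the algorithm used by the software in this study; the function name, the 75-600 Hz search range and the 200 Hz test tone are assumptions.

```python
import math

def estimate_f0(samples, rate_hz, f_min=75.0, f_max=600.0):
    """Return the frequency whose period maximizes the raw autocorrelation."""
    lag_min = int(rate_hz / f_max)               # shortest period searched
    lag_max = int(rate_hz / f_min)               # longest period searched
    best_lag, best_value = None, float("-inf")
    for lag in range(lag_min, min(lag_max, len(samples) - 1) + 1):
        # Sum of products between the signal and itself shifted by `lag`
        value = sum(a * b for a, b in zip(samples, samples[lag:]))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag is None:
        raise ValueError("signal too short for the requested pitch range")
    return rate_hz / best_lag

rate = 8000                                       # Hz, chosen to keep the demo small
tone = [math.sin(2 * math.pi * 200.0 * n / rate) for n in range(400)]
f0 = estimate_f0(tone, rate)
print(f0)
```

On real speech the estimate would be computed frame by frame, which is what yields the mean, minimum, maximum and SD of pitch over a sentence.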
Aims of the study
All the previous studies focused on the analysis of normal voices or voices of DS subjects, paying less attention to dysphonic voices. The aim of this work was to study pathological voices obtained from dysphonic individuals by computing quantitative parameters over recordings of complete spoken sentences. The pathological group was composed of native Italian-speaking female subjects who had been diagnosed with dysphonia.
Normative data for the analysis of complete sentences were necessary to make comparisons with the quantitative values obtained from pathological voices. Healthy subjects were therefore recruited to compute reference values.
| Materials and Methods|| |
All the subjects, pathological and control, were of Italian nationality and spoke Italian as their mother tongue.
The control group was composed of 20 healthy female volunteers. Their mean age was 38 years (range 26-52). None of them had undergone surgery, was a smoker or had any other problem that could alter the voice. Every control subject was evaluated with the GRBAS scale.
The GRBAS grade consists of the qualitative evaluation of five voice parameters: grade, roughness, breathiness, asthenia and strain. A sixth parameter, the Instability index, was also considered, as proposed by Dejonckere et al. For each parameter, the operator assigns a score from 0 to 3, where 0 represents a healthy voice and 1, 2 and 3 represent increasing severity of voice irregularity. The final GRBAS grade is the sum of the scores assigned to each parameter; the maximum value, corresponding to the maximum severity, is therefore 18.
Control subjects were excluded from the analysis if their GRBAS grades were ≥1. GRBAS <1 represents a clean healthy voice without any perceptible disorder.
In this study, 15 pathological subjects were included. Subjects were female adults (mean age 35 years, range 25-47), without any mental or physical disability, who had been diagnosed with voice problems such as roughness and dysphonia. The GRBAS grade was estimated for each subject. The inclusion criterion for pathological subjects was a GRBAS grade ≥10.
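The scoring and enrolment rules above can be summarized in a short sketch. The function and key names are hypothetical; the six parameters, the 0-3 score range and the two thresholds (<1 for controls, ≥10 for pathological subjects) are those stated in the text.

```python
PARAMETERS = ("grade", "roughness", "breathiness",
              "asthenia", "strain", "instability")

def grbas_total(scores):
    """Sum the six perceptual scores, each of which must be 0, 1, 2 or 3."""
    for name in PARAMETERS:
        if scores[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be an integer from 0 to 3")
    return sum(scores[name] for name in PARAMETERS)

def assign_group(total):
    """Apply the enrolment thresholds; None means the subject fits neither group."""
    if total < 1:
        return "control"
    if total >= 10:
        return "pathological"
    return None

# Hypothetical rating sheet for one subject
example = {"grade": 2, "roughness": 3, "breathiness": 2,
           "asthenia": 1, "strain": 2, "instability": 2}
print(grbas_total(example), assign_group(grbas_total(example)))
```

Note that a subject with a total between 1 and 9 would belong to neither group under these criteria.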
Voice was recorded with a high quality voice recorder and spectrometer, the Computerized Speech Lab model 4150 manufactured by Kay PENTAX (http://www.kayelemetrics.com). The recorder included a high quality cabled microphone that was placed at a distance of about 20 cm from the lips and at a 45° angle. Voice was digitally recorded at a sampling rate of 50 kHz. Recordings were made in a sound-controlled room without windows and equipped with low-reverberation walls.
The subjects were asked to read aloud at least three sentences from the text "Il Deserto" (The Desert), a passage very often used in phonetics to stimulate voice and evaluate phonation. The text was written in Italian since all the subjects were native Italian speakers.
The three sentences went as follows.
Sentence 1: "Il deserto è un'immensa distesa di sabbia priva d'acqua e di vegetazione sulla quale si alzano collinette chiamate dune." (The desert is an immense expanse of sand without water or vegetation, on which rise mounds called dunes).
Sentence 2: "Sul deserto il cielo è quasi sempre infuocato e il sole brucia e accieca." (Over the desert the sky is almost always on fire and the sun burns and blinds).
Sentence 3: "Di giorno il caldo è soffocante, ma di notte la sabbia si raffredda e la temperatura si fa rigida." (By day the heat is stifling, but at night the sand cools and the temperature becomes very cold).
The three sentences were read consecutively and recorded in the same acquisition, on the same audio file. The "wav" uncompressed format was chosen as it allowed lossless recording of audio data. The recording was repeated once for each subject.
All the trials were processed using Praat v.5.2.23 software (Paul Boersma and David Weenink, University of Amsterdam).
Data were processed according to the following steps: (1) Recognizing and trimming the sentence to process from the whole acquisition. This step was carried out by the operator, who listened to the recorded signal and selected the sentence to trim in the waveform visualization. (2) Generating a voice report for the selection. (3) Collecting the following voice parameters from the report, in line with other works described in the literature:
- Mean pitch (Hz)
- Standard deviation (SD) (of Pitch) (Hz)
- Minimum pitch (Hz)
- Maximum pitch (Hz)
- Jitter (local) (%)
- Shimmer (local) (%)
- Mean HNR (dB).
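Step (1) was performed manually by an operator in the waveform view. Purely as an illustration of what trimming one sentence out of a longer uncompressed recording involves, the sketch below copies a time interval from one WAV stream to another with Python's standard `wave` module. The function name and the start and end times are hypothetical, and the in-memory silent recording merely stands in for a real file.

```python
import io
import wave

def trim_wav(src, dst, start_s, end_s):
    """Copy the frames between start_s and end_s (seconds) from src to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        total = reader.getnframes()
        first = min(int(round(start_s * params.framerate)), total)
        last = min(int(round(end_s * params.framerate)), total)
        reader.setpos(first)
        frames = reader.readframes(max(last - first, 0))
    with wave.open(dst, "wb") as writer:
        writer.setnchannels(params.nchannels)
        writer.setsampwidth(params.sampwidth)
        writer.setframerate(params.framerate)
        writer.writeframes(frames)
    # Number of frames actually written
    return len(frames) // (params.nchannels * params.sampwidth)

# Stand-in for a recording: 1 s of 16-bit mono silence at 50 kHz
recording = io.BytesIO()
with wave.open(recording, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(50_000)
    w.writeframes(b"\x00\x00" * 50_000)
recording.seek(0)

sentence = io.BytesIO()
n_frames = trim_wav(recording, sentence, start_s=0.2, end_s=0.5)
print(n_frames, "frames kept")
```

With real files, `src` and `dst` could equally be file paths; the voice report of steps (2) and (3) would then be generated on the trimmed segment.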
Pitch is a quantitative parameter that takes into account the dominant frequencies produced by the vocal emission. Jitter is a quantitative parameter that represents the cycle-to-cycle deviation of the fundamental period of a signal. It is computed as the average absolute difference between consecutive periods, divided by the average period, and it is expressed as a percentage. Shimmer represents the variability of the peak-to-peak amplitude of the signal. It is computed as the average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude, and it is expressed as a percentage. The HNR is a parameter that represents the ratio between the strength of the harmonic components of the voice and that of the noise components. It is closely related to voice hoarseness.
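The two definitions share the same arithmetic: the mean absolute difference between consecutive values, divided by the mean value. A minimal sketch follows, assuming the glottal periods and the per-period amplitudes have already been extracted; the function names and the sample values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values over their mean, in percent."""
    if len(values) < 2:
        raise ValueError("at least two consecutive cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value

def jitter_local(periods_s):
    """Jitter (local, %) from a sequence of consecutive period durations."""
    return local_perturbation(periods_s)

def shimmer_local(amplitudes):
    """Shimmer (local, %) from the peak-to-peak amplitudes of consecutive periods."""
    return local_perturbation(amplitudes)

# Hypothetical values for a few consecutive cycles of one voice
periods = [0.00500, 0.00504, 0.00498, 0.00502, 0.00500]
amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]
print(round(jitter_local(periods), 2), round(shimmer_local(amplitudes), 2))
```

A perfectly periodic signal gives 0% for both measures, and larger values indicate a less stable voice.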
Frequency difference between minimum pitch and maximum pitch was also computed to quantify the Frequency Range of spoken voice.
Mean value, SD, median value and interquartile range (IR) were computed for each sentence.
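These summary statistics, together with the Frequency Range defined above, can be sketched with Python's standard `statistics` module. The function names and sample values are hypothetical. Two details are assumptions, since the text does not specify them: the SD is computed as the sample SD, and the quartiles use the module's default interpolation method, which may differ slightly from other software.

```python
import statistics

def frequency_range(min_pitch_hz, max_pitch_hz):
    """Difference between maximum and minimum pitch, in Hz."""
    return max_pitch_hz - min_pitch_hz

def describe(values):
    """Mean, sample SD, median and interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical mean-pitch values (Hz) of one sentence across eight subjects
pitch_values = [198.0, 204.5, 210.2, 195.3, 201.7, 207.9, 199.4, 203.1]
summary = describe(pitch_values)
print({key: round(value, 2) for key, value in summary.items()})
print(frequency_range(150.0, 320.0))
```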
The non-parametric Mann-Whitney U test was used to check whether there were statistically significant differences, at a significance level of α = 0.05, between the parameters of the three sentences. A non-parametric test was chosen because the sample size was considered too small for a parametric test.
Results were then averaged over the three sentences, obtaining one value for each parameter and each subject. The non-parametric Mann-Whitney U test was used again to study statistical differences between the pathological and control groups. A significance level of α = 0.05 was chosen and the process was repeated for each parameter studied.
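For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic from ranks and a two-sided P value from the normal approximation. It is an illustration under stated simplifications, not the procedure of the statistical package used in the study: ties receive average ranks, but no tie or continuity correction is applied, and for very small groups exact tables would be preferable. The function names and the sample values are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1            # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided P) using the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided tail probability
    return u, p

# Hypothetical Jitter values (%) for a few control and pathological subjects
controls = [1.1, 1.3, 1.2, 1.0, 1.4, 1.2]
patients = [2.4, 3.1, 1.9, 2.8, 3.5, 2.2]
u_stat, p_value = mann_whitney_u(controls, patients)
print(u_stat, p_value < 0.05)
```

The same function would be called once per parameter, mirroring the repetition of the test described above.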
Values for average, SD and CV were then averaged over the subjects, in order to have a single value for each parameter and for each group.
Results were represented through bar graphs with error range. Furthermore, a detailed table was presented to compare results between pathological group and control group.
This study was approved by the Institutional Scientific Board of IRCCS San Raffaele Pisana, Rome, Italy. Each subject gave consent to the use of her data for research purposes.
| Results|| |
A total of 15 pathological subjects were included in the study. At the preliminary clinical assessment, they had an average GRBAS grade of 14 (SD = 3.2), with a minimum of 10 and a maximum of 18.
All the parameters were averaged over the subjects of each group. Results for each group are reported in [Table 1] in terms of median values and IR. Results are also reported as bar graphs, shown in [Figure 1] and [Figure 2]. The black error bars represent the SD.
|Table 1: Average data, between the three sentences, for the two groups studied|
|Figure 1: Averaged Pitch parameters for both groups. Black error bars represent the standard deviation between subjects. *Statistically significant difference between the groups with P < 0.05|
|Figure 2: Averaged Jitter, Shimmer and harmonic to noise ratio parameters for both groups. Black error bars represent the standard deviation between subjects. *Statistically significant difference between the groups with P < 0.05|
The statistical test showed no statistically significant differences between the three sentences for any of the studied parameters in either group (P > 0.05).
The SD among control subjects was very small for all the parameters except maximum pitch and Frequency Range (black error bars in [Figure 1] and [Figure 2]), suggesting a low dispersion of values among the subjects. The SD was instead higher for all the parameters in the pathological group, showing a higher dispersion of data. The mean values of the pitch-related parameters (Mean pitch, SD of Pitch and Frequency Range, i.e. maximum pitch minus minimum pitch) were higher for the dysphonic subjects [Figure 1] and [Figure 2]. Jitter and Shimmer of dysphonic subjects were higher than those of controls. HNR of dysphonic subjects was instead lower than that of the control group.
The statistical test performed between the pathological group and the control group showed significant differences for all the parameters. SD of Pitch and minimum pitch showed a statistically significant difference with P < 0.05, while maximum pitch, Frequency Range (difference between maximum and minimum pitch), Jitter, Shimmer and HNR showed a very high level of significance with P < 0.001.
| Discussion|| |
Mean values of the parameters were similar among the three sentences, suggesting that the parameters were related to the specific voice and did not depend on the specific words spoken. This was true for both the pathological group and the control group.
The control group showed very low variability of the parameters between subjects. This consistency was indicated by a very low SD for Jitter, Shimmer and HNR and for Mean Pitch and SD of Pitch. This means that the values were similar among the healthy subjects. The values found therefore represent a normality reference for the adult female voice. In contrast, the higher SD among pathological subjects suggested high variability between subjects. This might be due to the different grades of dysphonia of the patients included in the study. In fact, pathological subjects were characterized by GRBAS grades ranging from 10 to 18, with an average of 15.
The statistical comparison between pathological subjects and control confirmed that all the parameters were clearly suitable to characterize the patients in the pathological group, with respect to the control group.
These results were similar to those of Albertini et al., who found that the voice of DS subjects had a significantly higher frequency than that of the control group.
The high SD of Pitch found in the dysphonic group in this study was also compatible with the wide spoken Frequency Range found in the same group. This result was also supported by the high Jitter and Shimmer values, which represent, respectively, short-term variations in the Fundamental Frequency and in its amplitude.
The SD of pitch has already been shown to be related to voice patterns. A lower SD indicates a power spectrum concentrated near the mean dominant frequency, while a high SD suggests a wider spectrum, typical of pathological voices. SD has also been shown to be gender independent.
Results are in accordance with qualitative considerations. In fact, dysphonic voices are unstable, tremulous and insecure. This implies a high variability in the frequencies produced and hence a high variability in the dominant frequency. This phenomenon is quantified by the SD of Pitch and also by the Jitter and Shimmer parameters.
A pathological voice is also qualitatively perceived as weaker, less clear and less limpid than a normal voice. These characteristics are quantified by the HNR value, which represents the ratio between the harmonic component of the signal and the non-harmonic components that produce distortion and perceived noise. In accordance with qualitative considerations, HNR was lower for the dysphonic group and higher for the control group.
| Conclusions|| |
This study demonstrated that the quantitative analysis of spoken sentences, in terms of frequency parameters, can provide useful information about the health of a human voice.
Results showed significant differences between pathological subjects and controls. Frequency parameters such as Mean Pitch and SD of Pitch, as well as parameters that indicate Pitch variations, such as Jitter, Shimmer and Frequency Range, were significantly higher for dysphonic subjects. The HNR was instead higher for control subjects, indicating that controls have a cleaner voice than dysphonic subjects. Thus, the studied quantitative parameters have the ability to identify a pathological voice. The analysis may therefore be used to detect voice abnormalities, to monitor the outcomes of a therapy and to support the diagnosis of dysphonia with objective data.
The spectral analysis of the human voice is clearly a useful tool in the rehabilitation of speech and language disorders.
Processing voices from a control group also made it possible to develop reference data for the female human voice. As the parameters found in this study were very similar among the control subjects, the values found can be taken as representative of the normal female voice and may be used as reference data for other studies on pathological voices.
This work does, however, have some limitations. First of all, the analysis is limited to the adult female voice. Future development of the study may extend the analysis to male voices and children's voices. Furthermore, pathological subjects were enrolled with the inclusion criterion of GRBAS ≥10, and in fact we studied subjects with GRBAS grades ranging from 10 to 18. This is quite limiting because GRBAS represents the severity of the pathology. Further studies may involve recruiting different groups composed of subjects with the same GRBAS grade. This may also allow a correlation between the quantitative parameters and the grade of pathology to be studied.
| Acknowledgments|| |
The authors wish to acknowledge all the people who took part in this study, the staff of IRCCS San Raffaele Pisana, Rome, Italy, and MD Claudia Condoluci, TdR Nunzio Tenore and TdR Federica Alberici, who helped with subject recruitment and data acquisition.
| References|| |
|1.||Fant G. Acoustic Theory of Speech Production. The Hague: Mouton; 1960. |
|2.||Flanagan JL. Speech Analysis, Synthesis, and Perception. Berlin: Springer; 1972. |
|3.||Hirano M, Hibi S, Yoshida T, Hirade Y, Kasuya H, Kikuchi Y. Acoustic analysis of pathological voice. Some results of clinical application. Acta Otolaryngol 1988;105:432-8. |
|4.||Dejonckere PH, Remacle M, Fresnel-Elbaz E, Woisard V, Crevier L, Millet B. Reliability and clinical relevance of perceptual evaluation of pathological voices. Rev Laryngol Otol Rhinol (Bord) 1998;119:247-8. |
|5.||Wirz S. Perceptual Approaches to Communication Disorders. London: Whurr; 1995. |
|6.||Hammarberg B, Fritzell B, Gauffin J, Sundberg J, Wedin L. Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol 1980;90:441-51. |
|7.||Laver J. The Phonetic Description of Voice Quality. London: Cambridge University Press; 1980. |
|8.||Hirano M. Clinical Examination of Voice. New York: Springer-Verlag; 1981. |
|9.||Dejonckere PH, Remacle M, Fresnel-Elbaz E, Woisard V, Crevier-Buchman L, Millet B. Differentiated perceptual evaluation of pathological voice quality: Reliability and correlations with acoustic measurements. Rev Laryngol Otol Rhinol (Bord) 1996;117:219-24. |
|10.||De Bodt MS, Wuyts FL, Van de Heyning PH, Croux C. Test-retest study of the GRBAS scale: Influence of experience and professional background on perceptual rating of voice quality. J Voice 1997;11:74-80. |
|11.||Dehqan A, Ansari H, Bakhtiar M. Objective voice analysis of Iranian speakers with normal voices. J Voice 2010;24:161-7. |
|12.||Brockmann M, Drinnan MJ, Storck C, Carding PN. Reliable jitter and shimmer measurements in voice clinics: The relevance of vowel, gender, vocal intensity, and fundamental frequency effects in a typical clinical task. J Voice 2011;25:44-53. |
|13.||Albertini G, Giaquinto S, Mignano M. Spectral analysis of the human voice: A potentially useful tool in rehabilitation. Eur J Phys Rehabil Med 2009;45:537-45. |
|14.||Albertini G, Bonassi S, Dall'Armi V, Giachetti I, Giaquinto S, Mignano M. Spectral analysis of the voice in Down Syndrome. Res Dev Disabil 2010;31:995-1001. |
|15.||Fussi F. Il Trattamento Logopedico Delle Disfonie Ipercinetiche. Italy: Omega; 1992. |
|16.||Boersma P, Weenink D. PRAAT. The Netherlands: University of Amsterdam; 2006. Available from: http://www.fon.hum.uva.nl/praat/ . [Last accessed on 2012 Jun 06]. |