ESpeak

{{Short description|Compact, open-source, software speech synthesizer}}

{{Use dmy dates|date=March 2024}}

{{lowercase title}}

{{Infobox software

|name = eSpeakNG

|logo = File:ESpeak logo.svg

|screenshot =

|caption = Logo of eSpeak

|collapsible =

|author = Jonathan Duddington

|developer = Alexander Epaneshnikov [https://github.com/orgs/espeak-ng/people et al.]

|released = {{start date and age|df=yes|2006|02}}

| latest release version = {{wikidata|property|preferred|references|edit|P348|P548=Q2804309}}

| latest release date = {{Start date and age|{{wikidata|qualifier|preferred|single|P348|P548=Q2804309|P577}}|df=yes}}

|latest preview version =

|latest preview date =

|programming language = C

|operating system = Linux, Windows, macOS, FreeBSD

|platform =

|language =

|genre = Speech synthesizer

|license = GPLv3

|website = {{URL|https://github.com/espeak-ng/espeak-ng/}}

|repo = {{URL|https://github.com/espeak-ng/espeak-ng/}}

}}

eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG (Next Generation) is a continuation of the original developer's project, with more feedback from native speakers.

Because of its small size and support for many languages, eSpeakNG is included in the NVDA{{Cite web|url=https://github.com/nvaccess/nvda/issues/5651|title=Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/nvda|website=GitHub}} open-source screen reader for Windows, as well as in Android,{{Cite web|url=https://play.google.com/store/apps/details?id=com.googlecode.eyesfree.espeak|title=eSpeak TTS - Android Apps on Google Play|website=play.google.com}} Ubuntu{{Cite web|url=https://launchpad.net/ubuntu/+source/espeak-ng/+index|title=espeak-ng package : Ubuntu|date=21 December 2023|website=Launchpad}} and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016{{Cite web|url=https://support.office.com/en-us/article/download-voices-for-immersive-reader-read-mode-and-read-aloud-4c83a8d8-7486-42f7-8e46-2b0fdf753130|title = Download voices for Immersive Reader, Read Mode, and Read Aloud}} and was used by Google Translate for 27 languages in 2010;Google blog, [https://translate.googleblog.com/2010/05/giving-voice-to-more-languages-on.html Giving a voice to more languages on Google Translate], May 2010 17 of these were subsequently replaced by proprietary voices.Google blog, [https://translate.googleblog.com/2010/12/listen-to-us-now.html Listen to us now], December 2010.

The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.{{Cite web|url=https://espeak.sourceforge.net/languages.html|title=eSpeak Speech Synthesizer|website=espeak.sourceforge.net}} Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English.{{Cite web|url=https://espeak.sourceforge.net/|title=eSpeak: Speech Synthesizer|website=espeak.sourceforge.net}} On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.{{Cite web|url=https://sourceforge.net/projects/espeak/files/espeak/|title = ESpeak: Speech synthesis - Browse /Espeak at SourceForge.net}} Development on Speak continued until version 1.14, when it was renamed to eSpeak.

Development of eSpeak continued from version 1.16 (there was no 1.15 release), with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first version of eSpeak to be version-controlled using Subversion,{{Cite web|url=https://sourceforge.net/p/espeak/code/commit_browser|title=eSpeak: speech synthesis / Code / Browse Commits|website=sourceforge.net}} with separate source and binary downloads made available on SourceForge. From version 1.27, eSpeak was updated to use the GPLv3 license. The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.{{Cite web|url=http://espeak.sourceforge.net/download.html|title = Espeak: Downloads}} The last development release of eSpeak was 1.48.15 on 16 April 2015.http://espeak.sourceforge.net/test/latest.html {{Bare URL inline|date=August 2024}}

eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.{{cite CiteSeerX|title=Latin to Speech|date=26 July 2007|first1=Jan-Willem|last1=van Leussen|first2=Maarten|last2=Tromp|page=6|df=dmy-all|citeseerx=10.1.1.396.7811}}

= eSpeak NG =

On 25 June 2010,{{Cite web|url=https://github.com/rhdunn/espeak/commit/63daaecefccde34b700bd909d23c6dd2cac06e20|title = Build: Allow portaudio 18 and 19 to be switched easily. · rhdunn/Espeak@63daaec|website = GitHub}} Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.

On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.{{Cite web|url=https://github.com/rhdunn/espeak/commit/61522a12a38453a4e854fd9c9e0994ad80420243|title = Espeakedit: Fix argument processing for unicode argv types · rhdunn/Espeak@61522a1|website = GitHub}}{{Cite web|url=https://github.com/nvaccess/nvda/issues/5651#issuecomment-170288487|title = Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/Nvda|website = GitHub}}

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the eight months since the last eSpeak development release. These evolved into discussions about continuing development of eSpeak in Jonathan's absence.{{Cite web|url=https://sourceforge.net/p/espeak/mailman/message/34679511/|title=[Espeak-general] Taking ownership of the espeak project and its future | eSpeak: speech synthesis|website=sourceforge.net}}{{Cite web|url=https://sourceforge.net/p/espeak/mailman/message/34680161/|title=[Espeak-general] Vote for new main espeak developer | eSpeak: speech synthesis|website=sourceforge.net}} The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

On 11 December 2015, the espeak-ng fork was started.{{Cite web|url=https://github.com/espeak-ng/espeak-ng/issues/2|title=Rebrand the espeak program to espeak-ng. · Issue #2 · espeak-ng/espeak-ng|website=GitHub}} The first release of espeak-ng was 1.49.0 on 10 September 2016,{{Cite web|url=https://github.com/espeak-ng/espeak-ng/releases/tag/1.49.0|title=Release 1.49.0 · espeak-ng/espeak-ng|website=GitHub}} containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.
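As an illustrative sketch of driving the command-line program from a script (this assumes the espeak-ng binary is installed and on the PATH; the helper simply returns None when it is not, and the -q, -x and -v switches are standard espeak-ng options):

```python
import shutil
import subprocess

def transcribe(text, voice="en"):
    """Ask the espeak-ng command-line program for its phoneme
    transcription of `text`: -x prints phoneme mnemonics to stdout,
    -q suppresses audio output.  Returns None if espeak-ng is not
    installed."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        ["espeak-ng", "-q", "-x", "-v", voice, text],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(transcribe("hello world"))
```

The same -v switch also accepts a voice combined with a variant, such as "af+f2".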

It supports Speech Synthesis Markup Language (SSML).

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.

Phonetic representations can be included within text input by enclosing them in double square brackets. For example, espeak-ng -v en "Hello [[w3:ld]]" will say {{audio|Hello world said by eSpeakNG.ogg|Hello world|help=no}} in English.

Synthesis method

File:ESpeakNG intro by eSpeakNG in English.ogg

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which steps of the text-to-speech translation process the user wants to use.

= Step 1 – text-to-phoneme translation =

There are many languages (notably English) which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into {{mono|zi@r0ks}} for pronunciation).
  2. Pronunciation phonemes are synthesized into sound, e.g. {{mono|zi@r0ks}} is voiced as {{audio|Monotone-xerox pronunciated by eSpeakNG.ogg|{{mono|zi@r0ks}} in monotone way|help=no}}

To add intonation to the speech, prosody data are necessary (e.g. syllable stress, falling or rising pitch of the fundamental frequency, pauses, etc.), along with other information that allows the synthesis of more human-sounding, non-monotonous speech. For example, in eSpeakNG notation a stressed syllable is marked with an apostrophe: {{mono|z'i@r0ks}}, which provides more natural speech: {{audio|Xerox pronunciated by eSpeakNG.ogg|{{mono|z'i@r0ks}} with intonation|help=no}}
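The need for rule-driven text-to-phoneme translation can be sketched with a toy longest-match rule table (the rules and phoneme strings below are invented for illustration in eSpeak-style ASCII notation; eSpeakNG's actual front end uses compiled per-language dictionary and rule files):

```python
# Toy letter-to-sound rules, tried in order at each position.
# These rules are hypothetical, not eSpeakNG's real rule data.
RULES = [
    ("xe", "zi"),   # "xe" at the start of "xerox" -> /zi/
    ("ro", "r0"),
    ("x", "ks"),
    ("o", "oU"),
    ("e", "E"),
]

def to_phonemes(word):
    out, i = [], 0
    while i < len(word):
        for spelling, phon in RULES:
            if word.startswith(spelling, i):
                out.append(phon)
                i += len(spelling)
                break
        else:               # no rule matched: pass the letter through
            out.append(word[i])
            i += 1
    return "".join(out)

print(to_phonemes("xerox"))  # -> zir0ks (toy output, not eSpeakNG's)
```

A real front end layers exception dictionaries and context-sensitive rules on top of this basic longest-match idea.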

For comparison, two samples without and with prosody data:

  1. {{mono|DIs Iz m0noUntoUn spi:tS}} is spoken {{audio|Monotone speech sample by eSpeakNG.ogg|in monotone way|help=no}}
  2. {{mono|DIs Iz 'Int@n,eItI2d sp'i:tS}} is spoken {{audio|Intonated speech sample by eSpeakNG.ogg|in intonated way|help=no}}

If eSpeakNG is used only to generate prosody data, that data can be used as input to MBROLA diphone voices.

= Step 2 – sound synthesis from prosody data =

eSpeakNG provides two different types of formant speech synthesis, using two different approaches: its own eSpeakNG synthesizer and a Klatt synthesizer:{{cite web|authorlink=Dennis H. Klatt|first=Dennis H.|last=Klatt|year=1979|url=http://www.fon.hum.uva.nl/david/ma_ssp/2010/Klatt-1980-JAS000971.pdf|title=Software for a cascade/parallel formant synthesizer|publisher=J. Acoustical Society of America, 67(3) March 1980}}

  1. The eSpeakNG synthesizer creates voiced speech sounds, such as vowels and sonorant consonants, by additive synthesis, adding together sine waves to make the total sound. Unvoiced consonants such as /s/ are made by playing recorded sounds,{{Cite web|url=https://github.com/espeak-ng/espeak-ng/tree/master/phsource/ufric|title=espeak-ng|website=GitHub}} because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it produces sounds by subtractive synthesis: it starts with generated noise, which is rich in harmonics, and then applies digital filters and enveloping to shape the frequency spectrum and sound envelope required for a particular consonant (s, t, k) or sonorant (l, m, n) sound.
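The additive approach in point 1 can be sketched as follows. This is a minimal, generic illustration of formant-style additive synthesis: harmonics of a fundamental are summed, with those near assumed formant centres weighted more strongly. The formant frequencies, bandwidth constant and sample rate are invented values, not eSpeakNG's voice data:

```python
import math

def formant_tone(f0=120.0, formants=(700.0, 1200.0, 2600.0),
                 sr=22050, dur=0.2):
    """Crude additive formant synthesis: sum sine-wave harmonics of
    the fundamental f0, boosting harmonics near the given formant
    frequencies.  Returns a list of samples normalised to [-1, 1]."""
    harmonics = [k * f0 for k in range(1, int((sr / 2) // f0))]

    def weight(f):
        # Gaussian boost around each formant centre (150 Hz spread
        # is an arbitrary illustrative value).
        return sum(math.exp(-((f - fm) / 150.0) ** 2) for fm in formants)

    weights = [weight(f) for f in harmonics]
    norm = sum(weights) or 1.0
    samples = []
    for i in range(int(sr * dur)):
        t = i / sr
        s = sum(w * math.sin(2 * math.pi * f * t)
                for f, w in zip(harmonics, weights)) / norm
        samples.append(s)
    return samples
```

Changing the formant frequencies over time is what turns such a static buzz into recognisable vowels; a real synthesizer also applies amplitude envelopes and mixes in noise for consonants.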

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program using the PHO file format, capturing the audio that MBROLA creates as output. That audio is then handled by eSpeakNG.
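The first half of that pipeline can be observed from the command line via espeak-ng's --pho switch, which writes the PHO data to standard output. A small sketch (it assumes espeak-ng and an MBROLA voice such as mb-en1 are installed, and returns None otherwise):

```python
import shutil
import subprocess

def mbrola_pho(text, voice="mb-en1"):
    """Generate MBROLA .pho data (phoneme names, durations and pitch
    points) using espeak-ng's --pho switch.  Returns None when
    espeak-ng is missing or the requested MBROLA voice is not
    installed."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        ["espeak-ng", "-q", "--pho", "-v", voice, text],
        capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None
```

The resulting text stream is what eSpeakNG feeds to the mbrola program, which performs the diphone synthesis.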

Languages

eSpeakNG performs text-to-speech synthesis for the following languages:{{Cite web|url=https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md|title = ESpeak NG Text-to-Speech|website = GitHub|date = 13 February 2022}}

{{div col|colwidth=16em}}

  1. AfrikaansButgereit, L., & Botha, A. (2009, May). [https://www.researchgate.net/profile/Adele_Botha/publication/30511664_Hadeda_the_noisy_way_to_practice_spelling_vocabulary_using_a_cell_phone/links/02e7e52a0861732946000000.pdf Hadeda: The noisy way to practice spelling vocabulary using a cell phone]. In The IST-Africa 2009 Conference, Kampala, Uganda.
  2. AlbanianHamiti, M., & Kastrati, R. (2014). [https://web.archive.org/web/20161227202019/https://pdfs.semanticscholar.org/6835/2a30a7017fafefba49e38bd07bb273e88cbe.pdf Adapting eSpeak for converting text into speech in Albanian]. International Journal of Computer Science Issues (IJCSI), 11(4), 21.
  3. Amharic
  4. Ancient Greek
  5. Arabic1
  6. AragoneseKayte, S., & Gawali, D. B. (2015). Marathi Speech Synthesis: A review. International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
  7. Armenian (Eastern Armenian)
  8. Armenian (Western Armenian)
  9. Assamese
  10. Azerbaijani
  11. Bashkir
  12. Basque
  13. Belarusian
  14. Bengali
  15. Bishnupriya Manipuri
  16. Bosnian
  17. Bulgarian
  18. Burmese
  19. Cantonese
  20. Catalan
  21. Cherokee
  22. Chinese (Mandarin)
  23. Croatian
  24. Czech
  25. Chuvash
  26. Danish
  27. Dutch
  28. English (American)
  29. English (British)
  30. English (Caribbean)
  31. English (Lancastrian)
  32. English (New York City)5
  33. English (Received Pronunciation)
  34. English (Scottish)
  35. English (West Midlands)
  36. Esperanto
  37. Estonian
  38. Finnish
  39. French (Belgian)
  40. French (Canada)
  41. French (France)
  42. Georgian
  43. German
  44. Greek (Modern)
  45. Greenlandic
  46. Guarani
  47. Gujarati
  48. Hakka Chinese3
  49. Haitian Creole
  50. Hawaiian
  51. Hebrew
  52. Hindi
  53. Hungarian
  54. Icelandic
  55. Indonesian
  56. Ido
  57. Interlingua
  58. Irish
  59. Italian
  60. Japanese4Pronk, R. (2013). [http://dare.uva.nl/cgi/arno/show.cgi?fid=494975 Adding Japanese language synthesis support to the eSpeak system]. University of Amsterdam.
  61. Kannada
  62. Kazakh
  63. Klingon
  64. Kʼicheʼ
  65. KonkaniMohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). Text Reader for Konkani Language. Automation and Autonomous System, 4(8), 409-414.
  66. Korean
  67. Kurdish
  68. Kyrgyz
  69. Quechua
  70. Latin
  71. Latgalian
  72. Latvian
  73. Lingua Franca Nova
  74. Lithuanian
  75. Lojban
  76. Luxembourgish
  77. Macedonian
  78. Malay
  79. Malayalam
  80. Maltese
  81. Manipuri
  82. Māori
  83. Marathi
  84. Nahuatl (Classical)
  85. Nepali
  86. Norwegian (Bokmål)
  87. Nogai
  88. Oromo
  89. Papiamento
  90. Persian
  91. Persian (Latin alphabet)2
  92. Polish
  93. Portuguese (Brazilian)
  94. Portuguese (Portugal)
  95. PunjabiKaur, R., & Sharma, D. (2016). [https://www.irjet.net/archives/V3/i4/IRJET-V3I4102.pdf An Improved System for Converting Text into Speech for Punjabi Language using eSpeak]. International Research Journal of Engineering and Technology, 3(4), 500-504.
  96. Pyash (a constructed language)
  97. Quenya
  98. Romanian
  99. Russian
  100. Russian (Latvia)
  101. Scottish Gaelic
  102. Serbian
  103. Setswana
  104. Shan (Tai Yai)
  105. Sindarin
  106. Sindhi
  107. Sinhala
  108. Slovak
  109. Slovenian
  110. Spanish (Spain)
  111. Spanish (Latin American)
  112. Swahili
  113. Swedish
  114. Tamil
  115. Tatar
  116. Telugu
  117. Thai
  118. Turkmen
  119. Turkish
  120. Uyghur
  121. Ukrainian
  122. Urarina
  123. Urdu
  124. Uzbek
  125. Vietnamese (Central Vietnamese)
  126. Vietnamese (Northern Vietnamese)
  127. Vietnamese (Southern Vietnamese)
  128. Welsh

{{div col end}}

  1. Currently, only fully diacritized Arabic is supported.
  2. Persian written using English (Latin) characters.
  3. Currently, only Pha̍k-fa-sṳ is supported.
  4. Currently, only Hiragana and Katakana are supported.
  5. Currently unreleased; it must be built from the latest source code.

References

{{Reflist}}