NETtalk (artificial neural network)

{{Short description|Artificial neural network}}

File:NETtalk-Back-propagation.jpg

NETtalk is an artificial neural network that learns to pronounce written English text by supervised learning. It takes English text as input and produces a matching phonetic transcription as output.

It is the result of research carried out in the mid-1980s by Terrence Sejnowski and Charles Rosenberg. The intent behind NETtalk was to construct a simplified model that might shed light on the complexity of learning human-level cognitive tasks, and to implement it as a connectionist model that could learn to perform a comparable task. The authors trained it by backpropagation.Sejnowski, Terrence J., and Charles R. Rosenberg. "[https://content.wolfram.com/sites/13/2018/02/01-1-10.pdf Parallel networks that learn to pronounce English text]." Complex systems 1.1 (1987): 145-168.

The network was trained on a large corpus of English words and their corresponding pronunciations, and is able to generate pronunciations for unseen words with a high level of accuracy. The success of NETtalk inspired further research in pronunciation generation and speech synthesis, and demonstrated the potential of neural networks for solving complex NLP problems. The output of the network was a stream of phonemes, which fed into DECtalk to produce audible speech. It achieved popular success, appearing on the Today show.{{Cite book |last=Sejnowski |first=Terrence J. |title=The deep learning revolution |date=2018 |publisher=The MIT Press |isbn=978-0-262-03803-4 |location=Cambridge, Massachusetts London, England}}{{Pg|page=115}}

From the point of view of modeling human cognition, NETtalk does not specifically model the image processing stages and letter recognition of the visual cortex. Rather, it assumes that the letters have been pre-classified and recognized. It is NETtalk's task to learn the association between a sequence of letters and its correct pronunciation, based on the context in which the letters appear.

Training

The training dataset was a 20,008-word subset of the Brown Corpus, with manually annotated phoneme and stress marks for each letter. The development process was described in a 1993 interview. It took three months (250 person-hours) to create the training dataset, but only a few days to train the network.{{Cite book |url=https://direct.mit.edu/books/book/4886/Talking-NetsAn-Oral-History-of-Neural-Networks |title=Talking Nets: An Oral History of Neural Networks |date=2000-02-28 |publisher=The MIT Press |doi=10.7551/mitpress/6626.001.0001 |isbn=978-0-262-26715-1 |language=en |editor-last1=Anderson |editor-last2=Rosenfeld |editor-first1=James A. |editor-first2=Edward }}See nettalk.names file in the original dataset file. https://archive.ics.uci.edu/dataset/150/connectionist+bench+nettalk+corpus

After it was trained successfully on this corpus, the authors tried it on a phonological transcription of an interview with a young Latino boy from a barrio in Los Angeles. The resulting network reproduced his Spanish accent.{{Pg|page=115}}

The original NETtalk was implemented on a Ridge 32, which took 0.275 seconds per learning step (one forward and one backward pass). Training NETtalk became a benchmark for testing the efficiency of backpropagation programs. For example, an implementation on the Connection Machine-1 (with 16,384 processors) ran at a 52x speedup, and an implementation on a 10-cell Warp at a 340x speedup.{{Cite book |last1=Pomerleau |last2=Gusciora |last3=Touretzky |last4=Kung |chapter=Neural network simulation at Warp speed: How we got 17 million connections per second |date=1988 |title=IEEE International Conference on Neural Networks |chapter-url=https://doi.org/10.1109/icnn.1988.23922 |publisher=IEEE |pages=143–150 vol.2 |doi=10.1109/icnn.1988.23922|isbn=0-7803-0999-5 }}{{Cite journal |last1=Borkar |first1=S. |last2=Cohn |first2=R. |last3=Cox |first3=G. |last4=Gleason |first4=S. |last5=Gross |first5=T. |date=1988-11-01 |title=iWarp: an integrated solution of high-speed parallel computing |url=https://dl.acm.org/doi/abs/10.5555/62972.63015 |journal=Proceedings of the 1988 ACM/IEEE Conference on Supercomputing |series=Supercomputing '88 |location=Washington, DC, USA |publisher=IEEE Computer Society Press |pages=330–339 |isbn=978-0-8186-0882-7}}

The following table compiles the benchmark scores as of 1988.{{Cite journal |last1=Blelloch |first1=Guy |last2=Rosenberg |first2=Charles R. |date=1987-08-23 |title=Network learning on the connection machine |url=https://dl.acm.org/doi/abs/10.5555/1625015.1625081 |journal=Proceedings of the 10th International Joint Conference on Artificial Intelligence - Volume 1 |series=IJCAI'87 |location=San Francisco, CA, USA |publisher=Morgan Kaufmann Publishers Inc. |pages=323–326 }} Speed is measured in "millions of connections per second" (MCPS). For example, the original NETtalk on the Ridge 32 took 0.275 seconds per forward-backward pass, giving <math>\frac{18629/10^6}{0.275} \approx 0.068</math> MCPS. Relative speeds are normalized to the MicroVax.
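The MCPS figure and relative speed for any entry can be rechecked from the weight count and the time per pass; a quick illustration in Python (not part of the original benchmark code):

```python
# Recompute the Ridge 32 entry: 18,629 weights visited once per
# forward-backward pass, at 0.275 seconds per pass.
weights = 18_629
seconds_per_pass = 0.275

mcps = (weights / 1e6) / seconds_per_pass
print(round(mcps, 3))  # 0.068, matching the figure in the text

# Relative speed versus the MicroVax baseline (0.008 MCPS).
# The table's 8.8 comes from the rounded 0.07 MCPS entry.
print(round(mcps / 0.008, 1))
```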

{| class="wikitable"
|+ Performance comparison (as of 1988)
! System !! MCPS !! Relative speed
|-
| MicroVax || 0.008 || 1
|-
| Sun 3/75 || 0.01 || 1.3
|-
| VAX-11 780 || 0.027 || 3.4
|-
| Sun 160 with FPA || 0.034 || 4.2
|-
| DEC VAX 8600 || 0.06 || 7.5
|-
| Ridge 32 || 0.07 || 8.8
|-
| Convex C-1 || 1.8 || 225
|-
| 16,384-processor CM-1 || 2.6 || 325
|-
| Cray-2 || 7 || 860
|-
| 65,536-processor CM-1 || 13 || 1600
|-
| 10-cell Warp || 17 || 2100
|-
| 10-cell iWarp || 36 || 4500
|}

Architecture

The network had three layers and 18,629 adjustable weights, large by the standards of 1986. There were worries that it would overfit the dataset, but it was trained successfully.

The input layer of the network has 203 units, divided into 7 groups of 29 units each. Each group is a one-hot encoding of one character. There are 29 possible characters: the 26 letters, comma, period, and word boundary (whitespace). To produce the pronunciation of a single character, the network takes as input the character itself together with the 3 characters before and the 3 characters after it.
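The windowed one-hot encoding can be sketched as follows; this is an illustration of the scheme described above, not the original implementation (the function name and symbol ordering are assumptions):

```python
# 29 input symbols: 26 letters, comma, period, and word boundary (whitespace).
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [",", ".", " "]

def encode_window(text: str, center: int) -> list[int]:
    """One-hot encode the 7-character window centered on `text[center]`.

    Positions outside the text are treated as word boundaries.
    Returns a 7 * 29 = 203-dimensional binary vector.
    """
    vec = []
    for offset in range(-3, 4):
        i = center + offset
        ch = text[i].lower() if 0 <= i < len(text) else " "
        group = [0] * len(ALPHABET)   # one group of 29 units per character
        group[ALPHABET.index(ch)] = 1
        vec.extend(group)
    return vec

v = encode_window("hello", 2)  # window centered on the first "l"
assert len(v) == 203 and sum(v) == 7
```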

The hidden layer has 80 units.

The output layer has 26 units: 21 encode articulatory features of phonemes (point of articulation, voicing, vowel height, etc.), and 5 encode stress and syllable boundaries.
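A minimal sketch of the resulting 203-80-26 feedforward pass, assuming sigmoid units throughout (the weight values here are random placeholders; with biases this sketch has 18,426 parameters, slightly fewer than the 18,629 weights reported, which reflects details of the original connectivity not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the article: 203 inputs, 80 hidden units, 26 outputs.
W1 = rng.normal(scale=0.1, size=(80, 203)); b1 = np.zeros(80)
W2 = rng.normal(scale=0.1, size=(26, 80));  b2 = np.zeros(26)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    """One forward pass: encoded 7-character window -> output features."""
    h = sigmoid(W1 @ x + b1)  # 80 hidden units
    y = sigmoid(W2 @ h + b2)  # 21 articulatory + 5 stress/boundary units
    return y

x = np.zeros(203)  # placeholder for a one-hot encoded input window
y = forward(x)
assert y.shape == (26,)
```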

Sejnowski studied the learned representation in the network, and found that phonemes that sound similar are clustered together in representation space. The output of the network degrades, but remains understandable, when some hidden neurons are removed.{{cite web | title=Learning, Then Talking | website=The New York Times | date=August 16, 1988 | url=https://www.nytimes.com/1988/08/16/science/learning-then-talking.html | ref={{sfnref|The New York Times|1988}} | access-date=November 4, 2024}}

References

{{reflist}}