Automatic speech recognition in the context of iTalk2Learn

Speech recognition is crucial for adapting to each child’s individual performance in iTalk2Learn. So what does it mean and how does it work?

SAIL LABS Technology (Sail) is an Austrian SME providing automatic speech recognition (ASR) support within the iTalk2Learn platform. This blog post is the first in a series, published over the duration of the project, providing background on the technology and on the challenges of applying automatic speech recognition to virtual maths tutoring for children.

What is automatic speech recognition?

Automatic speech recognition, sometimes also called speech-to-text, aims to transcribe spoken words (audio) into text. Statistical and linguistic models are employed to generate one or more candidate transcriptions for a given audio input; acoustic and linguistic evidence is combined to select the resulting transcript. These statistical models have to be trained and tested on suitable corpora. ASR systems typically allow speaker-independent recognition but require separate models for different languages and dialects.
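The combination of acoustic and linguistic evidence can be sketched in a few lines. This is a minimal illustration, not the iTalk2Learn implementation: the candidate transcriptions and all score values are invented, and in a real system the log-probabilities would come from trained acoustic and language models.

```python
# Hypothetical scores for three candidate transcriptions of one audio clip.
# In a real ASR system these would be log-probabilities from the trained
# acoustic model (AM) and language model (LM); the values here are invented.
candidates = {
    "one half plus one half": {"am": -12.0, "lm": -4.0},
    "won have plus one half": {"am": -11.5, "lm": -9.0},
    "one half plus won half": {"am": -12.5, "lm": -8.5},
}

def combined_score(scores, lm_weight=1.0):
    # Log-domain equivalent of P(audio | words) * P(words)^lm_weight.
    return scores["am"] + lm_weight * scores["lm"]

# The recogniser outputs the hypothesis with the best combined score.
best = max(candidates, key=lambda w: combined_score(candidates[w]))
print(best)
```

Note that the second candidate is acoustically the most likely on its own, but the language model penalises the implausible word sequence, so the first candidate wins overall.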

The acoustic model

The acoustic model (AM) aims at modelling acoustic features, including speaker and channel properties. Recording conditions (such as the type and quality of microphones) and speaker-dependent factors (such as age, gender, and dialectal or sociological background) all have to be reflected and modelled by the AM when transcribing audio.
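At its core, an acoustic model assigns likelihoods to observed audio features for each candidate sound. The following toy sketch, with invented one-dimensional features and made-up Gaussian parameters (real AMs use high-dimensional features and far richer models), illustrates the idea:

```python
import math

# Toy acoustic model: one Gaussian per "sound" over a single invented
# feature value. The (mean, variance) pairs here are made up.
phones = {
    "s": (0.2, 0.05),
    "a": (0.8, 0.05),
}

def log_likelihood(feature, mean, var):
    # Log of a one-dimensional Gaussian density.
    return -0.5 * (math.log(2 * math.pi * var) + (feature - mean) ** 2 / var)

# One observed feature value from a frame of audio: the AM picks the
# sound whose model explains the observation best.
frame = 0.75
best_phone = max(phones, key=lambda p: log_likelihood(frame, *phones[p]))
print(best_phone)
```

Speaker and channel variability show up as shifts in these feature distributions, which is why the training data must cover the voices and recording conditions the system will actually encounter.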

The language model

The language model (LM) aims at modelling the vocabulary, phraseology and pronunciations within a given domain. Most ASR systems use word-based vocabularies and statistical language models that reflect the co-occurrence and combination of words, rather than a syntactical model of language. The vocabulary has to be designed for the task at hand, pronunciations have to be created for every word in it, and textual data has to be collected that reflects word usage in representative contexts.
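A statistical language model of word co-occurrence can be illustrated with a tiny bigram model. This is a sketch only: the three training phrases are invented stand-ins for the kind of maths-tutoring text a real LM would be trained on.

```python
from collections import Counter, defaultdict

# Invented training phrases standing in for a real domain corpus.
corpus = [
    "one half plus one quarter",
    "one half of the fraction",
    "add one half and one half",
]

# Count how often each word follows each other word (bigram counts).
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        bigrams[prev][cur] += 1

def probability(prev, cur):
    # Relative frequency of `cur` following `prev` in the training text.
    total = sum(bigrams[prev].values())
    return bigrams[prev][cur] / total if total else 0.0

print(probability("one", "half"))     # "half" follows "one" most often
print(probability("one", "quarter"))
```

During recognition, such probabilities steer the system towards word sequences that are plausible in the domain, which is why the training text must represent how words are actually used.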

How does this work in the iTalk2Learn project?

In the context of the iTalk2Learn project, the AM has to be able to cope with children’s voices, as well as classroom conditions. A series of specific models for the recognition of children’s voices is being created for English and German, based on data collected by project partners, the Institute of Education (IOE) and Ruhr University Bochum (RUB), in British and German schools. These collections are complemented by corpora acquired from institutions such as the Linguistic Data Consortium (LDC) or the University of Erlangen, Germany. Together, these data sets are used for the creation of acoustic models reflecting as accurately as possible the types of voices encountered in the iTalk2Learn platform.

The mathematical context of iTalk2Learn determines the terminology and vocabulary which are used to create specific LMs for use within iTalk2Learn. Terminology from the field of fractions as well as general terms reflecting progress and interaction with the system have been collected from observations in schools and subsequently extended. In combination with common and frequent words these words and phrases form the training material for the LM of iTalk2Learn.

It should be emphasised that none of the models created allow the identity of the children whose voices were recorded, and whose audio is used in the statistical training process, to be inferred in any way.

If you have any questions about automatic speech recognition in the context of iTalk2Learn, please get in touch today!