In September I was in Graz (Austria) to attend InterSpeech 2019. I wrote about the work I presented there in another post. I also want to share some pictures of the city, a very quiet and nice place to host this amazing conference.
This year I presented our work on ASR in Graz at the InterSpeech conference, one of the most relevant conferences on speech technologies. I had the opportunity to talk and interact with people from the top companies working on ASR, such as members of the Amazon Alexa and Apple Siri teams. It was very enlightening to discuss the problems you face in industry in contrast to academia, where we have more freedom to explore and take risks.
Regarding our work, we presented our one-pass ASR system that benefits from state-of-the-art neural language models. Before going into the details, let me provide some background.
Generally speaking, the ASR component that performs the recognition is usually called the decoder, as it decodes the audio utterance (a sequence of vectors representing the audio signal) into words (a sequence of strings). During decoding you work at the acoustic level (the input is the audio information), which is handled by the acoustic model, while the structure of the search and the word sequences under consideration are handled by the language model.
In general, the decoding process uses these models over one, two or more iterative steps to obtain the final word sequence. The point is to reduce the number of hypotheses considered at each step and focus the search on the best ones. In most ASR frameworks (e.g. Kaldi, the most popular one) the first decoding step (first pass or one-pass) is critical, as you start from scratch and have to consider a large number of hypotheses. After this first step you can produce a compact representation of a reduced set of hypotheses that can be refined in later steps (second pass), usually in the form of a graph called a “word lattice” that represents the possible combinations of word sequences, along with their scores.
An example of a word lattice
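If you have never seen one, a word lattice can be thought of as a small graph whose arcs carry word hypotheses and their scores. Here is a toy Python sketch of that idea (purely illustrative, not the actual data structures of any decoder):

```python
# Toy representation of a word lattice: a directed graph whose arcs carry a
# word hypothesis plus its acoustic and LM scores. Illustration only.
from dataclasses import dataclass

@dataclass
class Arc:
    start: int        # source node (a point in time / search state)
    end: int          # destination node
    word: str         # word hypothesis on this arc
    am_score: float   # acoustic model log-score
    lm_score: float   # language model log-score

# A tiny lattice with two competing hypotheses for the same audio segment:
# "the cat sat" vs. "the cap sat"
lattice = [
    Arc(0, 1, "the", -1.2, -0.5),
    Arc(1, 2, "cat", -3.4, -1.1),
    Arc(1, 2, "cap", -3.1, -2.7),
    Arc(2, 3, "sat", -2.0, -0.9),
]

def path_score(path, lm_weight=1.0):
    """Combined score of a sequence of arcs (higher is better)."""
    return sum(a.am_score + lm_weight * a.lm_score for a in path)
```

A second pass can then rescore the few paths encoded in this graph with a stronger LM, which is much cheaper than rescoring the full search space.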
The key idea is that the kind of model you can use at each step is limited: during the first pass, as many hypotheses are competing and the search space is huge, you have to limit the complexity of the models, either because of their size or because of the computational cost of obtaining their scores. These limitations mainly affect the language model (LM), as the acoustic model (AM) is not as tightly integrated in the search as the LM. The AM can be seen as a feature extractor that provides vectors of scores for the decoder to process. Since decoding is usually performed time-synchronously, for each query to the AM at a given time step the LM may be queried thousands of times.
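To see why the LM gets hammered so much, here is a very simplified sketch of a time-synchronous beam search. Real decoders expand hypotheses at the HMM-state level and only query the LM at word boundaries, so take this purely as an illustration of where the cost goes, not as a description of our decoder:

```python
# Toy time-synchronous beam search (illustration only). The acoustic model is
# queried once per frame, but the language model is queried for every
# surviving hypothesis and every candidate word, which is where the cost
# explodes. `am` and `lm` are assumed objects with a .score() method.

def decode(frames, am, lm, vocab, beam_size=1000):
    # each hypothesis is (word_sequence, accumulated_log_score)
    beam = [((), 0.0)]
    for frame in frames:
        am_scores = am.score(frame)              # one AM query per time step
        new_beam = []
        for words, score in beam:
            for word in vocab:                   # thousands of LM queries
                s = score + am_scores[word] + lm.score(words, word)
                new_beam.append((words + (word,), s))
        # pruning: keep only the best hypotheses (the "beam")
        new_beam.sort(key=lambda h: h[1], reverse=True)
        beam = new_beam[:beam_size]
    return max(beam, key=lambda h: h[1])
```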
Due to this limitation, large and complex LMs, including state-of-the-art neural LMs, were mostly restricted to the second recognition step, where the set of hypotheses has already been reduced to a word lattice and decoding is not so demanding. This means you have to perform two steps to leverage the best LMs, and you do not benefit from them during the first step, potentially reducing the performance of the system.
Several approaches have been proposed to address this issue, but they usually either limit the potential of the neural model by using simpler architectures (e.g. feedforward neural networks) or substantially increase the time required for first-pass decoding if the complex ones (e.g. Long Short-Term Memory (LSTM) recurrent neural networks) are used. Beyond their structural complexity, these networks pose an additional challenge: unlike feedforward networks, which are stateless (they produce an output based only on the current input), LSTMs also depend on an internal state that has to be kept and tracked during the whole search process.
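To illustrate the difference, here is a small PyTorch sketch (my own toy example, not our implementation): with an LSTM LM, every hypothesis in the search has to carry its own recurrent state around, while a feedforward LM would only need the last few words as input.

```python
# Toy PyTorch sketch: scoring the next word with an LSTM LM requires keeping
# the recurrent state (h, c) of every hypothesis alive during the search,
# unlike a stateless feedforward LM.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 128, 512
embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim)
out_layer = nn.Linear(hidden_dim, vocab_size)

def lm_step(word_id, state):
    """Advance the LM by one word; return log-probs and the new state."""
    x = embed(torch.tensor([[word_id]]))       # shape: (1, 1, emb_dim)
    output, new_state = lstm(x, state)         # state = (h, c)
    log_probs = torch.log_softmax(out_layer(output[0, 0]), dim=-1)
    return log_probs, new_state

# Each search hypothesis stores its own (h, c); expanding it with a new word
# produces a new state that only that hypothesis (and its extensions) reuse.
h0 = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))
log_probs, h1 = lm_step(42, h0)   # scores for continuations after word 42
```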
In our work, we managed to integrate the current best neural LM, one based on LSTM recurrent neural networks, during the first pass of decoding, while keeping the decoder running faster than real time. You can check all the details in our technical paper; here I will informally summarize the main contributions along with some results and conclusions.
Regarding the integration, our decoder follows a structure that lets us use LMs efficiently and with fewer limitations than most typical Kaldi decoders. The structure is very similar to the one proposed in this thesis (advanced topic!). In brief, the decoder has an internal organization that eases the use of advanced LMs and LM techniques such as LM look-ahead, which helps guide the search. Other aspects, such as advanced pruning methods, also help to reduce the decoding effort without hurting accuracy. Concerning the LSTM LM itself, the main improvements were the use of the GPU to alleviate the computational demand and the use of variance regularization (VR), which reduces the cost of obtaining scores from neural models. If you want to know more, I refer you to the paper, where you can find a detailed description of our system and the techniques we included.
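To give a rough idea of what VR buys you, here is an illustrative sketch assuming the usual formulation, in which the variance of the softmax log-normalizer is penalized during training so that it can be treated as a constant at decoding time; this is my own toy code, not the implementation described in the paper:

```python
# Illustrative sketch of variance regularization (VR) for a neural LM.
# Training penalizes the variance of the log-normalizer log Z, so at decoding
# time the unnormalized logit of a word can be used as an (approximate)
# log-probability, skipping the expensive softmax over the full vocabulary.
import torch

def vr_loss(logits, targets, gamma=0.1):
    """Cross-entropy plus a penalty on the variance of log Z.

    logits:  (batch, vocab_size) unnormalized scores
    targets: (batch,) target word ids
    """
    log_z = torch.logsumexp(logits, dim=-1)                      # (batch,)
    target_logits = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    cross_entropy = (log_z - target_logits).mean()
    variance_penalty = gamma * log_z.var()
    return cross_entropy + variance_penalty

def fast_lm_score(logits, word_id):
    # At decoding time the normalizer is assumed (almost) constant, so the
    # raw logit is used directly as the LM log-score, up to a constant.
    return logits[word_id]
```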
Now for some results. In ASR the main metric is the Word Error Rate (WER), which gives an idea of how far your output is from the correct sentence. It is an error rate, so the lower the better. To give some figures based on my experience, around 15-20% WER means the output is good enough to assist professional transcribers, and 5-10% is considered good enough to provide useful transcriptions without supervision, e.g. for MOOC courses. On the other hand, since we are concerned about the speed of the decoder, we also need a metric for that. In this case we used the Real-Time Factor (RTF), a very straightforward metric computed with the following formula:
\[RTF = \frac{\text{time to recognize the utterance}}{\text{utterance duration}}\]
For example, 1 hour of audio transcribed in 1 hour gives an RTF of 1; if it is transcribed in 30 minutes, the RTF is 0.5. Underlying all of this, we wanted a system that works at RTF < 1, because that would let us consider streaming applications with this decoder in the future.
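Just as a concrete reference, this is essentially how both metrics are computed (a textbook word-level edit distance, not the official scoring tools):

```python
# Illustrative implementation of WER (word-level edit distance) and RTF.

def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / ref length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic programming over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

def rtf(decoding_seconds, audio_seconds):
    """Real-Time Factor: time spent decoding divided by audio duration."""
    return decoding_seconds / audio_seconds

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
print(rtf(1800, 3600))                                       # 0.5
```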
Now that we have the metrics, let's talk about the tasks used to evaluate our approach. We selected the well-known academic datasets LibriSpeech (LS) and TED-LIUM release 3 (TL). LS is a large corpus with ~1K hours of people reading books for training, while TL contains ~400 hours of TED talks.
To summarize the results, I will include just one figure from the paper that illustrates the comparison among the decoders in our framework: HCS one-pass, HCS two-pass and WFST. WFST is the common decoder structure implemented in Kaldi, and HCS is the structure we follow in this work, with one or two decoding steps. The results show that the one-pass decoder yields significant WER improvements compared to WFST, especially when the RTF is greater than 0.4. At a very similar RTF, the one-pass decoding approach achieves relative WER improvements of ∼12% and ∼6% on LibriSpeech and TED-LIUM, respectively. Comparing the HCS decoders, one-pass shows a consistent improvement in RTF, reducing WER (∼6%) on LibriSpeech and obtaining similar accuracy on TED-LIUM, while performing just one decoding step.
Decoder comparison with TED-LIUM
Decoder comparison with LibriSpeech
To conclude, in this work we developed and evaluated a novel one-pass decoder that efficiently integrates a state-of-the-art LSTM RNN LM, obtaining very competitive results with an RTF below 1 that paves the way for using this system under streaming conditions. Indeed, our future efforts will go in that direction, towards a streaming ASR system that integrates the best neural models.
I had the opportunity to experience a different way of doing research, learning about RASR and RETURNN, two powerful frameworks for state-of-the-art ASR. I met very interesting and smart people there, some of them among the top researchers in ASR. I enjoyed sharing an office and having interesting discussions with Wei Zhou, and sharing lectures, presentations and other activities with the rest of the team. It was a great place to extend my knowledge and research experience. In addition, I want to thank Priv.-Doz. Dr. rer. nat. Ralf Schlüter for taking the time to talk and discuss ideas and proposals with me for future research on streaming ASR; I hope we can collaborate in the near future. And finally, I want to thank Prof. Dr.-Ing. Hermann Ney for giving me the opportunity to go through this fulfilling experience.
I visited Cleveland again in May to share some of my Python projects, in particular my poster on the SLEPc example to compute the eigenvalues that model the stability of a nuclear reactor, a very cool application of linear algebra. This is the second version of the poster I presented in Edinburgh at EuroPython.
It’s always a pleasure to meet people from around the world and share and learn about Python. Some known faces and some new ones; meeting them is always an enriching experience. In addition, my poster had the honour of receiving a visit from Python’s creator, Guido van Rossum, so the experience couldn’t have been better.
Along with this version of the poster, I attach some pictures of Cleveland and the conference.
My group, the MLLP research group, participated jointly with the i6 group from RWTH Aachen University. I was the proud presenter of our work at IberSPEECH 2018 in Barcelona, where we presented our speech-to-text systems for the Iberspeech-RTVE challenge. We won both tracks of this international challenge on TV show transcription!
This competition aims to evaluate state-of-the-art ASR (Automatic Speech Recognition) systems for broadcast speech transcription. It comprised two categories:
Closed condition: systems must be trained only with the RTVE2018 database provided by the organizers.
Open condition: systems are allowed to use any data to train the models.
Some of these shows are really challenging, as their content involves different speakers talking at the same time, very noisy field reports, and all kinds of real-life situations like these. In the case of the closed condition, the data cleaning and filtering was a tedious headache, and it led most teams not to submit their systems for evaluation.
This is the poster where we introduced our systems:
You can also find the PDF version at this link, as well as the paper we submitted to this conference, where our system is described in detail.
This is the second poster that I presented at EuroPython 2018. The gist is how to deploy the infrastructure required to create a TensorFlow cluster and then provision the software to train a deep learning model. For this, I used the Infrastructure Manager (http://www.grycap.upv.es/im/index.php), which supports the APIs of different virtualization platforms, making user applications cloud-agnostic.
IM also integrates a contextualization system based on Ansible to install and configure all the required applications, providing a fully functional deep learning infrastructure on the cloud provider we need.
After PyCon US, I visited Edinburgh to attend EuroPython 2018, the largest Python conference in Europe, presenting a poster about part of my work during my M.Sc. in parallel and distributed computing.
I wanted to introduce the SLEPc library, developed at the Universidad Politécnica de València, and the bindings for using it from Python, SLEPc4py, as well as MPI and MPI4py. All the information is gathered in the poster, which can be found here.
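For anyone curious, this is roughly what solving an eigenvalue problem with SLEPc4py looks like (a minimal sketch with a toy tridiagonal matrix, not the reactor-stability code from the poster):

```python
# Minimal slepc4py example: a few eigenvalues of a small symmetric matrix.
# Run with, e.g.: mpiexec -n 1 python eigen_example.py
import sys
import slepc4py
slepc4py.init(sys.argv)

from petsc4py import PETSc
from slepc4py import SLEPc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):            # a simple tridiagonal test matrix
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

eps = SLEPc.EPS().create()               # eigenvalue problem solver
eps.setOperators(A)
eps.setProblemType(SLEPc.EPS.ProblemType.HEP)   # Hermitian problem
eps.setDimensions(nev=4)                 # number of eigenvalues requested
eps.solve()

for i in range(eps.getConverged()):
    print(eps.getEigenvalue(i).real)
```

Thanks to PETSc and MPI underneath, the same script scales to much larger problems simply by running it on more processes.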
In addition, you can find all the recorded talks on the EuroPython channel on YouTube.
These days I’m enjoying an amazing experience at PyCon 2018 in Cleveland. A lot of firsts: first time in the USA, first time at an international PyCon, first time presenting a poster at a conference like this, and many more. It is a great experience that I will remember forever.
Thanks a lot to all the people who came to visit my poster and showed interest in it!!!
You guys make this so big and awesome!!
After some requests, I’m making my poster available (and the LaTeX sources when I have the time, if anyone is interested) at the following link: exploring-generative-models-pycon2018.
And the low-res JPEG version as an image here:
“Exploring generative models with Python” – PyCon2018 Poster
And, one more time, THANK YOU for being so kind to me :).
The idea of this post is to show how you can deploy a basic TensorFlow architecture to train a model, using AWS and the Infrastructure Manager tool. All the code and scripts are on GitHub.
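Just to give a taste of what the provisioned nodes end up running, here is a minimal distributed TensorFlow sketch in the 1.x style of that time; the addresses are local placeholders and this is illustrative, not the exact scripts from the repository:

```python
# Minimal TensorFlow 1.x distributed setup: one parameter server (ps) and two
# workers. In a real deployment, each node provisioned by IM would run this
# script with its own job_name/task_index and the real node addresses instead
# of the localhost placeholders below.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# This process plays the role of worker 0; the ps node would instead run:
#   tf.train.Server(cluster, job_name="ps", task_index=0).join()
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the ps job, computation on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    y = tf.matmul(x, w)

# Once the ps (and the other workers) are up, training runs through a session
# connected to this worker's server:
#   with tf.Session(server.target) as sess:
#       sess.run(tf.global_variables_initializer())
```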