Contributing to society through open-source products: the developers' thoughts behind ReazonSpeech v2, an evolved Japanese speech recognition model
ReazonSpeech v2, released in February 2024, attracted attention as both a Japanese speech corpus (database) and a highly accurate Japanese speech recognition model. Compared to its predecessor, ReazonSpeech, the new version has made significant advances: the corpus has grown 1.8 times to 35,000 hours, and the model's speech recognition speed has improved roughly sevenfold. We spoke with Mr. Mori and Mr. Suenaga of the Reazon Human Interaction Research Institute, the developers behind ReazonSpeech's ongoing updates, to learn more about their goals and aspirations.
Daijiro Mori Director, Reazon Holdings Co., Ltd.
Graduated from Kyushu Institute of Design with a degree in Acoustic Design. Engaged in research and development of information retrieval technology at Nippon Telegraph and Telephone Corporation's Human Interface Laboratory. Developed full-text search software at Future Search Brazil Co., Ltd. Established the Human Interaction Research Institute at Reazon Holdings Co., Ltd.
Tasuku Suenaga Head of Research Planning Department, Reazon Holdings Co., Ltd.
Graduate of the Tokyo College of Engineering. Has a wide range of experience, including work at SNS marketing companies and major IT corporations, and has founded two businesses. Experienced in developing blog search systems, full-text search systems, and game app platforms. Joined his current position in July 2023 at the invitation of Director Mori, a former colleague. Currently focuses on developing multilingual communication support apps and recruiting AI talent.
Creating Convenient Interfaces with Speech Recognition
At the Reazon Human Interaction Research Institute, you're focusing on developing open-source products. Could you tell us why you chose speech recognition technology for this endeavor?
Speech is something very familiar to all of us. One advantage of using speech to communicate is that it is quicker and less effortful than written language. If we could use it not only for face-to-face conversation but also for operating computers and smartphones, that would be very convenient. Just as we type on keyboards and tap with our fingers, building speech recognition into user interfaces makes a lot of sense. There are other reasons as well, but this is one of the motivations behind our development of ReazonSpeech.
ReazonSpeech, released in January 2023, was notable for its extensive corpus of 19,000 hours. With ReazonSpeech v2 released recently, this corpus has expanded to 35,000 hours. Such a massive and openly accessible corpus is unprecedented.
Before ReazonSpeech's release, there was no large-scale corpus available in Japan for unrestricted commercial use. The fact that it is a freely usable speech corpus is one of ReazonSpeech's selling points.
Was this achievement due to some technological advancement?
Previously, if you wanted freely usable speech data without copyright restrictions, you had to invite people to a lab or studio and record them speaking into microphones. ReazonSpeech automated the processing of TV One-Seg broadcast data to build its corpus, which enabled large-scale expansion.
In Japan, the 2019 copyright law revision allowed copyrighted content to be used for data analysis, redistributed, and even used commercially, ahead of other countries. Despite this progressive legislation, however, everyone initially watched each other cautiously. Since no one was taking advantage of such a beneficial law, we thought: why not take the lead and use it ourselves? ReazonSpeech was born from that aspiration.
Utilizing over-the-air broadcasts to achieve a large corpus
I heard it took 3 to 4 years from the law revision to the actual release.
I joined Reazon four years ago and had been steadily accumulating data because I wanted to build such a corpus. For speech recognition, though, you need to organize not just the speech but also the subtitle data, that is, the textual data. Many over-the-air broadcasts carry subtitle information. Before the revision we couldn't use it, but the change in the law made it possible to release such data under a free license, which was significant.
What were the challenges in using over-the-air one-segment broadcasts?
The challenge was system development. Even if both the speech and the subtitle data are broadcast during a program, they don't necessarily arrive at the same time, and the subtitles don't always transcribe the speech verbatim. So in actual development we started by transcribing the speech with existing speech recognition models and comparing the results with the program's subtitles to find matches. We turned these matched speech and subtitle pairs into corpus data. This matching process is called 'alignment,' and writing the program to do it took time and effort.
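To make the idea concrete, here is a minimal sketch of that kind of alignment check, assuming the audio has already been segmented and transcribed by an existing model; the data layout and the similarity threshold are illustrative assumptions, not the actual ReazonSpeech pipeline.

```python
# Minimal alignment sketch: keep only the segments whose ASR transcript
# closely matches a subtitle line from the same program.
from difflib import SequenceMatcher

def align(segments, subtitles, threshold=0.8):
    """segments: (audio, asr_text) pairs; subtitles: subtitle strings."""
    corpus = []
    for audio, asr_text in segments:
        for sub_text in subtitles:
            similarity = SequenceMatcher(None, asr_text, sub_text).ratio()
            if similarity >= threshold:
                # A close match suggests this subtitle transcribes this segment.
                corpus.append({"audio": audio, "text": sub_text, "score": similarity})
                break
    return corpus
```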
Just hearing about it sounds like a daunting task.
Another issue was that the existing speech recognition models used for alignment had to be open and free. Even a highly accurate model was of no use if its license meant the corpus created with it couldn't be freely distributed. So we started by using free speech recognition models for alignment to create small corpora, used those corpora to train our own speech recognition models, and then used those models for alignment to build a larger corpus. We repeated this cycle more than ten times.
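The bootstrapping cycle described here can be pictured roughly as follows; `align_fn` and `train_fn` are hypothetical stand-ins for the far more involved alignment and training steps, so this is an illustration of the loop, not the actual tooling.

```python
# Sketch of the corpus-bootstrapping loop: align broadcasts with the current
# model, train a stronger model on the resulting corpus, and repeat.
def bootstrap(initial_model, broadcast_data, align_fn, train_fn, rounds=10):
    model, corpus = initial_model, []
    for _ in range(rounds):
        corpus = align_fn(model, broadcast_data)  # build a corpus via alignment
        model = train_fn(corpus)                  # train a better model on it
    return corpus, model
```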
Real-time transcription: seven times faster, now with punctuation
So, ReazonSpeech with its 19,000 hours of corpus data resulted from this. ReazonSpeech v2, released just one year later, has significantly improved. What are the differences from version 1?
Not only has the corpus expanded to 35,000 hours, but several open-source speech recognition models have also been released in the past year. We have incorporated NVIDIA's open-source NeMo toolkit into ReazonSpeech v2, and as a result the speech recognition speed has increased sevenfold compared to version 1. Normally there is a trade-off between speed and accuracy, but ReazonSpeech v2 manages to keep both at a high level. That is its selling point.
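As a rough illustration of what using a NeMo-based model can look like, here is a minimal sketch; the model identifier and the loading call are assumptions made for illustration, so please consult the official ReazonSpeech documentation for the recommended procedure.

```python
# Hypothetical sketch of transcribing a local audio file with a NeMo ASR model.
import nemo.collections.asr as nemo_asr

# The identifier below is an assumption, not a confirmed model name.
model = nemo_asr.models.ASRModel.from_pretrained("reazon-research/reazonspeech-nemo-v2")

result = model.transcribe(["interview.wav"])  # path to a local audio file
print(result)  # the exact return structure depends on the NeMo version
```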
It seems the practicality has been greatly enhanced.
Previously with ReazonSpeech, 60 minutes of recording could be transcribed in around 10 minutes. I thought that was already quite practical, but ReazonSpeech v2 is now capable of transcribing in real time while someone is speaking. This opens up possibilities like displaying speech as on-screen text during remote conversations or producing text in real time. You can also dictate longer passages on a computer instead of typing them. It also enables scenarios such as a hearing-impaired person reading, in real time, what a hearing person is saying during a conversation. Its applications extend to all kinds of situations where real-time capability is crucial.
I hear that version 2 now includes punctuation marks, which were not present in version 1.
For ReazonSpeech v2, we kept punctuation marks in the training data rather than stripping them out, so punctuation now appears naturally in the output.
In version 1, we had a process that removed unnecessary symbols and spaces to ensure accurate transcription, and that included cleanly removing punctuation marks. This time we decided to include them because we felt commas and periods are necessary. To make sure the punctuation doesn't disrupt the coherence of the text, we carefully prepared the training data for machine learning.
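As a simplified picture of this kind of cleanup, the sketch below strips decorative broadcast symbols and whitespace while deliberately leaving 、 and 。 in place; the character set shown is an illustrative assumption, not the actual ReazonSpeech preprocessing rule.

```python
# Illustrative cleanup that removes decorative symbols and whitespace but
# keeps Japanese punctuation (、 and 。) so the model learns to output it.
import re

def clean_text(text: str) -> str:
    text = re.sub(r"[【】《》♪〝〟]", "", text)  # drop decorative broadcast symbols
    text = re.sub(r"\s+", "", text)             # Japanese text does not need spaces
    return text

print(clean_text("【速報】 今日は 晴れです。♪"))  # -> 速報今日は晴れです。
```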
It feels like there has been considerable evolution in just the year since our previous conversation. Will future updates continue at this pace?
This has been our philosophy since releasing version 1: we believe continuous releases are crucial. Language evolves along with society. New words emerge and vocabulary changes rapidly. If our speech corpus doesn't reflect these shifts, the model would gradually come to recognize only outdated language. That's why we plan to update it annually.
Moving towards Multi-Modal Development with 'ReazonSpeech v3'
It's a bit early to ask, but what challenges are you planning for version 3 scheduled for next year?
One thing I'd really like to do is multi-modal learning, incorporating not only voice and text but also facial expressions of speakers. Humans use information from various senses—auditory, visual, olfactory, gustatory, tactile—to perceive the world. Our current models utilize only a part of this. Television broadcasts, which serve as our base, naturally include visual information, often showing the face of the speaker when their voice is heard. I think everyone can relate to the fact that during conversations, we not only listen to words but also observe the speaker's facial expressions to glean hints and understand their intentions. If our corpus includes such information, the model will become more sophisticated. We plan to achieve this in version 3.
Besides releasing advanced models, we also want to foster a community. ReazonSpeech is an open tool for anyone to use freely, so we want to create a space where users can share their experiences and contribute to corpus development.
In a broader sense, not just limited to ReazonSpeech, I'm interested in those who are intrigued by human interaction and wish to create open-source products that benefit society. If you share these values, we'd love to have you join us.
Conclusion
Using open innovation and the latest technologies, such as AI, to invigorate society: ReazonSpeech and its successor, ReazonSpeech v2, embody this concept. The Reazon Human Interaction Research Institute and Reazon Holdings welcome individuals who resonate with these values. We look forward to hearing from anyone interested.