- Microsoft last year demonstrated the service using a dedicated device
- Conversation Transcription Service works without any specific hardware
- The new service is currently designed for small meetings
Microsoft at its annual Build developer conference on Monday showed off an advanced speech-to-text offering called Conversation Transcription Service that is a part of Azure Speech Services. The new service is designed not only to recognise speech but also to correctly identify the speakers and transcribe their conversation in real time. Last year, Microsoft notably introduced a device prototype that had an array of microphones to identify attendees in meetings and enable real-time transcriptions. That hardware will be available as a reference through a Developer Device Kit. However, the new offering is touted to enable transcriptions in meetings without the use of any specific hardware.
During the keynote session at the Build 2019 conference in Seattle on Monday, Sonia Dara, Product Marketing Manager, Surface Commercial, Microsoft, demonstrated the Conversation Transcription Service offering that is currently available through Azure Speech Services. The new addition comes as an extension of the development that was showcased last year using dedicated conical-shaped hardware. However, as mentioned earlier, the new service is claimed to be device-agnostic.
Dara demonstrated how the Conversation Transcription Service enabled real-time transcription of a conversation between three speakers using only the inbuilt microphones of a laptop and two nearby smartphones. The service recognises speech, identifies the individual speakers, and transcribes what they say. Organisations can also train the language model of Azure Speech Services through Microsoft 365 to add their unique vocabulary and jargon.
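To make the idea concrete, here is a minimal, purely illustrative sketch of how speaker-tagged recognition events from such a service might be assembled into a readable transcript. The event shape, field names, and speaker labels below are assumptions for illustration, not the actual Azure Speech Services payload.

```python
# Illustrative only: toy speaker-tagged transcription events.
# The TranscriptionEvent shape is a hypothetical stand-in for
# whatever structure a real transcription service emits.
from dataclasses import dataclass


@dataclass
class TranscriptionEvent:
    speaker: str      # speaker label assigned by the service
    offset_ms: int    # when the utterance started, in milliseconds
    text: str         # recognised text of the utterance


def build_transcript(events):
    """Order events by start time and format one line per utterance."""
    lines = []
    for e in sorted(events, key=lambda e: e.offset_ms):
        lines.append(f"[{e.offset_ms / 1000:.1f}s] {e.speaker}: {e.text}")
    return "\n".join(lines)


events = [
    TranscriptionEvent("Speaker 2", 4200, "I can send the deck after the call."),
    TranscriptionEvent("Speaker 1", 1000, "Let's review the roadmap first."),
]
print(build_transcript(events))
```

Because events arrive asynchronously from multiple microphones, sorting by offset before rendering is what keeps the final transcript in conversational order.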
Additionally, developers can pair Conversation Transcription Service with the Speech Devices SDK to optimise the experience on multi-microphone devices. The service is designed to take multichannel audio streams and user profiles as inputs to identify speakers and generate transcriptions.
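The user-profile input mentioned above can be sketched with a toy example. Real speaker identification relies on trained voice signatures; in this hypothetical sketch each "profile" is just a small embedding vector, and an audio segment is attributed to the enrolled profile with the highest cosine similarity. The vectors and names are invented for illustration.

```python
# Illustrative only: attribute an audio segment to an enrolled
# user profile by cosine similarity between embedding vectors.
# Profile names and vector values are hypothetical.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def identify_speaker(segment_embedding, profiles):
    """Return the enrolled profile whose embedding best matches the segment."""
    return max(profiles, key=lambda name: cosine(segment_embedding, profiles[name]))


profiles = {
    "Alice": [0.9, 0.1, 0.0],
    "Bob":   [0.1, 0.8, 0.3],
}
print(identify_speaker([0.85, 0.15, 0.05], profiles))  # prints "Alice"
```

The same matching idea scales to multichannel input: each channel's segments are scored against all enrolled profiles, so the transcript can be labelled per speaker rather than per microphone.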