Comparison of Cloud Providers for STT and NLU Services

In the last post we gave an overview of ready-to-use API services for various branches of AI. In this post we venture into a detailed comparison of Speech-To-Text (STT) and Natural Language Understanding (NLU) service portfolios. One might wonder why these two branches in particular. The reason is that a lot of day-to-day interactions are vocal, and a lot of data can be extracted from these interactions. To do so, speech must first be converted to text and then analyzed.

We will first list the noteworthy features of the services of the big 4 cloud providers (Amazon, Google, IBM and Microsoft) and then provide a brief comparison. While listing features, we paid attention to how many different capabilities each service provides, how easy the API is to use and which interfaces (REST, WebSockets etc.) are available, how confidentially the data is handled (important for GDPR!), and which natural languages are supported.

Amazon AWS Portfolio

Tools/Services/Products to use:

  • Amazon Transcribe is the speech-to-text service; it accepts audio files as well as voice streams (using WebSockets).
  • Transcribe provides channel identification in voice data. This can be used for transcribing calls or multichannel audio conversations.
  • Custom vocabulary is supported but entries are constrained to 256 characters including separators like hyphen.
  • AWS documents various security techniques but does not mention whether the voice data will be used in enhancement of their own algorithms or not.
  • AWS seems to support a large volume of audio files for transcription, but there are overall limits on service usage. For example, only 100 concurrent transcription jobs are supported, and the maximum size of a custom vocabulary is 50 KB. Which limits apply to streaming data is unclear.
  • AWS Comprehend
  • Amazon Comprehend provides Key-phrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection APIs.
  • Custom entities and classification rules can be externally supplied with an AutoML model.
  • 5 KB is the maximum document size accepted for many types of analysis (e.g. document classification). Apart from that, the other limits are reasonable.
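The limits quoted above are easy to check on the client side before anything is sent to AWS. The following is a minimal sketch in plain Python that calls no AWS API. The function names are hypothetical, and the byte values chosen for "50 KB" and "5 KB" are assumptions, so check them against the current AWS quotas before relying on them.

```python
MAX_ENTRY_CHARS = 256          # per-entry limit for custom vocabulary quoted above
MAX_VOCAB_BYTES = 50 * 1024    # "50 KB", assumed here to mean 50 * 1024 bytes
MAX_DOC_BYTES = 5000           # "5 KB", assumed here to mean 5000 bytes


def validate_vocabulary(entries):
    """Return a list of problems; an empty list means the vocabulary fits."""
    problems = []
    for entry in entries:
        if len(entry) > MAX_ENTRY_CHARS:
            problems.append(f"entry too long ({len(entry)} chars): {entry[:20]}...")
    # Estimate the file size as one entry per line (+1 byte for the newline).
    total = sum(len(entry.encode("utf-8")) + 1 for entry in entries)
    if total > MAX_VOCAB_BYTES:
        problems.append(f"vocabulary too large ({total} bytes)")
    return problems


def split_document(text, max_bytes=MAX_DOC_BYTES):
    """Split text on whitespace into chunks of at most max_bytes (UTF-8)."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Caveat: a single word longer than max_bytes is kept whole.
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Splitting on whitespace keeps words intact but ignores sentence boundaries, so per-chunk results such as sentiment would still have to be aggregated by the caller.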

Google Cloud Portfolio

Tools/Services/Products to use:

  • Supports 120 languages!
  • Streaming is supported through a gRPC bi-directional stream. gRPC is an open-source remote procedure call (RPC) framework meant to simplify calling functions on a remote server.
  • A special “phone_call” machine learning model exists for transcribing speech from telephone conversations, but it currently supports only US English. It also costs more than the default recognition model. For all other languages, the default model must be used.
  • Google lets customers opt out of data logging for privacy purposes.
  • Cloud Natural Language
  • The NL processing and understanding offering of GCP is interesting. It offers two avenues: custom models built with Google AutoML, or pretrained models. Even though the AutoML route might look enticing, it has one big hurdle: models must be trained on custom data, which may not be an easy task. Pretrained models offer all standard features like syntax detection, entity extraction etc. for multiple languages out of the box.
  • Currently supported languages are English, Spanish, Japanese, Chinese (simplified and traditional), French, German, Italian, Korean, Portuguese, and Russian
  • Output features available are sentiment analysis, entity analysis, entity sentiment analysis, syntactic analysis and content classification. Not much information has been revealed about which different sentiments can be detected.

Customization of pretrained models (using transfer learning) with custom vocabulary or other similar tools does not seem to be possible or at least not well documented.
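The “phone_call” restriction described above leads to a simple selection rule on the client: use the telephony model only for US English phone audio and fall back to the default model otherwise. The sketch below builds a plain configuration dictionary and calls no Google library. The field names are modelled on the JSON style of the REST API and should be treated as assumptions, and the `telephony` flag and the function name are hypothetical.

```python
def build_recognition_config(language_code, telephony=False, sample_rate_hz=8000):
    """Build a recognition config dict, choosing the model by language.

    Field names (languageCode, sampleRateHertz, model) are assumptions
    modelled on the REST JSON style; verify them against the API reference.
    """
    # The telephony model is only used for US English, as described above.
    use_phone_model = telephony and language_code == "en-US"
    return {
        "languageCode": language_code,
        "sampleRateHertz": sample_rate_hz,
        "model": "phone_call" if use_phone_model else "default",
    }
```

For example, `build_recognition_config("de-DE", telephony=True)` falls back to the default model, because the telephony model is not available for German.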

IBM Watson Portfolio

  • IBM Watson Speech-To-Text works for audio files as well as streaming data. Even for streaming audio, final results are produced only for the complete conversation, i.e. once the conversation has ended. That means contextual information is retained.
  • Features:
      • Speaker and speaker label identification: optimized for 2 speakers, can detect up to 6 speakers. Longer conversations are detected better than shorter utterances. The service generally takes 1 minute to stabilize and provide more accurate output.
      • Interim results: provides interim transcription as the audio progresses; such transcription tends to change in the final transcript.
      • Keyword spotting with probability thresholds: up to 1000 keywords can be spotted in the final transcript. A probability threshold controls how certain the service must be before a keyword is included.
      • Word alternatives and alternative transcripts: the service can provide multiple word alternatives for an unclear utterance, or multiple final transcripts with possible alternatives.
      • Smart formatting: available for US English, Japanese, and Spanish. Converts the following into more conventional formats: dates, times, series of digits and numbers, phone numbers, currency values, internet email and web addresses.
      • (feature name missing in source): only available for US English.
      • Numerical redaction: available for US English, Japanese, and Spanish. Redacts (removes) numbers and sensitive numerical data. This feature disables keyword spotting, interim results and alternative transcripts.
      • Profanity filtering: only available for US English.

  • There are also various processing metric features which can be retrieved. These metrics tell how much of the audio was received, processed and annotated etc. 
  • Conversation data can be deleted after transcription
  • Language support – Arabic (Modern Standard), Brazilian Portuguese, Chinese (Mandarin), English (United Kingdom and United States), French, German, Japanese, Korean, Spanish (Argentinian, Castilian, Chilean, Colombian, Mexican, and Peruvian)
  • Tone Analyzer
  • Tone Analyzer belongs to the natural language portfolio rather than to the voice/speech services. It can only process textual data.
  • It has a special Customer Engagement Endpoint to analyze customer responses. It can be chained to the Speech-To-Text result to find out the overall tone of a conversation.
  • Limitations: it can only process the first 50 utterances, with a maximum of 500 characters per utterance. In total, 128 KB of input data can be processed.
  • Language support is very limited – English and French
  • Can identify the following tones, described as:
      • Showing personal enthusiasm and interest
      • Feeling annoyed and irritable
      • Being disrespectful and rude
      • Rational, goal-oriented behavior
      • An unpleasant passive emotion
      • An affective response to perceived service quality
      • An affective mode of understanding that involves emotional resonance
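Because the Tone Analyzer limits listed above (50 utterances, 500 characters each, 128 KB in total) are applied per request, it can help to trim a transcript on the client before chaining it to tone analysis. The sketch below is plain Python and calls no IBM API. The function name is hypothetical, "128 KB" is assumed to mean 128 * 1024 bytes, and truncating over-long utterances (rather than rejecting them) is a design choice made here, not documented service behaviour.

```python
MAX_UTTERANCES = 50
MAX_CHARS_PER_UTTERANCE = 500
MAX_TOTAL_BYTES = 128 * 1024   # "128 KB", assumed to mean 128 * 1024 bytes


def prepare_utterances(utterances):
    """Trim a conversation so that it fits the limits quoted above.

    Returns (kept, dropped_count): the utterances that would be sent and the
    number of utterances left out.
    """
    kept = []
    total_bytes = 0
    for text in utterances[:MAX_UTTERANCES]:
        trimmed = text[:MAX_CHARS_PER_UTTERANCE]
        size = len(trimmed.encode("utf-8"))
        # Safety net only: 50 utterances of 500 characters stay below 128 KB
        # of raw text, but request overhead is not counted here.
        if total_bytes + size > MAX_TOTAL_BYTES:
            break
        kept.append(trimmed)
        total_bytes += size
    return kept, len(utterances) - len(kept)
```

Returning the number of dropped utterances makes it visible when the overall tone is being judged on only part of a long conversation.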




  • Natural Language Understanding
  • Output features:
      • Content classification; category listing
      • High-level concepts not necessarily mentioned in the text
      • Emotions (US English): detects anger, disgust, fear, joy, or sadness conveyed in the content or by the context
      • Finding people, places, events, and other types of entities mentioned in the content
      • (feature name missing in source): only for HTML and URL input
      • Relations between detected entities
      • Semantic roles: parses sentences into subject-action-object form and identifies entities and keywords that are subjects or objects of an action
      • Positive or negative, with a score between 0 and 1

  • Understanding models can be extended with custom modelling. Only the following output features are available with custom language models: Entities, Relations.

  • Language customization models can only extend the entities, relations, or categories output. No other output features, such as concepts, can be enhanced using custom modelling. Targeted sentiment analysis is possible only for English.

Microsoft Azure Portfolio

Tools/Services/Products to use:

  • Microsoft Azure specializes in transcription of telephony data: it has trained speech recognition models for call center calls, including noisy and unclear recordings as well as real-time audio.
  • Microsoft bundles this technology as the Azure Technology for Call Centers solution, which provides not only Speech-To-Text but also post-call analytics including sentiment, key phrase extraction and search.
  • More than 20 languages are currently supported 
  • Real-time STT is supported over a WebSocket API.
  • Azure has very deep customization support for transcription and lets you customize (i.e. further train the pretrained models) any of the following:
      • Acoustic models (with accented speech, background noises etc.)
      • Language models (industry-specific jargon and vocabulary etc.)
      • Pronunciation models (phonetics, acronyms, specific terms etc.)

This is by far the most advanced customization ability amongst all of the compared providers.

  • The Text Analytics API offers sentiment analysis, key phrase extraction, language detection, and entity recognition.
  • Unfortunately, the sentiment analysis output is rather weak: it only provides a score from 0 to 1 (or 100), where 0 is negative and 1 (or 100) is positive. It does not recognize other sentiments like surprise or irony. Moreover, custom training in this area is not possible.
  • Maximum size of a single document is limited to 5120 characters or 5KB. Maximum size of one request is 1 MB.
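The two size limits above (5120 characters per document, 1 MB per request) suggest grouping documents into batches before calling the service. The following is a minimal sketch in plain Python that calls no Azure API. The function name is hypothetical, "1 MB" is assumed to mean 1024 * 1024 bytes, and the request size is estimated from the text alone, ignoring JSON overhead.

```python
MAX_DOC_CHARS = 5120
MAX_REQUEST_BYTES = 1024 * 1024   # "1 MB", assumed to mean 1024 * 1024 bytes


def batch_documents(texts):
    """Group texts into batches that respect both limits quoted above.

    Texts longer than MAX_DOC_CHARS are rejected rather than silently cut.
    Batch size is estimated from the UTF-8 text only (no JSON overhead).
    """
    batches, current, current_bytes = [], [], 0
    for text in texts:
        if len(text) > MAX_DOC_CHARS:
            raise ValueError(
                f"document has {len(text)} characters, limit is {MAX_DOC_CHARS}"
            )
        size = len(text.encode("utf-8"))
        # Start a new batch when adding this text would exceed the request limit.
        if current and current_bytes + size > MAX_REQUEST_BYTES:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(text)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

In practice a safety margin below the request limit would be sensible, since document IDs and JSON syntax also count towards the request size.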

So, all in all, Azure provides perhaps the best capabilities for speech transcription but poor capabilities for the subsequent language analysis.

Comparison and Verdict

STT services from all the cloud providers except Amazon seem to provide good features. Not only are Amazon's STT capabilities rudimentary, their API is also complex to use. The other providers each have their own unique strengths: Google supports the largest number of languages, Microsoft Azure Cognitive Services STT is optimized for call center solutions and has the strongest customization support, and IBM Watson provides the most features for output transcription.

In NLU services, however, IBM Watson shines. Along with standard features like keyword and entity extraction and sentiment analysis, IBM offers concept detection, tone analysis and emotion detection, which nobody else seems to offer. All these functionalities are available over an easy-to-use WebSocket interface as well as a standard REST interface.

Overall, the offerings from IBM look very promising and detailed. For both types of services, IBM provides full customizability, an easy-to-use API and comprehensive documentation. Microsoft and Google come close in STT capabilities but fall short in their NLU offerings.