Say What? A Non-Scientific Comparison of Automated Transcription Services

An image showing a blue radio wave and a microphone, with the symbols for pause, stop and play.


Earlier this year, while attending a writers’ workshop, freelance science journalist Jessica Wapner stunned the room when the subject of transcription came up. She and another science writer there mentioned that they transcribed all their interviews, and that they did so themselves. “Everyone in the class looked at us like, what? Like we had six eyes,” Wapner recalls. The exchange caused her to rethink her process. Since then, she has become more selective about what she transcribes, and has even dabbled in using automated transcription. She heard about an automated transcription service called Temi and decided to give it a go. So far, she’s been pleased. “Automated transcription seems solid,” Wapner says. “I think when it’s incorrect it’s kind of obvious and so then you go and listen to the recording.”

For as long as it’s been possible to record interviews with sources, journalists have had to grapple with the decision about whether to transcribe this material and, if so, how much. Reporters with abundant financial or institutional resources could hire another human being to listen and type out the interviews, using one of many well-established independent companies that offer human-based transcription services. But the rest had to do it themselves. Now a bevy of transcription services has emerged online that process audio automatically. People can upload their raw digital audio files to the websites of companies such as Otter, Temi, Trint, and Sonix, whose speech-to-text algorithms immediately get to work converting the data. While traditional human transcription typically costs around $1 per minute of audio, automated services run as cheap as $0.10 per minute, or are sometimes free (with limitations). These lower costs have put automated transcription within reach for many journalists.
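To make that price gap concrete, here is a minimal sketch of the arithmetic, using the approximate per-minute rates quoted above. The 60-minute interview length and the function name are assumptions chosen for illustration; actual pricing varies by service and plan.

```python
def transcription_cost(minutes: float, rate_per_minute: float) -> float:
    """Return the cost in dollars of transcribing `minutes` of audio."""
    return round(minutes * rate_per_minute, 2)


# Approximate rates taken from the figures quoted in the article.
HUMAN_RATE = 1.00      # dollars per minute, typical human transcription
AUTOMATED_RATE = 0.10  # dollars per minute, cheapest paid automated tier

# Assumption: a hypothetical one-hour interview.
interview_minutes = 60

human_cost = transcription_cost(interview_minutes, HUMAN_RATE)
automated_cost = transcription_cost(interview_minutes, AUTOMATED_RATE)

print(f"Human transcription:     ${human_cost:.2f}")
print(f"Automated transcription: ${automated_cost:.2f}")
print(f"Difference:              ${human_cost - automated_cost:.2f}")
```

At these rates, the hypothetical hour-long interview costs about ten times as much to transcribe by hand as by machine.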

Science writers face a special challenge with outsourcing transcription—regardless of whether it’s to another human being or to a machine—because of the technical words that often come up in dialogue with sources. Katarina Zimmer, a freelance science and data journalist based in New York City, used to transcribe all of her interviews herself because of problems she encountered with automated transcription services. “I tried Trint but found it useless for science-related interviews that are mostly done in suboptimal conditions like Skype,” she says. For example, for an ecology story she reported, the automated service kept transcribing “dingoes” as “ding-dongs.”

Carl Zimmer (no relation to Katarina) says he thinks automated transcription has only recently passed an accuracy threshold that makes it useful. “I remember dreaming of this coming,” says Zimmer, a columnist for The New York Times and author of numerous books about science.

The flourishing options for automated transcription have left some reporters scratching their heads over which one is best. Comparisons of automated transcription services have reached differing conclusions. A thorough appraisal from Poynter concluded that Trint was “the best all-around automatic transcription tool for journalists.” Meanwhile, a digital marketing agency called 47 Insights tested six transcription services and concluded that Otter and Descript came out on top among the automated options. The New York Times’ review site Wirecutter picked Temi. In PC Magazine’s comparison, several automated transcription services tied for the best rating.


Putting Automated Services to the Test

To get a better idea of how different transcription services stacked up for science journalism, The Open Notebook sent two 15-minute interview samples for previously reported stories to four companies that offer automated transcription: Temi, Otter, Sonix, and Trint. We also sent the same samples to Rev, a traditional human-transcription service. Both conversations were in English and were recorded on an Olympus digital voice recorder. One interview had taken place by phone with a male geneticist born in Taipei, Taiwan, whose native language was not English and who spoke with an accent (the Olympus device received direct audio input via a landline phone connector), and the other was recorded in person with a female endocrinologist in Cambridge, U.K., who had a British accent.

For the phone interview with the scientist who was a non-native English speaker, the transcription service that came out on top in terms of accuracy was the traditional, human transcription service, Rev. It took 4 hours and 41 minutes for the transcript to be returned, but Rev’s accuracy (although not perfect) far surpassed that of the automated services.

One surprising result was how the different services transcribed the words “Cambrian Explosion,” a pivotal event around 540 million years ago when certain complex animals first appeared in the fossil record. The human Rev transcriber could not decipher the words and noted them as “inaudible,” and a couple of automated services even transcribed the phrase as “Canberra Explosion” (pity the residents of that Australian city!). Notably, however, Otter got it right.

The second interview, conducted with a native English speaker, came back more quickly from Rev than the first one did—it took exactly two hours. For this interview, the difference in accuracy between Rev and automated transcription services was marginal. For example, all the automated systems accurately relayed technical terms such as “hypothalamus” and “leptin.”

Among the automated systems we tested, we found little difference in performance or speed. All four returned transcriptions much faster than Rev did: They took between five and nine minutes to return results. And the accuracy of the machine-transcribed text seemed to have more to do with the quality of the audio recording and the clarity of pronunciation than with which service was used.
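The comparison described here was informal, and no numeric accuracy scores were calculated. Readers who want to score their own samples could use word error rate, a common measure of transcription accuracy that counts the word substitutions, insertions, and deletions needed to turn a machine transcript into a hand-checked reference, then divides by the number of words in the reference. The sketch below is one simple way to compute it and is not the method used for this article. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by
    the number of words in the reference transcript."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # word missing from hypothesis
                current_row[j - 1] + 1,                   # extra word in hypothesis
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Made-up sample sentences, not excerpts from the test interviews.
reference = "the cambrian explosion"
machine_transcript = "the canberra explosion"
print(f"Word error rate: {word_error_rate(reference, machine_transcript):.2f}")
```

A score of 0 means the two transcripts match word for word; higher scores mean more errors. This simple version does not strip punctuation, so "explosion." and "explosion" would count as different words.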

Safeena Walji, a spokesperson for Rev (the parent company of Temi), says that journalists can increase the success of an automated transcription by, for example, using a high-quality microphone or reliable recording app to capture clear sound, and by waiting for sources to finish speaking to avoid cross-talk. “There are now apps or tools that you can also run an audio file through that can wipe out background noise,” she notes, referring to products such as Denoise (for iPhone), Video Noise Reducer (for Android) or Clear Cloud from Babble Labs. Doing so can help improve automated transcription quality, she adds.

Although most of the automated transcription tools have similar user interfaces, there are subtle differences. For example, excerpts copied and pasted from Trint transcripts will include timestamps, and Temi’s interface shows words in orange to indicate that the algorithm is less confident it has captured them correctly.

Carl Zimmer says the automated transcription services he’s tried have had well-designed user interfaces that include features such as a search bar and speed controller. These features can offset the drag of having to deal with transcription inaccuracies because they enable him to get to good quotes quickly. However, whenever he uses automated transcription, he always returns to the original audio files to double-check that the transcribed words are correct. As he says: “If I get it wrong, it’s on me, not some AI.”

Security Concerns

As science writers have begun uploading more files to online transcription services, some have wondered about the security implications of doing so, including whether it’s wise to let outside companies handle their interview recordings. Such concerns came to the fore this past summer, when Google announced a temporary suspension of voice recording processing by its Google Assistant product in the European Union. According to CNBC, Google had learned that a contractor it had partnered with to improve the tool’s accuracy had leaked snippets of more than 1,000 private conversations, including some that contained medical information, to a media outlet. Although Google Assistant isn’t a transcription service, the incident underscored that high-tech speech recognition systems are vulnerable to security issues.

The transcription services that The Open Notebook tested all say they have rigorous policies to ensure the security of customers’ files. For example, Rev (which owns Temi) says it uses bank-level security for its products, meaning that it encrypts all user data it stores or transmits, and that it does daily backups to a different secure location. It also permanently deletes customer files on request.

When contacted about security, Otter emphasized that it works with third parties to monitor possible malicious threats to its cloud-computing environment. It also purges data from users’ trash folders automatically after 30 days (or sooner if initiated by the user) and thereafter retains no copy of the recording or transcript. Otter notes that it does not honor requests via subpoena, but law enforcement can gain access to content via a search warrant.

Despite transcription companies’ assurances of data protection, the hazard of security breaches and unwanted handover of content to law enforcement still looms as a possibility. That’s why the idea of uploading sensitive interviews to a computer cloud such as those used by these automated transcription services doesn’t fly for some journalists. “I can’t take the risk for the work that I do,” says Charles Piller, an investigative correspondent for Science who never uses any transcription services. “I would just be too paranoid that, in the very unlikely circumstance that there was a data breach, that my sources would be vulnerable. My credibility as an investigative reporter would be at risk.” *


Accents versus Automation

Given the global nature of science and science journalism, it’s worth noting that automated transcription is available in some non-English languages. Trint, for example, offers support for 28 languages, including Hungarian, Hindi, Mandarin Chinese, Japanese, and Latvian. In September, the company also began offering a translation service. Users can now have the company translate their Trint transcripts to and from any of the 28 languages it supports. Sonix offers transcription and translation in a range of languages as well, including Arabic and Indonesian.

Accents can also complicate transcription, however. Journalists doing interviews in English often fret about how well automated transcription services work when at least one speaker has a non-American accent. Australian science journalist Dyani Lewis says automated transcription services will occasionally be thrown by her accent. But she says they do a worse job when transcribing her neighboring Kiwis: “I have to say that I’ve recently done a story with a few people from New Zealand and, boy, automated transcription really doesn’t cope well with New Zealand accents.”

Priyanka Pulla, a science and medical journalist who reports mainly from India, notes that she’s seen automated transcriptions trip up on specialist terms in her interviews. She wonders whether her accent might contribute to this issue. “I suspect—I obviously don’t know for sure, since I don’t have any real control sample—that it doesn’t work well with Indian accents,” Pulla says.

It’s not surprising that automatic transcription services would struggle with some accents, says Joseph Fridman, who researches the methods of science communication, including issues relating to the politics and practice of transcription, at the Interdisciplinary Affective Science Lab at Northeastern University/Massachusetts General Hospital. The computer algorithms that power those services develop a bias depending on the training audio used to create them, he says, and they impose that bias on what they process. “The machine brain eats certain types of language and makes everyone else conform to it,” Fridman explains.

That’s not just an inconvenience to journalists, Fridman cautions—it’s also a potential threat to diversity and inclusion in sourcing, because reporters might subconsciously select sources who are perceived as responding in Standard American English in order to ensure a cleaner transcription. He urges journalists to guard against this tendency to internalize the biases that automated transcription might encourage against speakers with accents. “Whenever anything says it’s going to make something easier for us, it’s very tempting to stop thinking critically about it.”



* Correction 12/13/19: In an earlier version of this story, Charles Piller was incorrectly quoted as describing himself as an “investigational” reporter when in fact he had described himself as an “investigative” reporter. This was noted correctly in the reporter’s notebook, but an error was introduced in typing the quote from the audio file, proving that humans—just like machines—are fallible.


Roxanne Khamsi Brian Friedman

Roxanne Khamsi is a science writer whose work has appeared in publications such as Scientific American, Wired, The Economist and The New York Times Magazine. She formerly oversaw news coverage at Nature Medicine. Follow her on Twitter at @rkhamsi. 
