
Say What? A Non-Scientific Comparison of Automated Transcription Services

An image showing a blue radio wave and a microphone, with the symbols for pause, stop and play. — South_agency/iStock

Earlier this year, while attending a writers’ workshop, freelance science journalist Jessica Wapner stunned the room when the subject of transcription came up. She and another science writer there mentioned that they transcribed all their interviews, and that they did so themselves. “Everyone in the class looked at us like, what? Like we had six eyes,” Wapner recalls. The exchange caused her to rethink her process. Since then, she has become more selective about what she transcribes, and has even dabbled in using automated transcription. She heard about an automated transcription service called Temi and decided to give it a go. So far, she’s been pleased. “Automated transcription seems solid,” Wapner says. “I think when it’s incorrect it’s kind of obvious and so then you go and listen to the recording.”

For as long as it’s been possible to record interviews with sources, journalists have had to grapple with the decision about whether to transcribe this material, and if so, how much. Reporters with abundant financial or institutional resources could hire another human being to listen and type out the interviews, using one of many well-established independent companies that offer human-based transcription services. But the rest had to do it themselves. Now a bevy of transcription services have emerged online that process audio automatically. People can upload their raw digital audio files to the websites of companies such as Otter, Temi, Trint, and Sonix, whose speech-to-text algorithms immediately get to work converting the data. While the cost of traditional human transcription is typically around $1 per minute of audio, automated services cost as little as $0.10 per minute, or are sometimes free (with limitations). These lower costs have put automated transcription within reach for many journalists.

Science writers face a special challenge with outsourcing transcription—regardless of whether it’s to another human being or to a machine—because of the technical words that often come up in dialogue with sources. Katarina Zimmer, a freelance science and data journalist based in New York City, used to transcribe all of her interviews herself because of problems she encountered with automated transcription services. “I tried Trint but found it useless for science-related interviews that are mostly done in suboptimal conditions like Skype,” she says. For example, for an ecology story she reported, the automated service kept transcribing “dingoes” as “ding-dongs.”

Carl Zimmer (no relation to Katarina) says he thinks automated transcription has only recently passed an accuracy threshold that makes it useful. “I remember dreaming of this coming,” says Zimmer, a columnist for The New York Times and author of numerous books about science.

The flourishing options for automated transcription have left some reporters scratching their heads over which one is best. Comparisons of automated transcription services have come up with differing conclusions. A thorough appraisal from Poynter concluded that Trint was “the best all-around automatic transcription tool for journalists.” Meanwhile, a digital marketing agency called 47 Insights tested six transcription services and concluded that Otter and Descript came out on top among the automated options. The New York Times’ review site Wirecutter picked Temi. In PC Magazine’s comparison, there was a tie for best rating among several automated transcription services.

Putting Automated Services to the Test

To get a better idea of how different transcription services stacked up for science journalism, The Open Notebook sent two 15-minute interview samples for previously reported stories to four companies that offer automated transcription: Temi, Otter, Sonix, and Trint. We also sent the same samples to Rev, a traditional human-transcription service. Both conversations were in English and were recorded on an Olympus digital voice recorder. One interview had taken place by phone with a male geneticist born in Taipei, Taiwan, whose native language was not English and who spoke with an accent (the Olympus device received direct audio input via a landline phone connector), and the other was recorded in person with a female endocrinologist in Cambridge, U.K., who had a British accent.

For the phone interview with the scientist who was a non-native English speaker, the transcription service that came out on top in terms of accuracy was the traditional, human transcription service, Rev. It took 4 hours and 41 minutes for the transcript to be returned, but Rev’s accuracy (although not perfect) far surpassed that of the automated services.

One surprising result was how the different services transcribed the words “Cambrian Explosion,” a pivotal event around 540 million years ago when certain complex animals first appear in the fossil record. The human Rev transcriber could not decipher the words and noted them as “inaudible,” and a couple of automated services even transcribed the phrase as “Canberra Explosion” (pity the residents of that Australian city!). Notably, however, Otter got it right.

The second interview, conducted with a native English speaker, came back more quickly from Rev than the first one did—it took exactly two hours. For this interview, the difference in accuracy between Rev and automated transcription services was marginal. For example, all the automated systems accurately relayed technical terms such as “hypothalamus” and “leptin.”

Among the automated systems we tested, we found little difference in performance or speed. All four returned transcriptions much faster than Rev did: They took between five and nine minutes to return results. And the accuracy of the machine-transcribed text seemed to have more to do with the quality of the audio recording and clarity of pronunciation than with which service was used.

Safeena Walji, a spokesperson for Rev (the parent company of Temi), says that journalists can increase the success of an automated transcription by, for example, using a high-quality microphone or reliable recording app to capture clear sound, and by waiting for sources to finish speaking to avoid cross-talk. “There are now apps or tools that you can also run an audio file through that can wipe out background noise,” she notes, referring to products such as Denoise (for iPhone), Video Noise Reducer (for Android) or Clear Cloud from Babble Labs. Doing so can help improve automated transcription quality, she adds.

Although most of the automated transcription tools have similar user interfaces, there are subtle differences. For example, excerpts copied and pasted from Trint transcripts will include timestamps, and Temi’s interface shows words in orange to indicate that the algorithm is less confident it has captured them correctly.

Carl Zimmer says the automated transcription services he’s tried have had well-designed user interfaces that include features such as a search bar and speed controller. These features can offset the drag of having to deal with transcription inaccuracies because they enable him to get to good quotes quickly. However, whenever he uses automated transcription, he turns to the primary audio files to double-check that the transcribed words are correct. As he says: “If I get it wrong, it’s on me, not some AI.”

To Transcribe or Not to Transcribe

Some extraordinary journalists who know shorthand can take continuous, verbatim notes on paper as someone speaks. Another rare breed of reporter is one who can type fast enough to capture the words of their sources accurately as the interview transpires in real time. Cara Giaimo, a freelance science journalist whose average typing speed approaches 100 words per minute, says she records conversations as back-up, but generally types everything as speakers talk. She says the practice has saved her “countless hours” of time and improved her typing skills.

The rest of us, however, have to grapple with how much—and how often—to transcribe interviews. To capture more than just fragments of an interview word-for-word, most reporters today rely on audio recording devices and either human or automated transcription afterward.

There are upsides to listening back to audio and transcribing it. For example, when a reporter does this, they can pick up on subtleties such as pauses and laughs that they might not have captured in their real-time notes, Wapner says.

Many journalists say that transcribing everything—either themselves or using a service—can cause unnecessary delays when they are doing fast-turnaround deadline reporting. Carl Zimmer explains that his conversations for newspaper articles are often between 15 and 30 minutes long and he can quickly zero in on the right place in his notes and recordings to pick up key quotes. He says complete transcriptions, using automated services, are more valuable to him for magazine features or books, where he might spend hours upon hours with sources and need to revisit key conversations in more detail.

“It’s a trade-off, and so you have to look at your budget—your budget of money and your budget of time,” Zimmer says. “The automated transcription companies are starting to change how I look at that trade-off because they’re easier and faster, and in a lot of cases, cheaper.”

Security Concerns

As science writers have begun uploading more files to online transcription services, some have wondered about the security implications of doing so, including whether it’s wise to let outside companies handle their interview recordings. Such concerns came to the fore this past summer, when Google announced a temporary suspension of voice recording processing by its Google Assistant product in the European Union. According to CNBC, Google had learned that a contractor it had partnered with to improve the tool’s accuracy had leaked snippets of more than 1,000 private conversations, including some that contained medical information, to a media outlet. Although Google Assistant isn’t a transcription service, the incident underscored that high-tech speech recognition systems are vulnerable to security issues.

The transcription services that The Open Notebook tested all say they have rigorous policies to ensure the security of customers’ files. For example, Rev (which owns Temi) says it uses bank-level security for its products, meaning that it encrypts all user data it stores or transmits, and that it does daily backups to a different secure location. It also permanently deletes customer files on request.

When contacted about security, Otter emphasized that it works with third parties to monitor possible malicious threats to its cloud-computing environment. It also purges data from users’ trash folders automatically after 30 days (or sooner if initiated by the user) and thereafter retains no copy of the recording or transcript. Otter notes that it does not honor requests via subpoena, but law enforcement can gain access to content via a search warrant.

Despite transcription companies’ assurances of data protection, the hazard of security breaches and unwanted handover of content to law enforcement still looms as a possibility. That’s why the idea of uploading sensitive interviews to a computer cloud such as those used by these automated transcription services doesn’t fly for some journalists. “I can’t take the risk for the work that I do,” says Charles Piller, an investigative correspondent for Science who never uses any transcription services. “I would just be too paranoid that, in the very unlikely circumstance that there was a data breach, that my sources would be vulnerable. My credibility as an investigative reporter would be at risk.” *

Accents versus Automation

Given the global nature of science and science journalism, it’s worth noting that automated transcription is available in some non-English languages. Trint, for example, offers support for 28 languages, including Hungarian, Hindi, Chinese Mandarin, Japanese, and Latvian. Beginning in September, the company also started offering a translation service. Users can now have the company translate their Trint transcripts from and to any of the 28 languages it supports. Sonix offers transcription and translation in a range of languages as well, including Arabic and Indonesian.

Accents can also complicate transcription, however. Journalists doing interviews in English often fret about how well automated transcription services work when at least one speaker has a non-American accent. Australian science journalist Dyani Lewis says automated transcription services will occasionally be thrown by her accent. But she says they do a worse job when transcribing her neighboring Kiwis: “I have to say that I’ve recently done a story with a few people from New Zealand and, boy, automated transcription really doesn’t cope well with New Zealand accents.”

Priyanka Pulla, a science and medical journalist who reports mainly from India, notes that she’s seen automated transcriptions trip up on specialist terms in her interviews. She wonders whether her accent might contribute to this issue. “I suspect—I obviously don’t know for sure, since I don’t have any real control sample—that it doesn’t work well with Indian accents,” Pulla says.

It’s not surprising that automatic transcription services would struggle with some accents, says Joseph Fridman, who researches the methods of science communication, including issues relating to the politics and practice of transcription, at the Interdisciplinary Affective Science Lab at Northeastern University/Massachusetts General Hospital. The computer algorithms that power those services develop a bias depending on the training audio used to create them, he says, and they impose that bias on what they process. “The machine brain eats certain types of language and makes everyone else conform to it,” Fridman explains.

That’s not just an inconvenience to journalists, Fridman cautions—it’s also a potential threat to diversity and inclusion in sourcing, because reporters might subconsciously select sources who are perceived as responding in Standard American English in order to ensure a cleaner transcription. He urges journalists to guard against this tendency to internalize the biases that automated transcription might encourage against speakers with accents. “Whenever anything says it’s going to make something easier for us, it’s very tempting to stop thinking critically about it.”

Comparing Transcription Services’ Accuracy

We sent two audio samples, each 15 minutes long, to four different automated transcription services as well as Rev’s human-based transcription service. Transcription errors are in red. Although the different transcription services made different mistakes, based on this unscientific comparison, the rate of accuracy was similar.
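
The comparison here was made by eye, and the article does not describe a scoring method. For readers who want to put a rough number on their own transcripts, one common measure is word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the reference text into the transcript, divided by the number of words in the reference. The sketch below is only an illustration under stated assumptions. The function names `tokenize` and `word_error_rate` are made up for this example, punctuation and capitalization are ignored, and the two phrases are shortened from Sample 1.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9'’-]+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Illustrative helper, not the scoring used in the article.
    """
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


actual = "you go back there, to Cambrian Explosion"
machine = "you go back to Canberra explosion"
print(f"{word_error_rate(actual, machine):.2f}")  # prints 0.29 (2 errors over 7 reference words)
```

A score like this treats every word as equally important, so a mangled technical term counts the same as a dropped "so." That is one reason a manual read-through of key quotes still matters.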

Sample 1: Male geneticist in China who is a non-native English speaker talking over a phone line

Actual

I mean, so, it’s interesting. It’s sort of idle thinking, that, you say, “Well, you go back there, to Cambrian Explosion.” You cannot go back. But cancer is an evolution process that reiterates itself endlessly. And so we really have a system in cancer biology we can test this principle of reproducibility. And the TCGA data show precisely what evolutionist would have expected. But the process of evolution is not reproducible, and since cancer evolves… the people who… the medical community… medical doctors are the least receptive to the idea of evolution.

Rev

I mean, it’s interesting. It’s full of idle thinking, that, you say, “Wow let’s go back there to [inaudible 00:10:06]”. But cancer is an evolution process that reiterates itself endlessly. And so we really have a system in place in biology we can test this principle of reproducibility. And the TCG are so precisely what the evolutionist would have expected but the path of evolution is not reproducible, and since cancer evolved the people who, the medical community, medical doctors are the least receptive to the idea of evolution.

Temi

I mean it’s interesting. It’s sort of, I go thinking that you say, well you go back to kindred and explosion then go back. So but process correct itself endlessly. And so we really have a system in cancer biology. We can test this principle of reproducibility and the TCGA data show precisely what evolution is and would have had expected that the process of evolution is not reproducible. Since cancer evolved, the people who the medical communities, medical doctors are the least. We’ve accepted, uh, to the idea of evolution.

Trint

I mean it’s interesting. It’s sort of idle thinking that you say well you go back to Canberra explosion. You cannot go back. So. But cancer is an evolution process that will correct itself endlessly. And so we really have a system intensive biology. We can test this principle of reproducibility in the pieces you can show precisely what evolution is I would have expected that the process of evolution is not reproducible since Kensington rob the people. Who. The medical community never come in medical practice are the least receptive to the idea of evolution.

Otter

I mean, be so interesting, take this, thinking that you say, well, you go back in Cambrian explosion is gonna go back. So cancer is an evolution process that correct itself endlessly. And so we really have a system intensive biology, we can test this principle of reproducibility. And the TCG data show precisely what evolution is would have expected, the process of evolution is not reproducible. [Missing sentence] People, medical community, medical doctors, to the idea of evolution.

Sonix

[Missing sentence] It’s sort of idle thinking that you say, well, you go back to Canberra explosion. You cannot go back. So. But cancer is an evolution process that will correct itself endlessly. And so we really have a system, intensive biology. We can test this principle of reproducibility in the pieces you think are. Precisely what evolutionists would have expected that the process of evolution is not reproducible since Kensington Rob, the people who the medical community never come in medical practice are the least receptive to the idea of evolution.

Sample 2: Female endocrinologist recorded in person speaking in a British accent

Actual

The starting point is very much the brain and the fact that the brain integrates signals from the periphery about nutrient stores, including the hormone Leptin, but also other nutritional signals and also signals that relate to meal, start and the end of meals, for satiety. But then when we focus … zoom in on the hypothalamus, we know from studies in animals that there are these key specialized nuclei or regions in the hypothalamus.

Rev

Temi

The starting point is very much the brain, uh, and the fact that the brain integrates signals from the periphery about nutrient stores, including the Hormone Leptin, uh, but also other nutritional signals and also signals that relate to meal starting the end of meals. Yeah. For satiety. Um, but then when we focus zoom in on the Hypothalamus, we know from studies in animals that there are these key specialized nuclei or regions in the hypothalamus.

Trint

The starting point is very much the brain and the fact that the brain integrates signals from the periphery about nutrient stores including the hormone leptin but also other nutritional signals and also signals that relate to meal start and end of meal for satiety. But then when we focus zoom in on the hypothalamus we know from studies in animals that there are these key specialized nuclei or regions in the hypothalamus.

Otter

Starting point is very much the brain. And the fact that the brain integrate signals from the periphery about nutrients stores, including the hormone leptin, but also other nutritional signals, and also signals that relate to meal, starting the end of meals for society. But then, when we first zoom in on the hypothalamus, we know from studies and animals that there are these key specialized nuclei regions in the hypothalamus.

Sonix

The starting point is very much the brain and the fact that the brain integrates signals from the periphery about nutrient stores, including the hormone leptin, but also other nutritional signals and also signals that relate to meal start and end of meal for satiety. But then when we focus, zoom in on the hypothalamus, we know from studies in animals that there are these key specialized nuclei or regions in the hypothalamus.

* Correction 12/13/19: In an earlier version of this story, Charles Piller was incorrectly quoted as describing himself as an “investigational” reporter when in fact he had described himself as an “investigative” reporter. This was noted correctly in the reporter’s notebook, but an error was introduced in typing the quote from the audio file, proving that humans—just like machines—are fallible.

Roxanne Khamsi is a science writer whose work has appeared in publications such as Scientific American, Wired, The Economist and The New York Times Magazine. She formerly oversaw news coverage at Nature Medicine. Follow her on Twitter at @rkhamsi.

Transcription Service	Cost
Rev (human transcription)	$1/min
Temi	$0.10/min
Otter	$9.99/month for 100 hours (Free plan offers first 10 hours a month for free)
Sonix	$10/hour
Trint	$15/hour (Price decreases with larger package plans)
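
Because the services above quote prices per minute, per hour, or per month, it can help to put them on a common footing. The sketch below is a rough estimate only, using the list prices in the table as reported at the time. The function names are hypothetical, Trint's package discounts are ignored, and Otter is modeled under the assumption that usage up to 10 hours a month is free and usage up to 100 hours a month costs the flat $9.99 fee. Anything beyond that is not covered by the table, so the sketch returns `None`.

```python
def otter_monthly_fee(minutes_per_month):
    """Assumed Otter model: free up to 10 hours, $9.99 up to 100 hours."""
    hours = minutes_per_month / 60
    if hours <= 10:
        return 0.0
    if hours <= 100:
        return 9.99
    return None  # beyond the plan described in the table


def estimate_costs(minutes_per_month):
    """Estimated monthly cost in US dollars for each service in the table."""
    return {
        "Rev (human)": round(1.00 * minutes_per_month, 2),   # $1 per minute
        "Temi": round(0.10 * minutes_per_month, 2),          # $0.10 per minute
        "Otter": otter_monthly_fee(minutes_per_month),       # flat monthly fee
        "Sonix": round(10.00 * minutes_per_month / 60, 2),   # $10 per hour
        "Trint": round(15.00 * minutes_per_month / 60, 2),   # $15 per hour
    }


# Example: the two 15-minute samples used in this comparison (30 minutes total).
for service, cost in estimate_costs(30).items():
    label = "not covered" if cost is None else f"${cost:.2f}"
    print(f"{service}: {label}")
```

For a reporter who transcribes only a few short interviews a month, the per-minute services and the free tier come out close. The flat monthly fee starts to pay off as recording hours grow.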