Microsoft claimed to reach a 5.9 percent word error rate last October using neural language models resembling associative word clouds. At the time, the company believed 5.9 percent was equivalent to human parity. But IBM says it's not popping the champagne yet. "As part of our process in reaching today's milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent," George Saon, IBM principal research scientist, wrote in a blog post this week.
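Word error rate, the metric behind both companies' claims, is conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch (the function name and example sentences are illustrative, not from either company's evaluation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words (standard Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words → 1/6 ≈ 0.167, i.e. 16.7 percent.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this yardstick, a 5.5 percent rate means roughly one word in eighteen is substituted, inserted, or dropped relative to a human transcript.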
IBM reached the 5.5 percent milestone by combining so-called Long Short-Term Memory, a type of recurrent neural network, and WaveNet language models with three strong acoustic models. Performance was then measured using the "SWITCHBOARD" corpus, a collection of telephone conversations that has been used as a benchmark for speech recognition software for decades. There is no industry standard for measuring human parity, however, which makes breakthroughs harder to verify.
"The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex," said Julia Hirschberg, professor and chair of the Department of Computer Science at Columbia University, in a statement provided to IBM. "It's also difficult to define human performance, since humans also vary in their ability to understand the speech of others."


