All rights reserved by the author.
For commercial reproduction, please contact the author for permission; for non-commercial reproduction, please credit the source.
Author: Bill Xia
Link: http://www.zhihu.com/question/21505605/answer/19031836
Source: Zhihu
There are many speech recognition development platforms; they are summarized below (for a more detailed introduction, see my blog post "An overview and comparison of several common speech interaction platforms").
1. Commercial speech interaction platforms
1) Microsoft Speech API
Microsoft's Speech API (SAPI) is an application programming interface (API) that bundles Microsoft's speech recognition (SR) and speech synthesis (SS) engines, and it is widely used on Windows. Microsoft has released several SAPI versions (the latest is SAPI 5.4), distributed either as part of the Speech SDK or bundled directly with the Windows operating system. SAPI supports recognition and text-to-speech in multiple languages, including English, Chinese, and Japanese.
2) IBM ViaVoice
IBM was one of the earliest organizations to work on speech recognition. As early as the late 1950s, IBM began researching speech recognition, designing computers to detect specific language patterns and derive statistical correlations between sounds and their corresponding text. In 1999, IBM released a free edition of VoiceType. In 2003, IBM granted ScanSoft exclusive worldwide distribution rights to the ViaVoice-based desktop products; ScanSoft later merged with Nuance. Today ViaVoice has long since faded from view, and Nuance has taken its place.
3) Nuance
Nuance Communications is a multinational computer software company headquartered in Burlington, Massachusetts, USA, that mainly provides speech and imaging solutions and applications. Its current business focuses on server-based and embedded speech recognition, telephone call steering systems, automated directory assistance, and similar services. Beyond speech recognition, Nuance's speech technologies also cover speech synthesis, voiceprint (speaker) recognition, and more. Over 80% of speech recognition in the world speech technology market runs on Nuance's recognition engine; the company holds more than 1,000 patents, its speech products support over 50 languages, and it has more than 2 billion users worldwide. The Siri voice assistant on Apple's iPhone 4S uses Nuance's speech recognition service.
4) iFLYTEK
iFLYTEK, the largest intelligent speech technology provider in China, has a long record of research in intelligent speech technology and holds internationally leading results in Chinese speech synthesis, speech recognition, spoken-language assessment, and other areas. It holds more than 60% of the Chinese speech technology market, and its speech synthesis products have a market share of over 70%.
5) Others
Other influential commercial speech interaction platforms include Google Voice Search and the voice input methods from Baidu and Sogou.
2. Open-source speech interaction platforms
1) CMU Sphinx
CMU Sphinx, often simply called Sphinx, is an open-source speech recognition system developed at Carnegie Mellon University (CMU); it comprises a series of speech recognizers and acoustic model training tools. The earliest version, Sphinx-I, was developed around 1987 by Kai-Fu Lee. It used fixed HMMs (with three codebooks of size 256) and was billed as the first high-performance continuous speech recognition system (reaching over 90% accuracy on the Resource Management corpus). The current Sphinx speech recognition system includes the following packages:
§ Pocketsphinx - recognizer library written in C.
§ Sphinxbase - support library required by Pocketsphinx
§ Sphinx4 - adjustable, modifiable recognizer written in Java
§ CMUclmtk - language model tools
§ Sphinxtrain - acoustic model training tools
The executables and source code of these packages can all be downloaded for free from SourceForge.
2) HTK
HTK is short for Hidden Markov Model Toolkit and is used mainly for speech recognition research. It was originally developed in 1989 by the Machine Intelligence Laboratory (formerly the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED) and was used to build CUED's large-vocabulary speech recognition systems. The latest HTK release is version 3.4.1, published in 2009; the underlying principles and the usage of the individual tools are documented in the HTKBook.
3) Julius
Julius is an open-source, high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) engine aimed at researchers and developers. Using 3-gram language models and context-dependent HMMs, it can perform real-time recognition on a current PC with a vocabulary of 60k words.
4) RWTH ASR
This toolkit contains implementations of state-of-the-art automatic speech recognition algorithms and is developed by the Human Language Technology and Pattern Recognition Group at RWTH Aachen University. The RWTH ASR toolkit covers core components such as acoustic model construction and the decoder, along with components for speaker adaptation, speaker-adaptive training, unsupervised training, personalized training, and word-root processing.
5) Others
The open-source toolkits listed above are mainly for speech recognition; other open-source speech recognition projects include Kaldi, simon, iATROS-speech, SHoUT, Zanzibar OpenIVR, and more.
Sphinx-4 Application Programmer’s Guide
WARNING: THIS TUTORIAL DESCRIBES SPHINX4 API FROM THE PRE-ALPHA RELEASE:
https://sourceforge.net/projects/cmusphinx/files/sphinx4/5prealpha/
The API described here is not supported in earlier versions.
Overview
Sphinx4 is a pure Java speech recognition library. It provides a quick and easy API to convert speech recordings into text with the help of CMUSphinx acoustic models. It can be used on servers and in desktop applications. Besides speech recognition, Sphinx4 helps to identify speakers, adapt models, align existing transcriptions to audio for timestamping, and more.
Sphinx4 supports US English and many other languages.
Using sphinx4 in your projects
As with any Java library, all you need to do to use sphinx4 is to add its jars to your project's dependencies; then you can write code against the API.
The easiest way to use modern sphinx4 is through a modern build tool like Apache Maven or Gradle. Sphinx-4 is available as a Maven package in the Sonatype OSS repository. To use sphinx4 in your Maven project, specify this repository in your pom.xml:
…
<repositories>
  <repository>
    <id>snapshots-repo</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases><enabled>false</enabled></releases>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>
…
Then add sphinx4-core to the project dependencies:
<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-core</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
Add sphinx4-data to dependencies as well if you want to use default acoustic and language models:
<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-data</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
Many IDEs, such as Eclipse, NetBeans, or IntelliJ IDEA, support Maven either through a plugin or with built-in features. In that case you can simply include the sphinx4 libraries in your project with the help of the IDE. Please check the relevant part of your IDE documentation, for example the IntelliJ IDEA documentation on Maven.
You can also use Sphinx4 in a non-Maven project. In that case you need to download the jars from the repository manually, together with the dependencies (which we try to keep small), and include them in your project. You need the sphinx4-core jar, plus the sphinx4-data jar if you are going to use the US English acoustic model. See the download link below:
Sphinx4 jar download
Basic Usage
There are several high-level recognition interfaces in Sphinx-4:
LiveSpeechRecognizer
StreamSpeechRecognizer
SpeechAligner
For most speech recognition jobs, the high-level interfaces should be enough. Basically, you only have to set up four attributes:
Acoustic model.
Dictionary.
Grammar/Language model.
Source of speech.
The first three attributes are set up through a Configuration object, which is then passed to a recognizer. How the speech source is specified depends on the concrete recognizer; it is usually passed as a method parameter.
Configuration
Configuration is used to supply the required and optional attributes to the recognizer.
Configuration configuration = new Configuration();
// Set path to acoustic model.
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
// Set path to dictionary.
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
// Set language model.
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");
LiveSpeechRecognizer
LiveSpeechRecognizer uses microphone as the speech source.
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
// Start recognition process pruning previously cached data.
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
// Pause recognition process. It can be resumed then with startRecognition(false).
recognizer.stopRecognition();
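For orientation, here is a minimal, self-contained sketch that keeps pulling results from the microphone and prints each hypothesis. The class name LiveRecognizerSketch and the "stop" exit keyword are illustrative choices of this guide edit, not part of the library.

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class LiveRecognizerSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        // Start listening on the microphone, clearing previously cached data.
        recognizer.startRecognition(true);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            String hypothesis = result.getHypothesis();
            System.out.println(hypothesis);
            if (hypothesis.contains("stop")) {  // illustrative exit condition
                break;
            }
        }
        recognizer.stopRecognition();
    }
}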
StreamSpeechRecognizer
StreamSpeechRecognizer uses an InputStream as the speech source. This way you can feed it data from a file, from a network socket, or from an existing byte array.
StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
recognizer.startRecognition(new FileInputStream("speech.wav"));
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();
Please note that the audio for this decoding must be in one of two specific formats:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
or
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
The decoder does not support other formats. If the audio format does not match, you will not get any results; you need to convert the audio to a proper format before decoding. If you want to decode telephone-quality audio with a sample rate of 8000 Hz, you also need to call
configuration.setSampleRate(8000);
You can retrieve multiple results until the end of the file:
while ((result = recognizer.getResult()) != null) {
System.out.println(result.getHypothesis());
}
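Putting the Configuration and StreamSpeechRecognizer pieces together, a minimal, self-contained sketch that transcribes a 16 kHz mono WAV file with the default US English models could look like the following. The file name speech.wav and the class name FileTranscriberSketch are placeholders.

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class FileTranscriberSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        // The placeholder file must be 16 kHz, 16-bit, mono PCM WAV.
        try (InputStream stream = new FileInputStream("speech.wav")) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}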
SpeechAligner
SpeechAligner time-aligns a transcription with the corresponding audio.
SpeechAligner aligner = new SpeechAligner(configuration);
aligner.align(new URL("101-42.wav"), "one oh one four two");
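In the 5prealpha sources, the bundled AlignerDemo treats the return value of align() as a java.util.List of edu.cmu.sphinx.result.WordResult objects carrying word timings. Assuming that return type, the snippet above can be continued like this:

List<WordResult> results = aligner.align(new URL("101-42.wav"), "one oh one four two");
for (WordResult word : results) {
    // Each WordResult carries the aligned word and its time interval.
    System.out.println(word);
}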
SpeechResult
SpeechResult provides access to various parts of the recognition result, such as recognized utterance, list of words with time stamps, recognition lattice and so forth.
// Print utterance string without filler words.
System.out.println(result.getHypothesis());
// Get individual words and their times.
for (WordResult r : result.getWords()) {
System.out.println(r);
}
// Save lattice in a graphviz format.
result.getLattice().dumpDot("lattice.dot", "lattice");
Demos
A number of sample demos are included in the sphinx4 sources to give you an idea of how to run Sphinx4. You can run them from the sphinx4-samples jar:
Transcriber - demonstrates how to transcribe a file
Dialog - demonstrates how to lead dialog with a user
SpeakerID - speaker identification
Aligner - demonstration of audio to transcription timestamping
If you are going to start from a demo, please do not modify it inside the sphinx4 sources; instead, copy the code into your project and modify it there.
Building from source
If you want to develop Sphinx4 itself, you might want to build it from source. Sphinx4 uses the Maven build system. Simply type 'mvn install' in the top folder; it will compile and install everything, including dependencies.
If you are going to use an IDE, make sure it supports Maven projects, then simply import the sphinx4 source tree as a Maven project.
Troubleshooting
You might run into various problems while using sphinx4; please check the FAQ first before asking on the forum.
If you have accuracy issues, provide the audio recording you are trying to recognize, all the models you use, and a description of how the results differ from what you expect.
Sphinx is a speech recognition system with a long history;
it is said to have been the first practical system for recognizing the ten spoken digits.
It was developed at Carnegie Mellon University.
Sphinx-3.x is the latest C-based version; the original Sphinx and Sphinx-2 are no longer worth studying.
Sphinx for PPC is an embedded speech recognition system for the Pocket PC.
Sphinx-4, on the other hand, is a speech recognition system written entirely in Java,
which, thanks to the nature of Java, makes porting between platforms much easier.
Note that Sphinx-3 and Sphinx-4 are not successive versions but parallel ones:
the main difference is that the former is implemented in C and the latter in Java.
For various reasons I have been studying Sphinx-4 for a month now,
and I have read and modified parts of the FrontEnd source code to suit my needs.
1. The Sphinx-4 homepage is
http://cmusphinx.sourceforge.net/sphinx4/
Please bookmark it.
It contains the download links.
The latest release at the time of writing should be sphinx4-1.0beta-bin.zip.
Extract it to e:\sphinx4
(I will use this directory as the example; everything that follows assumes it is the root directory,
and "sphinx4>" stands for this directory at the command prompt.)
Connect a microphone (mic) to your PC, open a command prompt (cmd), and run
sphinx4> java -jar bin/HelloDigits.jar
This is a demo program that recognizes single digits.
2. If you cannot get the digit recognition demo to work
even after checking your pronunciation of the English digits,
something is wrong with your environment. Check the following:
a. Is a Java runtime environment installed? Type java at the command prompt.
If you do not see usage output such as
"Usage: java [-options] class [args… "
the Java runtime environment is not set up correctly.
Go to http://java.sun.com/j2se/1.4.2/download.html,
follow the "Download J2SE JRE" link, and download and install the Java runtime environment.
b. If the demo starts but does not correctly recognize what you say into the microphone,
check that the microphone is connected properly; have a voice chat in QQ, MSN, or Skype for a while
to confirm that your microphone connection is fine.
If you are using Linux or another Unix-like system,
you will need to modify the configuration file before the demo will run correctly.
3. If you have successfully run the demo, you now have a feel for Sphinx-4.
Let's go a little deeper and look at the path from speech to recognition.
The attachment is a schematic diagram of the recognizer.
Below is a brief description of what each module does.
FrontEnd:
The front-end processing module; converting speech into features is done here.
With a little configuration, the FrontEnd can take
a WAV file, a microphone, or even a cepstrum file as input.
Decoder:
The decoder searches the language model and finds the recognition units (usually phonemes) that match the features.
Linguist:
The dictionary translation is "linguist", but it is more aptly described as the linguistic model component.
It consists of three parts:
AcousticModel: the acoustic model, which relates the input sound to phonemes.
Dictionary: the lexicon, which defines the phoneme sequences (pronunciations) that can be accepted.
LanguageModel: the language model, which captures the sequential relationships between words.
These three models are built in advance; I will cover them in detail in the section on model creation.
A short configuration sketch relating them to the current high-level API follows below.
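To connect these three knowledge sources to the newer high-level API described earlier on this page: they correspond to the acoustic model, dictionary, and language model paths on the Configuration object, and a JSGF grammar can stand in for the statistical language model. Here is a minimal sketch under that assumption; the grammar resource path and the grammar name "digits" are hypothetical placeholders.

import edu.cmu.sphinx.api.Configuration;

public class LinguistConfigSketch {
    public static void main(String[] args) {
        Configuration configuration = new Configuration();

        // AcousticModel: relates the incoming sound to phonemes.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        // Dictionary: the accepted pronunciations (word-to-phoneme mapping).
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        // LanguageModel: word-sequence statistics ...
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        // ... or, alternatively, a JSGF grammar (hypothetical resource path and name).
        // configuration.setGrammarPath("resource:/grammars");
        // configuration.setGrammarName("digits");
        // configuration.setUseGrammar(true);
    }
}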
More detailed information can be found in the white paper:
http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf
An Introduction to the Sphinx-4 Speech Recognition System
Project page: http://www.open-open.com/lib/view/home/1324807508858
Reposted from: http://www.xiruibin.com/archives/57