DeepSpeech2 is a speech recognition model released by Baidu AI Lab in 2015.
Like the original DeepSpeech, DeepSpeech2 is an end-to-end model. Compared with the first version, the main improvements of the new paper are:

  • A revised architecture: convolutional layers and BatchNorm are introduced, along with a newly designed Lookahead Convolution
  • A discussion of Mandarin speech recognition is added
  • HPC-based multi-GPU training acceleration

Paper links:

DeepSpeech: Deep Speech - Scaling up end-to-end speech recognition (https://arxiv.org/pdf/1412.5567.pdf)
DeepSpeech2: Deep Speech 2 - End-to-End Speech Recognition in English and Mandarin (http://proceedings.mlr.press/v48/amodei16.pdf)

Model Architecture

The DeepSpeech2 architecture is shown in the figure below.

(Figure: DeepSpeech2 model architecture)

The model input is a spectrogram, followed first by several 1D or 2D convolutional layers. The paper's experiments show that 1D convolutions bring little benefit, while 2D convolutions help noticeably on noisy data but offer little gain on clean data.

Next come several (bidirectional) RNN layers. The experiments show that, with the help of BatchNorm, even plain vanilla RNNs perform well.

Finally, the classic CTC loss is applied.

The model's final objective also incorporates an auxiliary language model. Let $\mathrm{wc}(\cdot)$ denote the number of words/characters; the final expression is

$$Q(y) = \log p_{\mathrm{CTC}}(y \mid x) + \alpha \log p_{\mathrm{LM}}(y) + \beta\, \mathrm{wc}(y)$$
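As a plain illustration of how the three terms combine when rescoring a candidate transcription $y$ (the function name and the default $\alpha$, $\beta$ values are placeholders, not values from the paper):

def rescoring_objective(log_p_ctc, log_p_lm, wc, alpha=1.0, beta=1.0):
    """Q(y) = log p_ctc(y|x) + alpha * log p_lm(y) + beta * wc(y).
    alpha and beta are tuned on a held-out set; the defaults here are placeholders."""
    return log_p_ctc + alpha * log_p_lm + beta * wc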

Applying BatchNorm to RNNs

Simply adding BatchNorm after the RNN speeds up convergence but does not help reduce the error, so the paper proposes two variants of applying BatchNorm to RNNs.

An RNN unit can be written as

$$h_t^l = f\left(W^l h_t^{l-1} + U^l h_{t-1}^l + b\right)$$

The first variant applies BatchNorm to the whole pre-activation, before the nonlinearity:

$$h_t^l = f\left(\mathcal{B}\left(W^l h_t^{l-1} + U^l h_{t-1}^l\right)\right)$$

The second variant applies BatchNorm only in the vertical direction, i.e. to the input-to-hidden term, leaving the recurrent term untouched:

$$h_t^l = f\left(\mathcal{B}\left(W^l h_t^{l-1}\right) + U^l h_{t-1}^l\right)$$
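A minimal Keras sketch of this second variant, assuming it is acceptable to precompute $\mathcal{B}(W^l h_t^{l-1})$ for all time steps and let a custom cell add only the recurrent term (the class and function names are my own, not from the paper or the code below):

import tensorflow as tf
from tensorflow.keras import layers


class BNRNNCell(layers.Layer):
    """Vanilla RNN cell for the 'vertical-only' BatchNorm variant:
    the cell receives B(W x_t) as its input and only adds the recurrent
    term, i.e. h_t = f(B(W x_t) + U h_{t-1} + b)."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units  # required by tf.keras.layers.RNN

    def build(self, input_shape):
        self.U = self.add_weight(shape=(self.units, self.units),
                                 initializer='orthogonal', name='U')
        self.b = self.add_weight(shape=(self.units,),
                                 initializer='zeros', name='b')
        super().build(input_shape)

    def call(self, inputs, states):
        h_prev = states[0]
        h = tf.nn.relu(inputs + tf.matmul(h_prev, self.U) + self.b)
        return h, [h]


def bn_rnn_block(x, units=512):
    """Sequence-wise BatchNorm RNN block (a sketch, not the paper's code):
    W x_t and its BatchNorm are computed over the whole (batch, time) extent
    outside the recurrence, then the custom cell adds U h_{t-1}."""
    wx = layers.Dense(units, use_bias=False)(x)  # W h_t^{l-1} for every time step
    wx = layers.BatchNormalization()(wx)         # B(.), statistics over batch and time
    return layers.RNN(BNRNNCell(units), return_sequences=True)(wx)

Computing the BatchNorm outside the recurrence is what makes the normalization sequence-wise: the statistics are taken over the minibatch and all time steps at once.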

Lookahead Convolution

Bidirectional RNN models are hard to deploy in low-latency settings, yet using only unidirectional RNNs degrades recognition accuracy, so the paper proposes the Lookahead Convolution layer.

(Figure: Lookahead Convolution)

As shown in the figure, the lookahead convolution is applied after each unidirectional RNN layer:

$$r_{t,i} = \sum_{j=1}^{\tau} W_{i,j}\, h_{t+j-1,i}, \qquad 1 \le i \le d$$

where $W\in \mathbb{R}^{(d,\tau)}$ is the weight matrix.
In other words, the lookahead conv layer linearly combines the RNN outputs from future time steps, replacing the backward RNN.
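The implementation further below skips this layer, so here is a minimal sketch of how it could look in Keras, assuming zero-padding of $\tau-1$ future frames and one depthwise filter per channel (the class name and the default $\tau=5$ are my own choices):

import tensorflow as tf
from tensorflow.keras import layers


class LookaheadConv(layers.Layer):
    """Lookahead convolution sketch: r_{t,i} = sum_{j=1..tau} W_{i,j} * h_{t+j-1,i}.
    Each channel i only mixes its own values over the next tau time steps."""

    def __init__(self, tau=5, **kwargs):
        super().__init__(**kwargs)
        self.tau = tau

    def build(self, input_shape):
        d = int(input_shape[-1])
        # W in R^{(d, tau)}, stored as a depthwise filter of shape (tau, 1, d, 1)
        self.W = self.add_weight(shape=(self.tau, 1, d, 1),
                                 initializer='glorot_uniform', name='W')
        super().build(input_shape)

    def call(self, x):
        # zero-pad tau - 1 future frames so the output keeps the input length
        x = tf.pad(x, [[0, 0], [0, self.tau - 1], [0, 0]])
        x = tf.expand_dims(x, axis=2)                          # (batch, time, 1, d)
        r = tf.nn.depthwise_conv2d(x, self.W, strides=[1, 1, 1, 1], padding='VALID')
        return tf.squeeze(r, axis=2)                           # (batch, time, d)

In a streaming model it would sit after each unidirectional RNN layer, in place of the Bidirectional wrappers used in the code below.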

Model Training

SortaGrad

To address the instability of the CTC loss, the paper proposes SortaGrad.

CTC often ends up assigning near-zero probability to very long transcriptions making gradient descent quite volatile

In short, very long utterances destabilize CTC training, especially early on.

SortaGrad therefore trains on shorter utterances first: in practice, the first epoch iterates over the training set sorted by utterance length in ascending order, while subsequent epochs shuffle it randomly.
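A minimal sketch of that schedule (the function and argument names are hypothetical, not from the paper):

import numpy as np


def sortagrad_order(utt_lengths, epoch, seed=0):
    """Return the sample order for one epoch following SortaGrad:
    epoch 0 goes shortest-to-longest, later epochs are randomly shuffled.
    utt_lengths is assumed to hold the length of each training utterance."""
    idx = np.arange(len(utt_lengths))
    if epoch == 0:
        return idx[np.argsort(utt_lengths)]
    rng = np.random.default_rng(seed + epoch)
    rng.shuffle(idx)
    return idx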

Adjustments for the Mandarin Dataset

Only two main training adjustments are made for Mandarin speech recognition:

  • A character-level language model is used, with a vocabulary of roughly 6000 characters
  • The Roman alphabet is added to the vocabulary

That so few changes are needed suggests the model transfers well from one language to another.

Dataset Construction

The paper uses a CTC-trained RNN to segment long raw audio recordings into utterances.
This step is not described in detail in the paper, and plenty of ready-made datasets exist nowadays; see the paper if you are interested.

Code Implementation

This is only a simplified implementation; the lookahead conv is not included.

""" Deep Speech2
paper: Deep Speech 2 - End-to-End Speech Recognition in English and Mandarin
see: http://proceedings.mlr.press/v48/amodei16.pdf

see also: Deep Speech - Scaling up end-to-end speech recognition
https://arxiv.org/pdf/1412.5567.pdf
"""

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Bidirectional, GRU, BatchNormalization, Conv2D, Activation, Reshape
from tensorflow.keras.utils import Sequence
import tensorflow.keras.backend as K
import numpy as np
import math
from keras_tf.ASR.preprocessor import LJSpeechPreprocessor


class Dataloader(Sequence):
    """dataloader for CTC model"""
    def __init__(self, wavs_list, target_sequence, n_mels, batch_size=64):
        self.wavs_list = wavs_list
        self.targets = target_sequence

        self.n_mels = n_mels
        self.batch_size = batch_size

        self.fnum = len(wavs_list)

    def __len__(self):
        return math.ceil(self.fnum / self.batch_size)

    def __getitem__(self, idx):
        st = idx * self.batch_size
        ed = min((idx + 1) * self.batch_size, self.fnum)

        targets = self.targets[st:ed, :]  # shape (samples, length)
        inputs = LJSpeechPreprocessor.getSpectrograms(
            self.wavs_list[st:ed], self.n_mels
        )  # shape (samples, mxlen, n_mels)

        return inputs, targets


class DeepSpeech2:
    def __init__(self):
        preprocessor = LJSpeechPreprocessor('D:\\wallpaper\\datas\\LJSpeech-1.1', num_samples=None)

        self.wavs_list = preprocessor.getWavsList()
        self.orginal_text = preprocessor.getOriginalText()
        self.target_seq, self.vocab, self.vocab_rev = preprocessor.getTargetSequence()
        self.vocab_size = len(self.vocab.keys())

        self.latent_dim = 128  # number of mel bands fed to the network

        self.model = self.buildNet()

    def buildNet(self):
        inputs = Input(shape=(None, self.latent_dim))
        x = Reshape((-1, self.latent_dim, 1))(inputs)

        # 2D convolutional front end over (time, frequency)
        x = Conv2D(32, kernel_size=[7, 11], strides=[1, 1], padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)

        x = Conv2D(32, kernel_size=[7, 11], strides=[1, 2], padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)

        # fold the frequency and channel axes back into a feature axis
        x = Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

        # stacked bidirectional GRU layers
        x = Bidirectional(GRU(512, return_sequences=True), merge_mode='sum')(x)
        x = Bidirectional(GRU(512, return_sequences=True), merge_mode='sum')(x)
        x = Bidirectional(GRU(512, return_sequences=True), merge_mode='sum')(x)
        x = BatchNormalization()(x)

        x = Dense(256, activation='relu')(x)
        prob = Dense(self.vocab_size + 1, activation='softmax')(x)  # +1 for the CTC blank

        model = Model(inputs, prob)
        model.compile(optimizer='adam', loss=self.CTCLoss)
        return model

    def CTCLoss(self, y_true, y_pred):
        batch_size = tf.shape(y_true)[0]
        pred_length = tf.shape(y_pred)[1]
        label_length = tf.shape(y_true)[1]

        pred_length = pred_length * tf.ones(shape=(batch_size, 1), dtype="int32")
        label_length = label_length * tf.ones(shape=(batch_size, 1), dtype="int32")

        loss = K.ctc_batch_cost(y_true, y_pred, pred_length, label_length)
        return loss

    def trainModel(self, epochs, batch_size=64):
        dataloader = Dataloader(
            self.wavs_list, self.target_seq, self.latent_dim,
            batch_size=batch_size
        )

        self.model.fit(dataloader, epochs=epochs)
        self.test()

    def test(self):
        for i in range(5):
            inputs = LJSpeechPreprocessor.getSpectrograms(
                self.wavs_list[i:i + 1], self.latent_dim
            )

            res = self.recognize(inputs[0:1])
            print('-')
            print('Decoded Sentence:', res)
            print('Ground Truth:', self.orginal_text[i])

    def recognize(self, spect):
        pred = self.model.predict(spect)
        input_len = np.ones(pred.shape[0]) * pred.shape[1]
        decode = K.ctc_decode(pred, input_length=input_len, greedy=True)[0][0]
        output = K.get_value(decode)

        res = ''
        for x in output[0]:
            if x == -1 or x == 0:  # skip decode padding (-1) and the '<unk>' index (0)
                continue
            res += self.vocab_rev[x]

        return res


speechRecognizer = DeepSpeech2()
speechRecognizer.trainModel(epochs=20, batch_size=8)

The data is LJSpeech; the preprocessing is as follows.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
import librosa
import os

class LJSpeechPreprocessor():
    """ load and preprocess the LJSpeech-1.1 dataset
    see: https://keithito.com/LJ-Speech-Dataset/
    """

    def __init__(self, dataDir, num_samples=None):
        self.dataDir = dataDir
        self.metadata = self.readMetadata(num_samples)

    def readMetadata(self, num_samples):
        """Read meta data"""
        fpath = os.path.join(self.dataDir, 'metadata.csv')
        metadata = pd.read_csv(fpath, sep='|', header=None, quoting=3)
        metadata.columns = ['ID', 'Transcription', 'Normalized Transcription']
        metadata = metadata[['ID', 'Normalized Transcription']]  # we only need the normalized transcription
        # metadata = metadata.sample(frac=1.0).reset_index(drop=True)  # shuffle

        if num_samples:
            metadata = metadata[:min(num_samples, metadata.shape[0])]

        return metadata

    def getWavsList(self):
        """get list of file paths of the .wav data"""
        wav_dir = os.path.join(self.dataDir, 'wavs')
        wavs_list = [os.path.join(wav_dir, fname + '.wav') for fname in self.metadata['ID']]
        return wavs_list

    def getOriginalText(self):
        """get original sentences"""
        return self.metadata['Normalized Transcription'].tolist()

    def getTargetSequence(self, SOS='', EOS=''):
        """get tokenized and indexed sentences"""
        target_text = [SOS + txt + EOS for txt in self.metadata['Normalized Transcription']]

        tokenizer = Tokenizer(char_level=True)
        tokenizer.fit_on_texts(target_text)

        target_seq = tokenizer.texts_to_sequences(target_text)
        target_seq = pad_sequences(target_seq, padding='post')

        vocab = tokenizer.word_index
        vocab['<unk>'] = 0

        vocab_rev = dict((id, char) for char, id in vocab.items())

        return target_seq, vocab, vocab_rev

    @staticmethod
    def getSpectrograms(wavs_list, n_mels, norm=True):
        """get the mel spectrogram corresponding to each audio file"""
        spectrograms = []
        for fpath in wavs_list:
            wav, sr = librosa.load(fpath, sr=None)
            spect = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, n_mels=n_mels)
            spect = np.transpose(spect)  # (frames, n_mels)
            if norm:
                # per-frame mean/std normalization
                mean = np.mean(spect, 1).reshape((-1, 1))
                std = np.std(spect, 1).reshape((-1, 1))
                spect = (spect - mean) / std
            spectrograms.append(spect)

        # pad to the longest utterance; keep float dtype (the default would cast to int32)
        spectrograms = pad_sequences(spectrograms, padding='post', dtype='float32')
        return spectrograms
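A quick usage sketch of the preprocessor (the relative dataset path and num_samples=4 are assumptions):

pre = LJSpeechPreprocessor('./LJSpeech-1.1', num_samples=4)
wavs = pre.getWavsList()
target_seq, vocab, vocab_rev = pre.getTargetSequence()
spects = LJSpeechPreprocessor.getSpectrograms(wavs, n_mels=128)

print(spects.shape)      # (4, max_frames, 128): padded spectrograms fed to the network
print(target_seq.shape)  # (4, max_target_len): padded character index sequences
print(len(vocab))        # character vocabulary size (including '<unk>')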