Recurrent neural networks (RNN) for speech detection
A recurrent neural network (RNN) is built around connections along the temporal sequence: it is a class of artificial neural networks in which hidden states carry information forward in time, so that prior outputs can be reused as inputs at the next step. Here, we will discuss how recurrent neural networks are used for speech detection.
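The recurrence described above can be sketched in a few lines. This is a minimal vanilla-RNN step, not any specific model from this text; the dimensions (13 inputs per frame, 8 hidden units) and the random dummy frames are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: the previous hidden state h_prev
    re-enters as an input alongside the current frame x_t."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hid = 13, 8          # e.g. 13 coefficients per frame (assumption)
W_x = rng.standard_normal((n_hid, n_in)) * 0.1
W_h = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):   # 5 dummy input frames
    h = rnn_step(x_t, h, W_x, W_h, b)        # hidden state carried forward
```

Because `h` is threaded through every call, the output at each frame depends on the whole temporal context seen so far, which is what lets RNN-based detectors exploit context freely.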
Minimization of the FER (frame error rate)
You need to compare the performance of the different SAD criterion extraction techniques on a FER minimization task. Check the quality of the optimization method by comparing the results obtained with the performance of a baseline SAD system. You can optimize the different SAD systems using the cost functions corresponding to the FER, i.e. (C1, C′1), with a coefficient α = 0.5 (cf. p. 60). The results obtained with and without the smoothing function (that is, without adding buffers or removing short segments and short pauses) are detailed in Table 1.1 and in Figure 1.1, where you can analyze the distribution of errors computed per test file for each system. Table 1.1 - Performance measured by FER with and without the smoothing function (that is, without adding buffers or removing short segments and short pauses). RNN models are less dependent on the smoothing function than the other models, and the CG-LSTM model provides the best performance.
Table 1.1
You can thus see that using the optimization process on targeted data yields much better performance than a generalist system such as gmmSAD. It can also be noted that ANN-based systems are significantly more efficient than systems based on more traditional SAD criteria (a gain of at least 17%). Among the neural techniques, RNNs prove particularly suitable: since they are free to exploit the temporal context, these models are also able to learn a large part of the smoothing function needed to minimize the FER, which is not the case with the MLP. Finally, you can find that the best-performing recurrent network is the CG-LSTM model, which improves the average FER by 2% compared to the standard LSTM model. You can then compare the performance differences between SAD systems based on CG-LSTMs taking as input the raw time signal, the complete log-spectrogram, or the MFCCs, with an identical number of neurons. The performance differences are relatively small. These small gaps demonstrate the great versatility of CG-LSTMs, which can be trained successfully directly on the raw signal despite the very high information redundancy of that signal and the need to exploit a broad temporal context, since the audio signal in this framework is sampled at 8 kHz, i.e. 8000 points per second.
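The smoothing function mentioned above can be made concrete with a small sketch: bridge short pauses, drop short speech segments, then pad the surviving segments with a buffer. The frame thresholds here are illustrative assumptions, not the values used in the experiments.

```python
def smooth_decisions(frames, min_speech=5, min_pause=3, buffer=2):
    """Hypothetical frame-level smoothing for a SAD output.
    frames: list of 0/1 raw decisions (1 = speech); returns a smoothed copy."""
    n = len(frames)
    out = frames[:]
    # 1. bridge interior pauses shorter than min_pause
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            if 0 < i and j < n and (j - i) < min_pause:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    # 2. drop speech segments shorter than min_speech
    i = 0
    while i < n:
        if out[i] == 1:
            j = i
            while j < n and out[j] == 1:
                j += 1
            if (j - i) < min_speech:
                for k in range(i, j):
                    out[k] = 0
            i = j
        else:
            i += 1
    # 3. pad each remaining segment with a buffer on both sides
    padded = out[:]
    for i in range(n):
        if out[i] == 1:
            for k in range(max(0, i - buffer), min(n, i + buffer + 1)):
                padded[k] = 1
    return padded

raw = [0]*3 + [1]*6 + [0]*2 + [1]*6 + [0]*3   # two bursts split by a short pause
smoothed = smooth_decisions(raw)               # one padded segment remains
```

The point made in the text is that an RNN with free access to temporal context can learn most of this post-processing by itself, whereas frame-local models such as an MLP cannot.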
Figure 1.1 - Distributions of the FER computed per file for the different SAD systems, with and without smoothing. RNNs are able to largely learn the smoothing function necessary to minimize the FER. The SAD system producing the lowest error rates (leftmost distribution on the graph) is obtained with the CG-LSTM model.
However, it is important to note the difference in processing time depending on the choice of input representation. Thus, considering both processing time and SAD performance, the MFCCs appear to be the best choice of representation. Table 1.2 -
Table 1.2
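The gap in processing time is easy to see from the sheer size of each input representation. The arithmetic below uses the 8 kHz sampling rate stated in the text; the window/hop/dimension values (10 ms hop, 256-point FFT, 13 MFCCs) are common defaults assumed for illustration, not taken from the experiments.

```python
# Rough per-second input sizes for the three representations compared here.
sr = 8000                    # sampling rate stated in the text: 8 kHz
hop = int(0.010 * sr)        # assumed 10 ms frame shift -> 100 frames/s
frames_per_s = sr // hop

raw_per_s  = sr                               # raw signal: 8000 values/s
n_fft = 256                                   # assumed FFT size
spec_per_s = frames_per_s * (n_fft // 2 + 1)  # log-spectrogram: 100 x 129
mfcc_per_s = frames_per_s * 13                # 13 MFCCs: 100 x 13

print(raw_per_s, spec_per_s, mfcc_per_s)      # 8000 12900 1300
```

Under these assumptions the MFCC front end feeds the network roughly six times fewer values per second than the raw signal, which is consistent with it being the best trade-off between processing time and SAD performance.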
You can also study the contribution of each phase of the optimization process, so as to validate the strategy of alternating QPSO and SMORMS3 to optimize a SAD system based on a neural network. Table 1.3 presents the results obtained with different optimization processes, using either QPSO or SMORMS3 alone or a combination of both. Recall that SMORMS3 is only used to optimize the weights of the neural network, and that QPSO, when used after SMORMS3, is only used to optimize the parameters of the decision and smoothing functions. For this test, the SAD system uses the CG-LSTM model to estimate the SAD criterion.
Table 1.3
You can see that using either algorithm alone already gives very good results, but that alternating the two methods significantly improves performance. In particular, using QPSO first to optimize the parameters of the MFCC extraction chain gives the gradient-descent optimization a better starting point. Likewise, you can still gain a little by using QPSO after the gradient descent to readjust the parameters of the decision and smoothing functions.
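The three-phase alternation described above can be sketched as a skeleton. The function bodies below are deliberately simplified stand-ins so the sketch runs (a trivial random search in place of QPSO, finite-difference descent in place of SMORMS3) and the toy objective is an assumption; only the phase ordering reflects the strategy in the text.

```python
import random

def qpso_optimize(params, objective, iters=20):
    """Stand-in for QPSO: a trivial accept-if-better random search."""
    random.seed(0)
    best, best_cost = params, objective(params)
    for _ in range(iters):
        cand = [p + random.uniform(-0.5, 0.5) for p in best]
        c = objective(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best

def smorms3_optimize(weights, objective, iters=20):
    """Stand-in for SMORMS3: plain finite-difference gradient descent
    on the network weights, for illustration only."""
    eps, lr = 1e-4, 0.1
    w = list(weights)
    for _ in range(iters):
        grad = [(objective(w[:i] + [w[i] + eps] + w[i+1:]) - objective(w)) / eps
                for i in range(len(w))]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def cost(front_end, weights):
    # toy stand-in for the FER of the full SAD system
    return sum((f - 1.0) ** 2 for f in front_end) + sum(wi ** 2 for wi in weights)

# Phase 1: QPSO tunes the MFCC extraction chain (front-end parameters).
front_end = qpso_optimize([0.0, 0.0], lambda f: cost(f, [1.0, 1.0]))
# Phase 2: SMORMS3 trains the neural-network weights.
weights = smorms3_optimize([1.0, 1.0], lambda w: cost(front_end, w))
# Phase 3: QPSO readjusts the decision/smoothing parameters afterwards.
front_end = qpso_optimize(front_end, lambda f: cost(f, weights))
```

The division of labor mirrors the text: the gradient-based method only ever touches the network weights, while QPSO handles the non-differentiable front-end and decision/smoothing parameters before and after it.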
Minimization of the WER (word error rate)
The primary goal being to minimize the WER of an automatic speech recognition (ASR) system, you can also determine the values of the coefficient α that minimize the WER when using the cost functions C1 and C2. Table 3.4 presents the actual WER obtained by the ASR system when segmenting the development-set files with a SAD system (CG-LSTM) optimized using the cost function C1 (respectively C2) for different values of the coefficient α.
Table 3.4 - Impact of the coefficient α on the actual WER obtained by the ASR system when the development-set files are segmented with a SAD system optimized using the cost function C1 or C2. The actual WER obtained using the cost function C3, which has no hyperparameter, is shown for comparison.
Table 3.4
You can see that the two cost functions do not behave at all in the same way. For C1, the value of the coefficient α that gives the best error rate is 0.2, which is mainly explained by the fact that the ASR system inserts a lot of words; it is therefore better to reduce the risk of false alarms, even at the cost of losing a little speech signal. On the contrary, in the case of cost function C2, the segmentation references already treat the signal causing insertions as noise, so it becomes much more important to minimize the risk of losing speech signal, even at the cost of some false alarms. This is exactly what we observe, since the coefficient α giving the best WER is 0.95. Finally, the cost function C3 gives a better WER on the development set than the other two cost functions, regardless of the value of the coefficient α chosen. This shows the advantage of a cost function such as C3, which takes into account the behavior of the ASR system whose performance you want to improve and which is as close as possible to the target metric. You can also verify the good correspondence between the cost function C3 and the actual WER obtained on the development set for each system mentioned in Table 3.4: by performing a linear regression, one obtains a coefficient of determination of 0.98, which shows how well the imitation-WER cost function C3 predicts the WER (see also Figure 1.2).
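The role of α in this trade-off can be sketched with a simple weighted cost. This is only in the spirit of C1/C2 (the exact formulas in the original work may differ): α weights the miss rate against the false-alarm rate, so a small α favors rejecting doubtful signal and a large α favors keeping it.

```python
def weighted_cost(decisions, references, alpha):
    """decisions/references: 0/1 per frame (1 = speech).
    Returns alpha * miss_rate + (1 - alpha) * false_alarm_rate.
    Illustrative form only, not the exact C1/C2 definition."""
    n_speech = sum(references) or 1
    n_nonspeech = (len(references) - sum(references)) or 1
    misses = sum(1 for d, r in zip(decisions, references) if r == 1 and d == 0)
    fas = sum(1 for d, r in zip(decisions, references) if r == 0 and d == 1)
    return alpha * misses / n_speech + (1 - alpha) * fas / n_nonspeech

ref = [1, 1, 1, 1, 0, 0, 0, 0]
hyp = [1, 1, 0, 1, 1, 1, 0, 0]     # 1 miss, 2 false alarms
low_alpha  = weighted_cost(hyp, ref, 0.2)    # penalises false alarms
high_alpha = weighted_cost(hyp, ref, 0.95)   # penalises missed speech
```

With α = 0.2 this hypothesis is judged much worse (0.45) than with α = 0.95 (0.2625), matching the intuition above: the best α depends on whether insertions or lost speech hurt the downstream ASR system more.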
Figure 1.2 - Correspondence between the imitation-WER and the actual WER. The coefficient of determination of the linear regression is 0.98, which shows the good capacity of the cost function C3 to predict the WER. Once the optimal value of the coefficient α has been chosen, you can optimize the different SAD systems with the three cost functions and then measure the performance obtained on the test set. Table 3.5 compares the WER obtained on the test set for all SAD systems. They are also compared with the reference SAD system, gmmSAD, which was also used to extract the speech segments for training the ASR system.
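The agreement check reported above is an ordinary least-squares fit plus a coefficient of determination. A minimal version is sketched below; the WER values are made-up illustration numbers, not the paper's data.

```python
def linreg_r2(x, y):
    """Least-squares line y = a*x + b and coefficient of determination R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

imitation_wer = [20.0, 21.5, 23.0, 25.0, 27.5]   # illustrative values only
real_wer      = [20.4, 21.9, 23.6, 25.1, 27.9]
a, b, r2 = linreg_r2(imitation_wer, real_wer)
```

An R² close to 1, as in the reported 0.98, means the imitation-WER ranks and scales the systems almost exactly as the real WER does, which is what justifies optimizing against it.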
You will see that optimizing the SAD systems on representative data with cost functions that come as close as possible to the metric of interest (choice of the coefficient α for C1 and C2, and C3 by construction) always reduces the WER compared to the WER obtained with a generalist segmentation. You will also note that the closer the cost function is to the target metric, the greater the gain, regardless of the SAD system considered. The cost function C3, which has no hyperparameter to choose or adjust, therefore stands out as the cost function of choice for optimizing a SAD system so as to minimize the WER.
On the other hand, as observed in the first experiments, the CG-LSTM model proves to be the most efficient of all the SAD systems tested. Indeed, whatever cost function is used, it is with this model that you obtain the lowest WER. By coupling C3 with the CG-LSTM model, you finally obtain a relative gain of 4.5% compared to the WER obtained with the gmmSAD system that was used during the training of the ASR system. Table 3.5 - Impact of the cost function on the WER obtained with the different SAD systems. The cost function C3 achieves the best performance regardless of the SAD system used, and the CG-LSTM model achieves the best performance regardless of the cost function used. By coupling the two, you obtain a relative gain of 4.5% on the WER compared to the generalist system used to train the ASR system.
Table 3.5
As shown in Table 3.6, the reduction of the error rate essentially comes from a sharp reduction in the number of insertions (from 6.4% to 3.6%) between the gmmSAD reference system and the CG-LSTM-based system optimized with the cost function C3. Looking in detail at the signal segments that generated insertions but were rejected as non-speech by the CG-LSTM-based system, it appears that this model is capable of learning to distinguish speech that generates errors from speech that does not. Table 3.6 - Detailed results of the best tuning for each of the SAD systems. The gains on the WER come mainly from a reduction in the number of insertions.
Table 3.6
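The insertion/deletion/substitution decomposition behind numbers like these comes from a standard Levenshtein alignment between the reference and the hypothesis transcripts. A minimal sketch (the example sentences are invented):

```python
def wer_counts(ref, hyp):
    """ref/hyp: lists of words. Returns (substitutions, deletions, insertions)
    for a minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # d[i][j] = (total_cost, sub, dele, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c = d[i - 1][0]
        d[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])      # all deletions
    for j in range(1, m + 1):
        c = d[0][j - 1]
        d[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)      # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [d[i - 1][j - 1]]                # match, no cost
            else:
                s, dl, it = d[i - 1][j - 1], d[i - 1][j], d[i][j - 1]
                cands = [(s[0] + 1, s[1] + 1, s[2], s[3]),      # substitution
                         (dl[0] + 1, dl[1], dl[2] + 1, dl[3]),  # deletion
                         (it[0] + 1, it[1], it[2], it[3] + 1)]  # insertion
            d[i][j] = min(cands)
    return d[n][m][1:]

ref = "the cat sat on the mat".split()
hyp = "the cat cat sat on mat".split()     # one spurious word, one word lost
sub, dele, ins = wer_counts(ref, hyp)
wer = (sub + dele + ins) / len(ref)
```

Rejecting non-speech segments before recognition removes exactly the material that would otherwise show up in the `ins` column, which is why a better SAD front end lowers the WER mainly through insertions.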
As in the case of FER minimization, you can study the impact of each phase of the optimization process on the final metric, so as to verify the interest of the strategy of alternating QPSO and SMORMS3. Table 3.7 presents the WER obtained with different optimization processes, using either QPSO or SMORMS3 alone or a combination of both. As for FER minimization, you can observe that it is worthwhile to use QPSO to optimize the parameters of the MFCC extraction chain before optimizing the RNN parameters by gradient descent. Likewise, you can further decrease the WER by using QPSO after the gradient descent to readjust the parameters of the decision and smoothing functions. Table 3.7 - Segmentation performance (WER) when the optimization process is changed. For this test, the SAD system uses the CG-LSTM model to estimate the SAD criterion.
Table 3.7
Table 3.8
You will observe that whatever the SAD system used, you manage to reduce the WER of an ASR system compared to the WER obtained with a generalist SAD system, even when the latter was used to build the ASR system. Among the different SAD techniques tested, the hierarchy of the systems is the same as that observed before, and the CG-LSTM model remains the most efficient, enabling a gain of more than 1 WER point on each of the languages processed.
Author: Vicki Lezama