2014-01-25

The previous post of this series tried to present arguments why it is a good idea to consider implementing part of your code in RenderScript instead of Java or native code. This part presents the benchmark that we will use to evaluate the technology. I intentionally did not choose an image manipulation algorithm or a convolution filter as the benchmark; I wanted a use case that processes a large amount of data, and sound processing is a quite evident choice.

The algorithm I am going to present is quite naive and would require a lot of refinement for commercial deployment, but my goal was RenderScript benchmarking, so the algorithm's tolerance to all kinds of environmental disturbances was not relevant. We will do voice fingerprinting: the algorithm takes a voice sample array, compares it to a reference sample and returns true or false depending on whether the voice sample is sufficiently similar to the reference. It should return true if the voice sample contains the same word from the same speaker, false otherwise.

I implemented Dynamic Time Warping (DTW) for this purpose. DTW compares two data series assuming that the timing between them may differ. For example, the same speaker may say the same word at a slightly different speed. DTW therefore calculates a so-called warping path, a sequence of pairs of corresponding indices from series1 and series2. E.g. a part of a warping path may be (1,1), (1,2), (2,3), (3,3), meaning that a[1] (the data item at index 1 of series1) corresponds to b[1], a[1] corresponds to b[2], a[2] corresponds to b[3] and a[3] corresponds to b[3]. Knowing the warping path and a cost function, the distance between the two series can be calculated. DTW finds the warping path with the lowest cumulative cost (i.e. the sum of the costs between the corresponding data items along the warping path). A detailed introduction to the algorithm can be found here.
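To make the cumulative cost concrete, here is a minimal Java sketch that sums the cost along the partial path above, assuming for illustration the absolute-difference cost function defined further down; the series values are made up and are not taken from the example program.

// Illustration only: cumulative cost along the partial warping path above.
int[] a = { 7, 3, 5, 6 };                           // series1 (made-up values)
int[] b = { 7, 4, 4, 6 };                           // series2 (made-up values)
int[][] path = { {1, 1}, {1, 2}, {2, 3}, {3, 3} };  // part of a warping path

int cumulativeCost = 0;
for (int[] step : path) {
    // cost of pairing a[i] with b[n], using c(i,n) = abs(a[i] - b[n])
    cumulativeCost += Math.abs(a[step[0]] - b[step[1]]);
}
// cumulativeCost == |3-4| + |3-4| + |5-6| + |6-6| == 3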

This example program implements classic DTW, calculating every possible pairing between the data items of the two series. This requires an NxM matrix where N is the size of series1 and M is the size of series2. DTW(i,n), the element at position (i,n) in this matrix, is the accumulated cost (the sum of the cost function values) of the optimal warping path that leads through position (i,n), i.e. assuming that index i in series1 and index n in series2 correspond to each other. The function calculating DTW(i,n) is thus:

DTW(i,n)=min(DTW(i-1,n),DTW(i-1,n-1),DTW(i,n-1))+c(i,n)

where c(i,n) is the cost between a[i] and b[n]. The example program uses

c(i,n)=abs(a[i]-b[n])

Cells at the edges of the matrix are initialized with the cumulative cost along the first column and the first row, respectively:

DTW(i,0)=c(0,0)+c(1,0)+...+c(i,0)

and

DTW(0,n)=c(0,0)+c(0,1)+...+c(0,n)
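
Putting the recurrence and the initialization together, a straightforward full-matrix version could look like the following sketch. This is illustrative Java of my own, not the code of the attached example program; the method name and the int[] input type are assumptions.

// Sketch of classic DTW with a full N x M matrix. Illustrative only,
// not taken from the attached example program.
static int classicDtw(int[] a, int[] b) {
    int n = a.length;
    int m = b.length;
    int[][] dtw = new int[n][m];

    // DTW(0,0) = c(0,0)
    dtw[0][0] = Math.abs(a[0] - b[0]);

    // Edge cells: cumulative cost along the first column and the first row
    for (int i = 1; i < n; i++) {
        dtw[i][0] = dtw[i - 1][0] + Math.abs(a[i] - b[0]);
    }
    for (int j = 1; j < m; j++) {
        dtw[0][j] = dtw[0][j - 1] + Math.abs(a[0] - b[j]);
    }

    // Inner cells: DTW(i,n) = min(DTW(i-1,n), DTW(i-1,n-1), DTW(i,n-1)) + c(i,n)
    for (int i = 1; i < n; i++) {
        for (int j = 1; j < m; j++) {
            int best = Math.min(dtw[i - 1][j],
                    Math.min(dtw[i - 1][j - 1], dtw[i][j - 1]));
            dtw[i][j] = best + Math.abs(a[i] - b[j]);
        }
    }

    // The optimal cumulative distance ends up in the bottom-right cell
    return dtw[n - 1][m - 1];
}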

Classic DTW can require a large amount of space. Take for example a short voice sample. With a sampling rate of 44100Hz/5=8820Hz, even a short word can easily generate a vector of 5000-8000 elements. If the reference vector is of a similar size, the DTW matrix will consist of 25 million to 64 million elements, all of them integers (4 bytes each). A data structure of that size can easily exhaust the fixed maximum heap size allocated to an Android application. Therefore we apply a simple optimization. The output of our algorithm is the optimal distance, which is found at DTW(N-1,M-1), the bottom-right element of the matrix. In order to calculate the elements in row i, we only need the elements in row i-1 plus the already calculated elements in row i. We can therefore drop row i-2 and calculate DTW(i,0) on the fly when we get there. This reduces the memory requirement to 2*M data items instead of N*M.
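A rough sketch of this two-row variant, again in illustrative Java rather than the example program's actual code:

// Sketch of memory-optimized DTW: only the previous row and the row being
// computed are kept (2 * M ints instead of N * M). Illustrative only.
static int twoRowDtw(int[] a, int[] b) {
    int n = a.length;
    int m = b.length;
    int[] prev = new int[m];
    int[] curr = new int[m];

    // Row 0: cumulative cost along the first row
    prev[0] = Math.abs(a[0] - b[0]);
    for (int j = 1; j < m; j++) {
        prev[j] = prev[j - 1] + Math.abs(a[0] - b[j]);
    }

    for (int i = 1; i < n; i++) {
        // DTW(i,0) is calculated on the fly from DTW(i-1,0)
        curr[0] = prev[0] + Math.abs(a[i] - b[0]);
        for (int j = 1; j < m; j++) {
            int best = Math.min(prev[j], Math.min(prev[j - 1], curr[j - 1]));
            curr[j] = best + Math.abs(a[i] - b[j]);
        }
        // Swap the buffers: the current row becomes the previous row
        int[] tmp = prev;
        prev = curr;
        curr = tmp;
    }

    return prev[m - 1];
}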

DTW, however naive its application in the example program is, satisfies our RenderScript benchmarking needs because:

The kernel function is non-linear, and it is not evident how to break the computation down into sub-ranges (although FastDTW does precisely that).

The calculation of a matrix cell depends on intermediate results of the calculation and not only on the input values.

The example program is attached to the end of this post. You need a free registration to the Sfonge site to access it. The application exercises the DTW algorithm on speech samples. Launch the application and record a reference sample, e.g. a word like "four". Then record the same or a different word from the same or a different speaker and observe the distance that DTW calculates. If the environment is not too noisy, you will find that the same word from the same speaker yields a (normalized) distance below 1000-1200, while other words from the same speaker or the same word from a different speaker yield a higher distance. The calculation takes a bit of time because the DTW algorithm is executed twice, once in Java and once in RenderScript. The application displays the execution time of both implementations.



For now, you can observe that the RenderScript execution time is 4-5 times shorter than the Java execution time, even though the implementation does not exploit any parallelization features. I will go deeper into this observation in the next post.
