Tone Detection in Turkey: Doing Machine Learning Back in 2008

At 3CLogic, we often had to do VoIP-based media processing when placing a call to a 1800-xxx number, or to any other number. If the far end "seems" like an answering machine, we need to detect and classify it. Normally this kind of DSP work is done by hardware with integrated DSP chips. One such system is shown below (Dialogic® MSP 1010) -

Drawing

But we don't have that luxury. We have a primitive Answering Machine Detection (AMD) system which is primarily based on cadence analysis. Although we do have a Goertzel-based fax tone detection algorithm, up until now we haven't had any sophisticated DSP-based processing as part of the core algorithm. Then one of our customers from Turkey reported a really bad AMD experience while dialing two particular Turkish numbers. All calls to these two numbers should have been classified as Answering Machine, but our software was declaring them Human (live). They had reason to be disgusted. When we analyzed the call flow, this is what we saw -

  • We made a call for an offending number, say P, to a SER (router) server at 88.x.y.z (P@88.x.y.z).
  • The initial 100/180 responses (SIP provisional responses) came back promptly.
  • Then early media was started by a parallel media server at 88.x.y.z1. The media sounded very much like "You have reached Mr. Gordon's residence...", the kind of announcement one usually hears after a call gets connected.
  • After some time the 200 OK (call connected) reached our end, followed by a tone and then silence.
  • Our primitive algorithm took the tone for something like a "Hello", and the silence after it made the whole thing look like "Hello, ...", so the call was declared a potential Human.

To make our life even tougher, the tone came at various frequencies and was not any of the standard DTMF tones. Our Goertzel-based fax detection was no help either, as it was tuned to 2100 Hz (the fax answering tone frequency). What we needed was single-frequency tone detection where the frequency is not fixed. And we needed a quick solution, otherwise a big customer might be on its way out. One of the recorded media streams is shown below -

Drawing

The first challenge was to find a pattern that identifies the tone. We knew we would probably need frequency / spectrum analysis to solve this puzzle. We used Adobe Audition's built-in frequency analysis, and the results were enough to get us going -

Drawing

The peak is clear there. There are various tone detection algorithms around, including SETI's, but we decided to go our own way - a decision which would prove useful later. We already had a license for the Intel IPP libraries - http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm - a high-performance signal processing library. Initially we applied an FFT to the whole recorded WAV sample (tone/voice). The results were disappointing and time was running out fast. We shifted our attention to a windowed FFT with a window size of 1024, as shown in Adobe Audition's frequency analysis panel. The FFT size defines the frequency resolution we can measure -

F.R = sampling rate / FFT size

Now, as we split the whole sample into windows of 1024 samples (within which the signal is not periodic), spectral leakage between neighbouring bins starts kicking in. To reduce such effects, a window function is usually applied to each block before the FFT. After that, obtaining the power spectrum was the obvious option. Let's jot down the steps we went through -

Apply a Hamming window over each 1024-sample block

status = ippsWinHamming_32f_I(in_dbl, windowSize);

// Show message box if status is wrong
IppErrorMessage("ippsWinHamming_32f_I", status);

if (status < 0)
    return;

Apply FFT with order N=10 (i.e. log2(1024))

status = ippsFFTFwd_RToCCS_32f(in_dbl, out, spec, NULL);

// Show message box if status is wrong
IppErrorMessage("ippsFFTFwd_RToCCS_32f", status);

if (status < 0)
    return;

Since the input is a real signal, ignore the other half of the FFT (it only holds the complex conjugates of the first half)

Obtain magnitude vector from FFT

// Magnitude
status = ippsMagnitude_32fc((Ipp32fc*)out, power_spectrum, windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsMagnitude_32fc", status);

if (status < 0)
    return;

Scale / normalize the magnitudes (the code below divides by windowSize / 2, the number of bins we kept).

// Scale the FFT so that it is not a function of the length of x, mx = mx/length(x)

status = ippsDivC_32f_I((Ipp32f)(windowSize / 2), power_spectrum, windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsDivC_32f_I", status);

if (status < 0)
    return;

Calculate power vector, square of scaled magnitude vector

// We need power ~ sqr(mag.)
status = ippsSqr_32f_I(power_spectrum, windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsSqr_32f_I", status);

if (status < 0)
    return;

We were clueless after this. Should we go for a sophisticated peak detection algorithm, or was there room for a KISS strategy? IPP's high-level peak detection function was tempting enough to try -

IppStatus ippsFindPeaks_32f8u(
    const Ipp32f* pSrc,
    Ipp8u* pDstPeaks,
    int len,
    int searchSize,
    int movingAvgSize);

But wait a minute: after scanning a couple of windowed-FFT plots, it seemed that -

  • In a voice window, power is dispersed over the whole FFT spectrum

Drawing

  • In a tone window, power is concentrated in 1 or 2 frequency bins (FFT output array indices for the window), so for our single-frequency tone we may well concentrate on the frequency bin at which the window's maximum power occurs

Drawing

Armed with this simple observation, we decided to do the following -

  • Normalize the power spectrum by dividing the power vector by the total power (only if the total power for the window is non-zero).
  • Get the maximum normalized power of the window.
  • If that value crosses a pre-determined (trial) threshold for some (trial) number of consecutive windows, we say that there may be a single-frequency (SF) tone, with a given degree of probability.
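The three steps above fit in a few lines of plain C. Treat this as a sketch: the type and function names are invented for the example, and the threshold and streak length are left as parameters because the values we settled on came from trial and error on our recordings.

```c
#include <assert.h>
#include <stddef.h>

/* Share of a window's total power held by its strongest bin, in [0, 1].
 * Returns 0 for an all-zero (silent) window. peak_bin may be NULL. */
float peak_power_ratio(const float *power, size_t nbins, size_t *peak_bin)
{
    assert(power != NULL && nbins > 0);

    double total = 0.0;
    size_t best = 0;
    for (size_t k = 0; k < nbins; k++) {
        total += power[k];
        if (power[k] > power[best])
            best = k;
    }
    if (peak_bin != NULL)
        *peak_bin = best;
    return total > 0.0 ? (float)(power[best] / total) : 0.0f;
}

typedef struct {
    float threshold;  /* trial value, tuned on recordings */
    int needed;       /* consecutive windows required */
    int run;          /* current streak of windows above threshold */
} tone_detector;

/* Feed one window's ratio; returns 1 once the streak is long enough. */
int tone_detector_update(tone_detector *det, float ratio)
{
    assert(det != NULL);

    det->run = (ratio > det->threshold) ? det->run + 1 : 0;
    return det->run >= det->needed;
}
```

A voice window spreads its power thinly, so its ratio stays small, while a pure tone pushes the ratio towards 1. Requiring a streak of windows keeps a single lucky vowel from being mistaken for a tone.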

And you know, the strategy paid off beautifully. The idea worked consistently over many recorded samples, including single-frequency tones. We can also print or log the frequency using the following formula -

frq := max_power_bin * F.R
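As a worked example, assume the usual 8 kHz telephony sampling rate (an assumption; the rate depends on the codec in use). A 1024-point FFT then gives F.R = 8000 / 1024 = 7.8125 Hz per bin, so a peak in bin 128 corresponds to 1000 Hz. The helper names below are made up for illustration.

```c
#include <assert.h>

/* Frequency resolution in Hz per bin: F.R = sampling rate / FFT size. */
double freq_resolution_hz(double sample_rate_hz, int fft_size)
{
    assert(sample_rate_hz > 0.0 && fft_size > 0);
    return sample_rate_hz / (double)fft_size;
}

/* frq = max_power_bin * F.R */
double bin_to_hz(int bin, double sample_rate_hz, int fft_size)
{
    assert(bin >= 0);
    return (double)bin * freq_resolution_hz(sample_rate_hz, fft_size);
}
```

The reported frequency is only accurate to within one bin width, which is plenty for logging which tone a carrier is playing.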

But all the samples we had were offline recordings, and we needed to integrate this little DSP logic into our VoIP / media stack (PJSIP), where frames (RTP, 10 / 20 ms of audio) arrive sequentially and our chosen window size (1024) is not a multiple of the frame length. So some handling of residual samples had to be done. After this integration effort and a few rounds of testing, we were good to go.
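The residual-sample handling boils down to a small accumulator that collects incoming frames until a full window is available and carries the leftover into the next window. The sketch below is a simplified illustration, not the PJSIP integration itself: the struct, the callback type and the function name are all hypothetical, and samples are assumed to have been converted to float already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_SIZE 1024

typedef struct {
    float buf[WINDOW_SIZE];
    size_t fill;            /* samples currently buffered */
} window_accumulator;

/* Hypothetical callback invoked for every complete window. */
typedef void (*window_fn)(const float *window, size_t n, void *user);

/* Append one frame of n samples. Calls on_window (if not NULL) for each
 * complete window; leftover samples stay buffered for the next frame.
 * Returns the number of complete windows produced by this frame. */
int accumulator_push(window_accumulator *acc, const float *frame, size_t n,
                     window_fn on_window, void *user)
{
    assert(acc != NULL && frame != NULL && acc->fill < WINDOW_SIZE);

    int emitted = 0;
    while (n > 0) {
        size_t room = WINDOW_SIZE - acc->fill;
        size_t take = n < room ? n : room;

        memcpy(acc->buf + acc->fill, frame, take * sizeof(float));
        acc->fill += take;
        frame += take;
        n -= take;

        if (acc->fill == WINDOW_SIZE) {
            if (on_window != NULL)
                on_window(acc->buf, WINDOW_SIZE, user);
            acc->fill = 0;
            emitted++;
        }
    }
    return emitted;
}
```

With 20 ms frames at an assumed 8 kHz (160 samples each), the seventh frame completes the first window and leaves 96 samples waiting for the next one. The callback is where the windowing, FFT and threshold logic would run.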