Computational auditory scene analysis

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. [1] In essence, CASA systems are “machine listening” systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.


Since CASA models the human auditory system, a brief description of it is necessary. The auditory periphery acts as a complex transducer that converts sound vibrations into action potentials in the auditory nerve. The outer ear consists of the pinna, the ear canal and the ear drum. The pinna acts as an acoustic funnel and helps in locating the sound source. [2] The ear canal acts as a resonant tube (like an organ pipe) to amplify frequencies between 2 and 5.5 kHz, with a maximum amplification of about 11 dB occurring around 4 kHz. [3] As the organ of hearing, the cochlea consists of two membranes, Reissner’s membrane and the basilar membrane. The basilar membrane responds to an audio stimulus at the place where the stimulus frequency matches the resonant frequency of that region of the membrane. Movement of the basilar membrane displaces the inner hair cells, which encode a half-wave rectified version of the signal as action potentials in the spiral ganglion cells. The axons of these cells make up the auditory nerve, which carries the rectified stimulus. The auditory nerve fibers are frequency-selective, like the basilar membrane. For lower frequencies, the fibers exhibit “phase locking”. Neurons in higher auditory pathway centers are tuned to specific stimulus features, such as periodicity, sound intensity, and amplitude and frequency modulation. [1] ASA also has neuroanatomical correlates in posterior cortical areas, including the posterior superior temporal lobes and the posterior cingulate. The segregation and grouping operations of ASA are affected by Alzheimer’s disease. [4]

System Architecture


Cochleagram

As the first stage of CASA processing, the cochleagram creates a time-frequency representation of the input signal. Mimicking the outer and middle ear, the signal is decomposed into the frequency bands that are naturally selected by the cochlea and hair cells. To model the frequency selectivity of the basilar membrane, a filter bank is used, with each filter associated with a specific point on the basilar membrane. [1]
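In practice such filter banks usually space their center frequencies on the ERB (equivalent rectangular bandwidth) scale of Glasberg and Moore, so that channel density mirrors the basilar membrane's frequency resolution. A minimal sketch in Python (the frequency range and channel count here are illustrative assumptions, not values from the text):

```python
import numpy as np

def erb_space(low_hz=80.0, high_hz=8000.0, n_channels=32):
    """Center frequencies spaced uniformly on the ERB-number scale
    (Glasberg & Moore), as commonly used for gammatone filter banks."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)   # Hz -> ERB number
    inv = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3   # ERB number -> Hz
    return inv(np.linspace(erb(low_hz), erb(high_hz), n_channels))

cfs = erb_space()   # one center frequency per filter-bank channel
```

Spacing on the ERB scale gives dense coverage at low frequencies and sparse coverage at high frequencies, matching the cochlea's place-frequency map better than a linear grid would.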

Since the hair cells produce spike patterns, each filter of the model should produce a similar spike in its impulse response. A gammatone filter provides an impulse response that is the product of a gamma function and a tone. The output of the gammatone filter can be regarded as a measurement of basilar membrane displacement. Most CASA systems represent the firing rate in the auditory nerve rather than individual spikes. To obtain this, the filter bank outputs are half-wave rectified, followed by a square root. (Other models, such as automatic gain controllers, have also been implemented.) The half-wave rectified signal resembles the displacement behavior of the hair cells. Additional models of hair cell transduction can be paired with the gammatone filter bank. [5] Based on the assumption that there are three reservoirs of transmitter substance in each hair cell, and that transmitter is released in proportion to the degree of displacement of the basilar membrane, the release is equated with the probability of a spike being generated in the nerve fiber. This model replicates many of the auditory nerve responses used in CASA systems, such as rectification, compression, spontaneous firing, and adaptation. [1]
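As a sketch of the stages just described, the snippet below builds a single gammatone impulse response (a gamma envelope times a tone) and applies a crude hair-cell stage of half-wave rectification followed by a square root. The sampling rate, filter order and the 1.019 bandwidth constant are common choices from the gammatone literature, not values given in the text:

```python
import numpy as np

def gammatone_ir(cf, fs=16000, order=4, duration=0.025):
    """Gammatone impulse response: gamma envelope t^(n-1)*exp(-2*pi*b*t)
    multiplied by a tone at center frequency cf."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37e-3 * cf + 1.0)    # equivalent rectangular bandwidth (Hz)
    b = 1.019 * erb                      # bandwidth parameter (common choice)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)
    return g / np.max(np.abs(g))

def hair_cell(x):
    """Crude inner-hair-cell stage: half-wave rectification followed
    by square-root compression, as described in the text."""
    return np.sqrt(np.maximum(x, 0.0))

fs = 16000
sig = np.sin(2 * np.pi * 1000 * np.arange(fs // 10) / fs)   # 1 kHz test tone
bm = np.convolve(sig, gammatone_ir(1000, fs), mode="same")  # basilar-membrane output
rate = hair_cell(bm)                                        # simulated firing rate
```

A full cochleagram repeats this for every channel of the filter bank and stacks the resulting rate signals into a time-frequency array.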


Correlogram

The correlogram is an important model of pitch perception, unifying two schools of pitch theory: [1]

  • Place theories (emphasizing the role of resolved harmonics)
  • Temporal theories (emphasizing the role of unresolved harmonics)

The correlation is computed in the time domain by autocorrelating the simulated auditory nerve firing activity at the output of each filter channel. [1] By pooling the autocorrelations across frequency, the position of the peaks in the summary correlogram corresponds to the perceived pitch. [1]
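The pooling step can be sketched with a toy harmonic complex standing in for the rectified outputs of several filter channels; summing the per-channel autocorrelations and picking the dominant lag recovers the fundamental period (the signal parameters here are illustrative assumptions):

```python
import numpy as np

fs = 16000
f0 = 200.0
t = np.arange(int(0.05 * fs)) / fs
# Three half-wave-rectified harmonics of 200 Hz stand in for the
# firing-rate outputs of three filter channels.
channels = [np.maximum(np.sin(2 * np.pi * f0 * k * t), 0.0) for k in (1, 2, 3)]

max_lag = int(fs / 50)      # search pitches down to 50 Hz

def autocorr(x, max_lag):
    """Unnormalized autocorrelation over a range of lags."""
    return np.array([np.dot(x[:len(x) - l], x[l:]) for l in range(max_lag)])

# Pool (sum) the per-channel autocorrelations into a summary correlogram.
summary = sum(autocorr(ch, max_lag) for ch in channels)
min_lag = int(fs / 400)     # skip the zero-lag peak and very high pitches
lag = int(np.argmax(summary[min_lag:])) + min_lag
pitch = fs / lag            # dominant summary peak -> perceived pitch (Hz)
```

Every channel has an autocorrelation peak at the fundamental period (80 samples here), so the summary correlogram peaks there even though the individual channels are tuned to different harmonics.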


Cross-Correlogram

Because the two ears receive an audio signal at slightly different times, the sound source can be localized from the delay between the two signals. [6] By cross-correlating the left and right channels (of the model), coincident peaks can be categorized as the same localized sound, despite their temporal location in the input signal. [1] The use of an interaural cross-correlation mechanism is supported by physiological studies of neurons in the auditory midbrain. [7]
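A minimal single-channel sketch of this idea: delay a noise source at one ear, cross-correlate the two ear signals over a small lag range, and read the interaural time difference off the peak lag (the delay and lag range are illustrative assumptions):

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs // 10)                   # 100 ms noise source
itd = 8                                               # true interaural delay (samples)
left = src
right = np.concatenate([np.zeros(itd), src[:-itd]])   # delayed copy at the right ear

max_lag = 20

def xcorr_at(l):
    """Correlation of left and right; positive l means the right-ear
    signal lags the left-ear signal by l samples."""
    if l >= 0:
        return np.dot(left[:len(left) - l], right[l:])
    return np.dot(left[-l:], right[:len(right) + l])

lags = list(range(-max_lag, max_lag + 1))
scores = [xcorr_at(l) for l in lags]
est_itd = lags[int(np.argmax(scores))]   # peak lag = estimated ITD
```

A full cross-correlogram repeats this per frequency channel and per time frame, so that time-frequency units sharing the same peak lag can be grouped as one localized source.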

Time-Frequency Masks

To segregate the target sound, CASA systems mask the cochleagram. The mask, sometimes a Wiener filter, weights the target source regions and suppresses the rest. [1] The physiological motivation behind the mask comes from auditory masking, in which a sound is rendered inaudible by a louder sound. [8]
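Alongside the Wiener-style soft weighting the text mentions, a widely used simpler variant is the ideal binary mask, which keeps a time-frequency unit whenever the target dominates the interference. A sketch with random toy energies standing in for cochleagram values:

```python
import numpy as np

# Toy time-frequency energy grids (freq x time), standing in for the
# cochleagram energies of a target and an interfering source.
rng = np.random.default_rng(1)
target = rng.random((16, 50))
noise = rng.random((16, 50))

# Ideal binary mask: 1 where the target dominates the interference
# (a 0 dB local SNR criterion), 0 elsewhere.
ibm = (target > noise).astype(float)

mixture = target + noise
masked = mixture * ibm       # suppress interference-dominated units
```

The "ideal" mask requires knowing target and noise energies separately, so real systems estimate it from grouping cues; a Wiener filter replaces the hard 0/1 decision with the soft ratio target / (target + noise).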


Resynthesis

A resynthesis pathway reconstructs an audio signal from a group of segments. By inverting the cochleagram, high-quality resynthesized speech signals can be obtained. [1]


Monaural CASA

Monaural sound separation began with separating voices based on frequency, and many early systems segmented different speech signals by frequency alone. [1] Later models extended this process through the addition of adaptation via state space models, batch processing, and prediction-driven architectures. [9] The use of CASA has improved the robustness of automatic speech recognition (ASR) and speech separation systems. [10]

Binaural CASA

Since CASA models the human auditory pathways, binaural CASA systems extend the monaural model by providing sound localization, auditory grouping and robustness to reverberation using two spatially separated microphones. With methods similar to cross-correlation, these systems are able to extract the target signal from the two input microphones. [11] [12]

Neural CASA Models

Since the biological auditory system is deeply connected to the action of neurons, CASA systems have incorporated neural models into their design. Two different models provide the basis for this area. Von der Malsburg and Schneider proposed a neural network model with oscillators to represent features of different streams (synchronized within a stream and desynchronized across streams). [13] Wang presented a model using a network of excitatory units with a global inhibitor and delay lines to represent the auditory scene in the time-frequency domain. [14] [15]
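The oscillatory-correlation idea can be illustrated with a toy network of phase oscillators: units belonging to the same stream are coupled and synchronize, while separate streams evolve independently. This is a deliberately simplified Kuramoto-style sketch, not the Wang or von der Malsburg model (which use specific oscillator dynamics and a global inhibitor); all parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dt, K = 8, 0.01, 5.0                     # units per stream, step, coupling
phases = rng.uniform(0, 2 * np.pi, 2 * n)   # two streams of n oscillators
groups = [np.arange(n), np.arange(n, 2 * n)]

for _ in range(2000):
    new = phases.copy()
    for g in groups:                        # excitatory coupling within a stream only
        for i in g:
            new[i] += dt * K * np.mean(np.sin(phases[g] - phases[i]))
    phases = new % (2 * np.pi)

def spread(g):
    """Circular spread: 1 - |mean resultant vector|; near 0 when the
    group's oscillators are synchronized."""
    return 1.0 - np.abs(np.mean(np.exp(1j * phases[g])))
```

After the simulation, each group's spread is near zero (within-stream synchrony) while the two groups carry no fixed phase relation to each other, which is the binding signature the oscillatory-correlation account relies on.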

Analysis of Musical Audio Signals

Typical CASA systems start by segmenting sound sources into individual constituents, in an attempt to mimic the physical auditory system. However, there is evidence that the brain does not necessarily process audio input in separate parts, but rather as a mixture. [16] Instead of breaking the audio signal down to individual constituents, musical audio can be described by higher-level descriptors, such as chords, bass and melody, beat structure, and chorus and phrase repetitions. These descriptors run into difficulties in real-world scenarios, with both monaural and binaural signals. [1] Moreover, the estimation of these descriptors is highly dependent on the cultural conventions of the musical input. For example, in Western music, the melody and bass determine the identity of the piece, with the core formed by the melody. By distinguishing the frequency responses of melody and bass, a fundamental frequency can be estimated and filtered for distinction. [17] Chord detection can be implemented through pattern recognition, by extracting low-level features describing the harmonic content. [18] The techniques used in music scene analysis can also be applied to speech recognition and other environmental sounds. [19] Future bodies of work include a top-down integration of audio signal processing, such as a real-time beat-tracking system, and an expansion of the field through the incorporation of auditory psychology and physiology. [20]

Neural Perceptual Modeling

While many models treat the audio signal as a complex combination of different frequencies, modeling the auditory system can also require consideration of its neural components. In a holistic view, a stream (of feature-based sounds) corresponds to neuronal activity distributed over many brain areas, so that the perception of the sound can be mapped and modeled. Two different solutions have been proposed for binding audio perception to activity in the brain. Hierarchical coding models use many cells to encode all possible combinations of features and objects in the auditory scene. [21] [22] Temporal or oscillatory correlation addresses the binding problem by focusing on the synchrony and desynchrony between neural oscillations to encode the binding state of auditory features. [1] These two solutions closely parallel the debate between place coding and temporal coding. A further open question for CASA systems that draw on neural modeling is the extent to which neural mechanisms need to be modeled at all. [23]

See also

  • auditory scene analysis
  • blind signal separation
  • cocktail party problem
  • machine vision
  • speech recognition

Further reading

Rosenthal, D. F. and Okuno, H. G. (1998). Computational auditory scene analysis. Mahwah, NJ: Lawrence Erlbaum.


References

  1. Wang, D. L. and Brown, G. J. (Eds.) (2006). Computational auditory scene analysis: Principles, algorithms and applications. IEEE Press / Wiley-Interscience.
  2. Warren, R. (1999). Auditory Perception: A New Analysis and Synthesis. New York: Cambridge University Press.
  3. Wiener, F. (1947). “On the diffraction of a progressive wave by the human head”. Journal of the Acoustical Society of America, 19, 143-146.
  4. Goll, J., Kim, L. (2012). “Impairments of auditory scene analysis in Alzheimer’s disease”. Brain, 135 (1), 190-200.
  5. Meddis, R., Hewitt, M., Shackleton, T. (1990). “Implementation details of a computational model of the inner hair-cell/auditory-nerve synapse”. Journal of the Acoustical Society of America, 87 (4), 1813-1816.
  6. Jeffress, L. A. (1948). “A place theory of sound localization”. Journal of Comparative and Physiological Psychology, 41, 35-39.
  7. Yin, T., Chan, J. (1990). “Interaural time sensitivity in medial superior olive of cat”. Journal of Neurophysiology, 64 (2), 465-488.
  8. Moore, B. (2003). An Introduction to the Psychology of Hearing (5th ed.). Academic Press, London.
  9. Ellis, D. (1996). “Prediction-driven Computational Auditory Scene Analysis”. PhD thesis, MIT Department of Electrical Engineering and Computer Science.
  10. Li, P., Guan, Y. (2010). “Monaural speech separation based on MASVQ and CASA for robust speech recognition”. Computer Speech and Language, 24, 30-44.
  11. Bodden, M. (1993). “Modeling human sound-source localization and the cocktail-party effect”. Acta Acustica, 1, 43-55.
  12. Lyon, R. (1983). “A computational model of binaural localization and separation”. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1148-1151.
  13. Von der Malsburg, C., Schneider, W. (1986). “A neural cocktail-party processor”. Biological Cybernetics, 54, 29-40.
  14. Wang, D. (1994). “Auditory stream segregation based on oscillatory correlation”. Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, 624-632.
  15. Wang, D. (1996). “Primitive auditory segregation based on oscillatory correlation”. Cognitive Science, 20, 409-456.
  16. Bregman, A. (1995). “Constraints on computational models of auditory scene analysis as derived from human perception”. The Journal of the Acoustical Society of Japan (E), 16 (3), 133-136.
  17. Goto, M. (2004). “A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals”. Speech Communication, 43, 311-329.
  18. Zbigniew, R., Wieczorkowska, A. (2010). “Advances in Music Information Retrieval”. Studies in Computational Intelligence, 274, 119-142.
  19. Masuda-Katsuse, I. (2001). “A new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise”. Proceedings of Eurospeech, 1119-1122.
  20. Goto, M. (2001). “An audio-based real-time beat tracking system for music with or without drum-sounds”. Journal of New Music Research, 30 (2), 159-171.
  21. deCharms, R., Merzenich, M. (1996). “Primary cortical representation of sounds by the coordination of action-potential timing”. Nature, 381, 610-613.
  22. Wang, D. (2005). “The time dimension for scene analysis”. IEEE Transactions on Neural Networks, 16 (6), 1401-1426.
  23. Bregman, A. (1990). Auditory Scene Analysis. Cambridge, MA: MIT Press.