A comparative study of eight human auditory models of monaural processing
Abstract: A number of auditory models have been developed using diverging approaches, either physiological or perceptual, but they share comparable stages of signal processing, as they are inspired by the same constitutive parts of the auditory system. We compare eight monaural models that are openly accessible in the Auditory Modelling Toolbox. We discuss the considerations required to make the model outputs comparable to each other, as well as the results for the following model processing stages or their equivalents: Outer and middle ear, cochlear filter bank, inner hair cell, auditory nerve synapse, cochlear nucleus, and inferior colliculus. The discussion includes a list of recommendations for future applications of auditory models.
Explain it Like I'm 14
What this paper is about
This paper compares eight computer models that try to mimic how one human ear (monaural processing) turns sound into brain signals. The models represent different parts of the hearing pathway, from the outer ear and middle ear through the cochlea, inner hair cells, and auditory nerve to early brain areas, and they were all tested in the same way so we can see how similar or different their "hearing" is.
The big questions the researchers asked
- How do different hearing models handle the same sounds at each key stage of the hearing pathway?
- Do they respond similarly to quiet vs. loud sounds?
- How do they deal with fast sound details (temporal fine structure) and slower changes like beats and rhythms (temporal envelope)?
- What needs to be adjusted so their outputs can be fairly compared?
- Based on the comparison, what should people keep in mind when choosing a model for future work?
How they did the comparison (explained simply)
The team used the Auditory Modelling Toolbox (AMT), which is an open-source collection of hearing models. They grouped the models into three “families” based on how detailed they are:
- Biophysical models: Very detailed, like building the ear from tiny parts. They simulate how fluid and structures in the cochlea interact. Think of a complex machine with many connected parts.
- Phenomenological models: Medium detail. They are designed to match measured nerve and cochlea behavior using clever shortcuts. Think of a good imitation that captures key behaviors without modeling every tiny part.
- Functional-effective models: Simpler and fast. They aim to predict hearing performance (like what a listener can hear) rather than exact neural activity. Think of a “good-enough” graphic equalizer plus some smart rules.
To keep things fair, the same sounds were fed into all models:
- Steady pure tones at 500 Hz and 4000 Hz (low vs. high pitch)
- White noise (like static) covering a wide range of frequencies
- Different loudness levels: 40, 70, and 100 dB SPL
They then looked at the outputs after key stages:
- Cochlear filtering (how the ear splits sound into frequency bands—like a detailed equalizer)
- Inner hair cell (IHC) processing (turning motion into electrical signals)
- Auditory nerve (how signals are sent to the brain, including “adaptation” as the nerve gets used to a sound)
- Subcortical brain processing (early brain areas, especially the cochlear nucleus and inferior colliculus, which are sensitive to rhythms or “modulations” in sound)
They also aligned settings (like levels and filter parameters) so differences were due to the models themselves, not mismatched inputs.
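To make the calibration step concrete, here is a minimal sketch (not the AMT code) of how a pure tone can be generated at a target sound pressure level, assuming the usual 20 µPa reference pressure, a 44.1 kHz sampling rate, and signals expressed in pascals; duration and sampling rate are illustrative choices.

```python
import numpy as np

def pure_tone(freq_hz, level_db_spl, dur_s=1.0, fs=44100, p_ref=20e-6):
    """Sine tone whose RMS pressure corresponds to level_db_spl re 20 uPa."""
    t = np.arange(int(dur_s * fs)) / fs
    rms_target = p_ref * 10 ** (level_db_spl / 20)     # desired RMS in pascals
    tone = np.sin(2 * np.pi * freq_hz * t)
    return tone / np.sqrt(np.mean(tone ** 2)) * rms_target

# The two test frequencies and three test levels used in the comparison
for f in (500, 4000):
    for level in (40, 70, 100):
        x = pure_tone(f, level)
        rms = np.sqrt(np.mean(x ** 2))
        print(f, level, round(20 * np.log10(rms / 20e-6), 1))   # recovers ~level dB SPL
```

The same scaling idea applies to the white-noise stimulus: scale its RMS to the target pressure before feeding it to a model.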
A quick look at the models compared
| Model label | Family (type) | Key idea |
|---|---|---|
| dau1997 | Functional-effective (linear) | Simple, fast filters and modulation analysis |
| zilany2014 | Phenomenological | Dynamic filters + nerve adaptation; can feed brain-stage model |
| bruce2018 | Phenomenological | Updated nerve synapse; often used with the same brain-stage model |
| verhulst2015 | Biophysical | Nonlinear transmission-line cochlea; detailed hair cell and nerve |
| verhulst2018 | Biophysical | Extended, more detailed inner hair cell model |
| king2019 | Functional-effective (nonlinear) | Adds compression like automatic volume control, mainly on the “on-frequency” channel |
| relanoiborra2019 | Functional-effective (nonlinear, DRNL) | Dual-path filters that capture nonlinear cochlear behavior |
| osses2021 | Functional-effective (linear) | Clean, level-calibrated pipeline with modulation filters |
What they found and why it matters
1) Middle ear filtering changes “how loud” the cochlea sees the sound
Different models use different middle-ear filters. This matters because it shifts where compression (the ear’s automatic volume control) starts. Models with higher middle-ear gain push the cochlea into compression at lower input levels; lower gain delays compression. If you compare models without accounting for this, you might misinterpret their behavior.
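As an illustration of that interaction, here is a hedged sketch assuming a simple "broken-stick" input/output function whose knee sits at a fixed level at its own input; the knee level, compression ratio, and middle-ear gain values are invented for the example, not taken from any of the eight models.

```python
import numpy as np

def broken_stick_db(level_db, knee_db=30.0, ratio=0.25):
    """Toy cochlear input/output function in dB: linear below the knee,
    compressive (slope = ratio dB per dB) above it."""
    return np.where(level_db <= knee_db,
                    level_db,
                    knee_db + ratio * (level_db - knee_db))

eardrum_db = np.arange(0, 101, 20.0)
for me_gain_db in (0.0, 10.0):          # two hypothetical middle-ear gains
    out_db = broken_stick_db(eardrum_db + me_gain_db)
    # With 10 dB more middle-ear gain, compression starts 10 dB earlier
    # when measured at the eardrum.
    print(me_gain_db, 30.0 - me_gain_db, out_db)
```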
2) Cochlear filtering and compression: on-frequency vs. off-frequency
- On-frequency channels (the cochlear filter tuned to the tone’s frequency) often showed compression: as you increase input level, output grows less than linearly—like an automatic volume limiter.
- Off-frequency channels (nearby filters) were usually more linear (output grows more proportionally), which matches biology. However, a couple of effective models also showed compression off-frequency, which can lead to unrealistic level balances between filters if not carefully set.
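One common way to quantify this is the growth of output level with input level (in dB per dB): values near 1 indicate roughly linear behavior, values well below 1 indicate compression. A minimal sketch, assuming you already have filter-output levels at two input levels; the numbers below are invented for illustration only.

```python
def compression_slope(out_hi_db, out_lo_db, in_hi_db=70.0, in_lo_db=40.0):
    """Growth of output level with input level, in dB per dB."""
    return (out_hi_db - out_lo_db) / (in_hi_db - in_lo_db)

# Hypothetical filter-output levels (dB) at 40 and 70 dB SPL inputs:
on_freq = compression_slope(out_hi_db=52.0, out_lo_db=40.0)    # 0.4 dB/dB: compressive
off_freq = compression_slope(out_hi_db=68.0, out_lo_db=39.0)   # ~0.97 dB/dB: nearly linear
print(on_freq, off_freq)
```

At very high levels, some models also showed distortions in the frequency response (especially in the high-frequency tails of the filters). That's a side effect of strong compression and depends on how the nonlinear stage is implemented.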
3) Frequency selectivity (how sharp the filters are) changes with level
Filter sharpness is often measured with a “Q factor”: higher Q means sharper tuning.
- At lower levels (40 dB), biophysical and phenomenological models matched sharper tuning curves often used in physiological studies, while effective models matched broader tuning curves commonly used in perceptual models.
- As sounds got louder (70 to 100 dB), biophysical and phenomenological models’ filters broadened (Q dropped), especially at lower frequencies. That’s realistic: real cochlear filters get wider as level increases.
- Many effective models stayed nearly the same across levels (they’re simpler and often level-independent inside their main passband).
- One nonlinear effective model (king2019) mainly compresses the on-frequency channel, so its -3 dB bandwidth didn’t change much, even though there was broadening outside that core region.
This matters because filter sharpness affects how well the model separates nearby sounds (like notes in music or consonants in speech).
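For readers who want to put numbers on "sharpness": one standard reference is the equivalent rectangular bandwidth (ERB) of Glasberg and Moore (1990), ERB(f) = 24.7 · (4.37 · f/1000 + 1) Hz, with Q_ERB = f / ERB. The sketch below uses that published formula as a reference for the broader, perception-based tuning; it is not the output of any specific model in the study.

```python
def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth, Glasberg & Moore (1990)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

for fc in (500, 1000, 4000):
    print(f"fc = {fc:4d} Hz  ERB = {erb_hz(fc):6.1f} Hz  Q_ERB = {fc / erb_hz(fc):4.1f}")
# Q_ERB rises slowly with frequency (about 6.4 at 500 Hz, 8.8 at 4 kHz);
# in level-dependent models these values drop as the input level increases.
```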
4) How many filters do you need?
If filters get wider at high levels, you need fewer of them to cover the frequency range without gaps. Biophysical models at 100 dB had much wider filters (so fewer were needed to cover the range). Effective models generally needed more filters to achieve the same overlap. This affects speed and memory: fewer filters mean faster computation, but only if the behavior matches your study needs.
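A rough way to see the filter-count argument, assuming ERB-like filters spaced one (or more) bandwidths apart: the ERB-number (Cam) expression is the standard Glasberg and Moore (1990) formula, and the "widening" factors standing in for louder levels are arbitrary example values.

```python
import numpy as np

def erb_number(f_hz):
    """ERB-number (Cam) scale, Glasberg & Moore (1990)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

f_lo, f_hi = 125.0, 8000.0
n_erbs = erb_number(f_hi) - erb_number(f_lo)      # roughly 29 ERBs across this range
for widening in (1.0, 2.0, 3.0):                  # filters 1x, 2x, 3x wider
    n_filters = int(np.ceil(n_erbs / widening))   # one filter per `widening` ERBs
    print(f"bandwidth x{widening:.0f}: about {n_filters} filters to tile {f_lo:.0f}-{f_hi:.0f} Hz")
```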
5) Inner hair cell and auditory nerve: envelope extraction and adaptation
- Simple IHC stages act like envelope detectors: they keep slower changes (the “outline” of the sound) and reduce phase details at high frequencies.
- The more detailed IHC in verhulst2018 models the biophysics of hair cells more closely (three-channel Hodgkin–Huxley style), which can capture richer behavior.
- Auditory nerve stages include adaptation: the nerve fires a lot at sound onset, then settles down. Models simulate fibers with high, medium, or low “spontaneous rates” to reflect the variety found in biology. The researchers standardized fiber mixes and repeated simulations where needed to get stable results.
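As a concrete (and deliberately simplified) example of the first point above, here is a half-wave rectifier followed by a low-pass filter, the kind of envelope extraction used by the effective models; the 1 kHz cutoff and second-order filter are typical illustrative choices, not the exact parameters of any one model.

```python
import numpy as np
from scipy.signal import butter, lfilter

def simple_ihc(x, fs, cutoff_hz=1000.0, order=2):
    """Toy inner-hair-cell stage: half-wave rectification + low-pass filtering."""
    rect = np.maximum(x, 0.0)                      # half-wave rectification
    b, a = butter(order, cutoff_hz / (fs / 2.0))   # low-pass, cutoff re Nyquist
    return lfilter(b, a, rect)

fs = 44100
t = np.arange(int(0.05 * fs)) / fs
for f in (500, 4000):
    y = simple_ihc(np.sin(2 * np.pi * f * t), fs)
    # AC/DC ratio of the steady-state output: close to 1 at 500 Hz (waveform
    # largely kept), much smaller at 4 kHz (mainly the envelope survives).
    print(f, round(np.std(y[-1000:]) / np.mean(y[-1000:]), 2))
```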
6) Early brain processing: rhythm detectors tuned near 80 Hz
All model families can feed into a stage that’s sensitive to sound modulations (the “beats” or rhythmic envelope). The biophysical and phenomenological models often used the SFIE circuit (Same-Frequency Inhibition-Excitation), which behaves like a broad modulation filter with a best rate near 80 Hz. Effective models use modulation filter banks (sets of rhythm detectors) with different tunings and ranges. This is important for predicting how speech clarity and certain sound features are tracked by the brain.
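To make "rhythm detector" concrete: a modulation filter is simply a band-pass filter applied to the envelope rather than to the waveform. Below is a minimal sketch assuming a band-pass centered at 80 Hz; the envelope sampling rate, center frequency, and bandwidth are illustrative, and the SFIE stage and the models' modulation filter banks each have their own, different parameterizations.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 1000                              # envelope sampling rate (illustrative)
fc_mod, q = 80.0, 1.0                  # broad "rhythm detector" centered near 80 Hz
bw = fc_mod / q
b, a = butter(2, [(fc_mod - bw / 2) / (fs / 2),
                  (fc_mod + bw / 2) / (fs / 2)], btype="bandpass")

t = np.arange(fs) / fs                 # 1 s of envelope
for f_mod in (4, 16, 80, 300):         # modulation rates to test (Hz)
    env = 1.0 + 0.5 * np.sin(2 * np.pi * f_mod * t)   # modulated envelope
    out = lfilter(b, a, env)
    print(f_mod, round(float(np.sqrt(np.mean(out[fs // 2:] ** 2))), 3))
# The 80 Hz modulation passes almost unchanged; slower and faster rates are attenuated.
```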
What this means in practice
- Choose the model that fits your goal:
- Want biological realism (for neuroscience)? Biophysical or phenomenological models.
- Want speed and good enough predictions of listening performance (for engineering or hearing-aid algorithms)? Functional-effective models.
- Be careful with levels and calibration. Middle-ear settings can shift where compression starts, changing model behavior.
- Check whether you need level-dependent filter behavior. Real ears have filters that broaden with loudness; many simple models don’t.
- For tasks involving rhythm or speech envelopes, make sure the modulation stage matches the rates you care about (e.g., ~4–16 Hz for speech syllable rates vs. ~80 Hz for certain brainstem sensitivities).
Why this research is useful
- It gives a clear, side-by-side look at eight widely used hearing models, showing where they agree and where they differ.
- It highlights the trade-off between realism and speed.
- It offers practical tips on configuration and comparability, encouraging reproducible research with open tools (AMT).
- The insights can guide better choices in:
- Designing hearing aids and audio processing
- Building speech intelligibility predictors
- Planning neuroscience experiments and simulations
- Creating machine-hearing systems that respect human hearing limits
In short, this study helps researchers and engineers pick and tune the right “digital ear” for their job, and reminds everyone to be careful when applying a model outside the conditions it was built or tested for.