Solving the cocktail party problem with a design inspired by the brain

On our recent paper, “Optimized feature gains explain and predict successes and failures of human selective listening,” by Ian Griffith, Josh McDermott, and me
By Preston Hess, May 6th, 2026

The cocktail party problem

It’s Friday night and you’re walking into the hottest new Italian spot with a group of friends. The restaurant is buzzing with sound---a good sign that you are about to have some great food. A hostess guides your group to your table and you take your seats. Once seated, you greet your friends. The conversation is easy. You listen to the person across from you without thinking about it. Then, the table next to you erupts into laughter and gets louder. You realize that someone nearby has a voice that sounds just like your friend’s. Suddenly, you’re completely lost. You smile and nod at your friend across the table but you can’t really hear what they said.

The challenge of following one person’s voice among many other voices and sounds is called the “cocktail party problem.” Neuroscientists have studied it for a long time and have shown that it is sometimes easy and sometimes almost impossible for people to solve. Why is it easy to pay attention to one voice in a noisy room sometimes, and why does it feel impossible other times? We can answer these questions with a different approach: instead of asking when listening succeeds or fails, we can ask what the brain is actually doing when it decides whom to listen to in a buzzy new Italian restaurant on a Friday night.

Before getting into the science, try it yourself. First listen to the cue, or “target voice,” alone. Then listen to the mixture and see if you can follow that same person. (These are a bit quiet, so you may have to turn your volume up to hear them.)

How the brain pays attention

One way the brain makes sense of the world is by breaking it into features. A face, a voice, or a word is not represented by one neuron for every example you see or hear. Instead, many neurons respond to smaller features of the signal such as edges, colors, pitches, textures, locations, and so on. Attention seems to change how strongly some of those feature-detectors matter. How do we know this? We turn to a really cool result from the late 1950s and another one from the late 1990s.

In 1959, two scientists, David Hubel and Torsten Wiesel, were trying to figure out what individual neurons in the visual part of the brain respond to. To do this, they recorded from neurons in the brains of cats. Their recordings suggested that the brain represents objects by representing the “features” that make up those objects¹. For example, if I asked you to look for a green box, you might have some neurons representing features that boxes have, like corners and straight lines, and some neurons representing the color green. Hearing can be thought of in a similar way. A voice has features like pitch, tone, accent, location, rhythm, and other acoustic details that help separate one person from another. A neuron typically has a certain thing that it “likes,” and when you are seeing or hearing that thing, the neuron is very active. Roughly speaking, one neuron might act like a little detector that says, “this feature is present.”

Mock tuning curve of a visual neuron that responds to the color “green”

This also extends to many other animals, including monkeys. In the late ’90s, already knowing that some neurons in monkeys’ brains were sensitive to particular features, the scientists Stefan Treue and Julio Martínez Trujillo asked, “What happens to these neurons when we ask the monkeys to pay attention to the things the neurons prefer?” What they found is, in a word, bananas. When the monkey pays attention to something that the neuron already ‘likes,’ the neuron becomes even more active. Conversely, if the monkey pays attention to something the neuron doesn’t fire for, the neuron becomes less active². The most important takeaway from this study is that the brain does not seem to simply switch those neurons on or off. It scales their activity up or down, like turning a dial. Scientists call this kind of scaling a “multiplicative gain.” In plain English, it means boosting the activity of neurons that care about the thing you are trying to attend to and turning down the ones that don’t.

Attention multiplicatively changes the response of a neuron

Panels: Pay attention to green; No attention; Pay attention to another color

No attention shows the baseline response. Attending to green increases the neuron’s response. Attending elsewhere suppresses it.
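To make “multiplicative gain” concrete, here is a minimal Python sketch of the mock tuning curve above. Everything in it is an assumption made for illustration: the bell-shaped curve, the preferred hue of 120 degrees, the firing rates, and the gain values of 1.3 and 0.7 are all made up, not measurements from the monkey study.

```python
import math

def tuning_curve(hue_deg, preferred_deg=120.0, width_deg=30.0, peak_rate=50.0):
    """Baseline response (spikes/s) of a toy neuron tuned to one hue angle."""
    offset = hue_deg - preferred_deg
    return peak_rate * math.exp(-0.5 * (offset / width_deg) ** 2)

def response_with_attention(hue_deg, gain):
    """Multiplicative gain: every point on the curve is scaled by the same factor."""
    return gain * tuning_curve(hue_deg)

hues = [60, 90, 120, 150, 180]  # 120 is this toy neuron's preferred "green"
baseline = [response_with_attention(h, 1.0) for h in hues]
attend_green = [response_with_attention(h, 1.3) for h in hues]  # hypothetical boost
attend_other = [response_with_attention(h, 0.7) for h in hues]  # hypothetical suppression

for h, low, base, high in zip(hues, attend_other, baseline, attend_green):
    print(f"hue {h:3d}: attend other {low:5.1f} | no attention {base:5.1f} | attend green {high:5.1f}")
```

Notice that the curve keeps its shape and its preferred color. Only its height changes, which is what separates a multiplicative gain from simply adding a fixed amount of activity or switching the neuron on and off.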

Now, while the results of this study help us understand attention’s effect on neurons, they leave us wondering, “If attention works by boosting the features of what we care about, could that alone explain how we listen in noisy environments?”

Let’s try it in a machine

Artificial neural networks are computer models loosely inspired by the brain. They are trained on many examples until they learn to recognize patterns, such as objects in images or words in speech. One useful thing about these models is that, like brains, their “neurons” respond to features. This became especially clear in vision models after the deep learning breakthroughs of the early 2010s, when researchers found that artificial networks trained to recognize images developed feature detectors that looked surprisingly similar to ones studied in brains³.

Features learned by a neural network

Examples of visual features learned by AlexNet

These small images show patterns that neurons of a vision neural network learned to detect after being trained to recognize objects. The model became sensitive to simple visual patterns, like edges, colors, and textures. This helped show that artificial neural networks can develop feature detectors that resemble some of the feature detectors found in the visual region of the brain (Krizhevsky et al., 2012).

Knowing that these features also exist in artificial neural networks opens up the possibility for us to use them as a tool to answer the question we posed at the end of the last section. If we boost the model’s internal features the way attention seems to boost brain responses, can we make the model behave like it is paying attention? Does this work for hearing the same way it works for vision?

There was already a clue that this might work. In 2018, Grace Lindsay and Ken Miller tested a model of feature boosting in vision. They showed a neural network images containing several objects at once. When they boosted the model features associated with one object, the model became more likely to report that object⁴. Their study suggested that feature boosts may be enough to create attention-like behavior in a machine. However, it was still unclear whether the same idea would work for speech, or whether this type of model could actually predict human behavior in new environments.

What we actually did

Our model played a simple listening game. On each trial, it first heard one person speaking alone. This was the cue, or the voice to listen for. Then it heard a mixture where that same person was speaking at the same time as other sounds. The model’s job was to report the word spoken by the target person in the mixture, just like what you tried to do at the beginning of this blog post. This gave us a well-defined way to train the model. We knew the correct answer on every trial, so the model could learn from its mistakes (this is called supervised learning in the machine learning world). This task was inspired by the idea of using the memory of the sound of a voice to pick that voice out later.

The key change in our model was simple: we gave it a way to learn which features to boost and how much to boost them. When the model heard the target talker alone, it could ‘remember’ which internal features became active. Then, when the model heard the mixture, it could boost those same features. In that sense, we let the model try to pay attention in the same way that we think the brain pays attention.

How the model listens for the target voice

Diagram: Cue voice (target speaker alone; these features mark the voice to listen for) → Mixture (target voice plus other sounds; cue-matching features get boosted) → Target word (model’s answer: “tomorrow”)

The model uses the cue voice to decide which internal features matter. When the mixture arrives, features that match the cue are boosted, helping the model recover the target word.
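Here is a toy Python sketch of that idea, under some loud assumptions. This is not the model from the paper: the four “features” and their values are invented, and the `cue_to_gains` function uses a fixed sigmoid with a hand-picked slope, whereas the real model learned which features to boost and by how much during training.

```python
import math

def cue_to_gains(cue_activations, slope=4.0):
    """Turn cue activations into gains between 0 and 2 (hypothetical fixed rule).

    Features that were more active than average during the cue get a gain
    above 1 (boosted); features less active than average get a gain below 1.
    """
    mean = sum(cue_activations) / len(cue_activations)
    return [2.0 / (1.0 + math.exp(-slope * (a - mean))) for a in cue_activations]

def apply_gains(activations, gains):
    """Multiplicative gain: scale each feature's activation by its own gain."""
    return [g * a for g, a in zip(gains, activations)]

# Made-up activations of four internal features for each voice on its own.
target_voice = [0.9, 0.1, 0.8, 0.0]
distractor_voice = [0.1, 0.9, 0.0, 0.8]

# Crude stand-in for the mixture: the two sets of activations added together.
mixture = [t + d for t, d in zip(target_voice, distractor_voice)]

gains = cue_to_gains(target_voice)  # the cue is the target talker alone
attended = apply_gains(mixture, gains)

print("gains:   ", [round(g, 2) for g in gains])
print("mixture: ", [round(m, 2) for m in mixture])
print("attended:", [round(a, 2) for a in attended])
```

Before the gains are applied, the mixture gives the target-like and distractor-like features equal weight. Afterward, the features that matched the cue dominate, which is the head start the later stages of a model would need to report the target’s word.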

Did that actually work?

After the model was trained, we were able to show it new examples that it had not seen during training. It was the same task, but with talkers and mixtures it hadn’t heard before. Then we gave humans the same listening tests, so we could compare the model’s mistakes and successes to real behavior. Like the model, human listeners heard the target voice, then the mixture, and then guessed the target word. In the first set of experiments, these were presented to humans over headphones. To imitate this, we presented the sounds to the model as if they were coming through headphones, too.

We saw some pretty cool things. Overall, the model matched human behavior pretty well when it came to getting the word right. The two plots below show human accuracy and model accuracy for two subsets of the experiment. Before we look at those results, let’s listen to a quick example. In the mixture, the additional voice is called the distractor: it’s the voice you’re supposed to ignore. The example below uses the same target talker as above, but now the target talker is speaking at the same time as a distractor talker of the same sex. Take a listen. It suddenly becomes harder than when the distractor talker was of a different sex, right?

The first result showed that this is also true for the model. When the distractor sounded more like the target, both humans and the model struggled more. Same-sex distractors were harder than different-sex distractors. English distractors were harder than Mandarin distractors for English-speaking listeners. The model showed the same pattern. This makes intuitive sense. The more similar the background voice is to the voice you are trying to attend to, the harder it is to keep them apart.

The model made human-like listening errors

a. Same-sex distractors were harder than different-sex distractors for both humans and the model. b. English distractors were harder than Mandarin distractors for both humans and the model. SNR means signal-to-noise ratio: negative values mean the target voice is quieter than the background, so the listening problem is harder. Hover over dots with your mouse to see their values.
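If you want to see what a particular SNR means in practice, here is a small Python sketch that mixes two signals at a chosen SNR in decibels. The sine tones stand in for voices, and the choice to rescale the distractor (rather than the target) is an assumption for illustration, not a description of how our stimuli were made.

```python
import math

def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(target, distractor, snr_db):
    """Rescale the distractor so 20*log10(rms(target)/rms(distractor)) equals snr_db, then add."""
    scale = rms(target) / (rms(distractor) * 10 ** (snr_db / 20))
    scaled = [scale * d for d in distractor]
    mixture = [t + d for t, d in zip(target, scaled)]
    return mixture, scaled

# Toy "voices": two sine tones at different frequencies and levels.
n = 1000
target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
distractor = [0.2 * math.sin(2 * math.pi * 7 * i / n) for i in range(n)]

mixture, scaled = mix_at_snr(target, distractor, snr_db=-6.0)
measured = 20 * math.log10(rms(target) / rms(scaled))
print(f"measured SNR: {measured:.1f} dB")
```

At -6 dB, the distractor ends up at roughly twice the amplitude of the target, which is why negative SNRs make for harder listening.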

We want the model to match all aspects of human behavior. That means if humans are getting a specific example wrong much more than they are getting it right, the model should also do the same thing. To measure this, we looked at cases with only two talkers. In these cases, humans could make a very particular kind of mistake. Sometimes people did not just guess randomly. They reported the word spoken by the wrong person. We call this particular type of mistake a ‘confusion.’

That kind of error is especially revealing because it means attention selected the distractor instead of the target. As you can see below, the model didn’t just get things right in the same way as humans; it got things wrong in the same way, too!

The model made the same kind of mistakes as humans

Humans and the model were most likely to report the distractor word when the target voice was hardest to hear. As the SNR increased, the target became easier to separate from the distractor, and confusions decreased. SNR means signal-to-noise ratio: negative values mean the target voice is quieter than the distractor, so the listening problem is harder.
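Scoring a confusion is simple enough to sketch in a few lines of Python. The function names and the example trials below are hypothetical and made up purely for illustration; they are not data from the experiments.

```python
from collections import Counter

def score_response(response, target_word, distractor_word):
    """Label one two-talker trial as correct, a confusion, or some other error."""
    if response == target_word:
        return "correct"
    if response == distractor_word:
        return "confusion"  # the listener reported the wrong talker's word
    return "other error"

def outcome_rates(trials):
    """Fraction of trials that fall into each outcome category."""
    counts = Counter(score_response(*trial) for trial in trials)
    return {label: counts[label] / len(trials)
            for label in ("correct", "confusion", "other error")}

# Made-up (response, target word, distractor word) triples.
trials = [
    ("tomorrow", "tomorrow", "window"),
    ("window", "tomorrow", "window"),
    ("garden", "tomorrow", "window"),
    ("tomorrow", "tomorrow", "window"),
]

rates = outcome_rates(trials)
print(rates)
```

Computing these rates separately at each SNR, once for humans and once for the model, is the kind of comparison the plot above is summarizing.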

Listening is easier when voices come from different places

The experiments above showed differences when the properties of the two talkers were more or less similar to each other, but they were presented over headphones, with no information about spatial location. Let’s think back to our buzzy Italian restaurant. When you are actually talking in a restaurant, some of the other people talking are closer to or farther from the person you are trying to listen to. In fact, we know that the farther away the distractor talker is from the target talker, the easier it is for humans to hear the target talker. To show this, we used a large speaker array that let us place voices at different locations around a listener. Unlike the first set of experiments, these trials included information about where sounds were coming from, instead of just what they were. We were able to test the exact same examples on our model and compare its performance to that of human listeners. The model showed the same basic pattern as humans. The farther apart the voices were, the easier the task became.

Separating voices in space made listening easier

When the distractor voice moved farther away from the target voice, both humans and the model needed less help from the target being louder. In other words, spatial separation made it easier to pick out the voice of interest. Human data were scanned in from the original publication⁵ and replotted.

Finding something new with our model

Up to this point, the story was pretty clear. Our model behaves a lot like people do. It succeeds in the same situations and fails in many of the same ones. But the most exciting test was not whether the model could explain results we already knew about. It was whether the model could point us toward something new.

We were very interested in the spatial effects of attention. Because running experiments on people is slow, it is impractical to test every possible combination of target and distractor locations. But with the model, we could evaluate them all. By first exploring this huge range of listening conditions in the model, we found a few especially interesting patterns that we hadn’t seen reported in humans before. Using these specific location settings, we designed focused experiments to see whether humans behaved the same way.

At first, you might think separation is separation. In other words, if two voices are farther apart, listening should get easier no matter which direction they move. But the model predicted something more specific. Separating voices left or right made hearing the target word easier than separating them the same amount up or down. Intrigued, we ran the same experiment in humans, and the same pattern showed up. One possible reason for this is that left-right location gives your two ears different information, while up-down location depends on subtler cues shaped by the outer ear.

Left-right separation helped people pick out the correct word more than up-down separation

Moving the distractor left or right made the listening task much easier for both humans and the model. Moving it up or down helped much less. This suggests that the model learned to use spatial cues in a way that resembles human listening.

Second, we saw that when the target voice was directly in front of the model, even a small distance left or right produced a noticeable improvement in picking out the word that the target talker said. But when the target voice was off to the side, the model needed a much larger separation to get the same benefit. It’s as if the “spotlight” of attention is sharper at the center and becomes less precise off to the side. This isn’t something that had been established by past literature, but once again, when we tested it in humans, it was real.

Attention works best when the target voice is directly in front

When the target voice was directly in front, even a small separation from other voices made a big difference. But when the target voice was off to the side, the system needed much more separation before it helped. This pattern was the same in people and in the model, suggesting that attention is most precise at the center and less precise away from it.

What’s really interesting is that neither of these effects was engineered into the model or discovered from testing humans first. They emerged naturally from a model that was only trained to pick out one voice in a mixture using feature-based attention. The model went beyond just matching behavior and became a tool for exploring it and revealing patterns we never knew about before!

What this means

So what should we take away from all of this?

Our conclusion is that selective listening may not require a mysterious extra ingredient. A surprisingly simple idea went a long way: boost the features of the voice you want to attend to. When we built that idea into a model and trained it to recognize words in mixtures, it ended up successfully paying attention to the target talker in many of the same situations as human listeners.

What’s perhaps even more exciting is not just that the model succeeds when humans do, but that it fails in the same ways, too. That suggests that some of our difficulties hearing people in specific settings are not just mistakes or lapses, but reflect the limitations of the strategy our brains are using to separate one voice from another. There are situations where feature-based gain simply isn’t enough to cleanly pull the voices apart.

We still have a long way to go, of course. For example, in real life you do not always get a clean preview of someone’s voice before the restaurant gets buzzy. You can also decide to listen to “the person on my left” or “the person who just said my name,” which our model does not fully capture yet. But we are pretty excited about this model being a good first step.

Closing

Thank you so much for reading---I hope you think about this next time you’re at dinner with your friends! To make this study more digestible, I’ve skipped over a lot of details, but the full paper has much more about the model, the experiments, and the results. If you’re interested, you can read it here.

This is my first blog post, so I’m still figuring out what works. If anything was unclear, confusing, or particularly interesting, I’d love to hear your thoughts. If you’re curious about the project or would like to discuss any part of it further, please feel free to reach out.

References

1. Hubel, D. H., & Wiesel, T. N. Receptive fields of single neurones in the cat's striate cortex. The Journal of physiology, 148(3), 574–591 (1959). https://doi.org/10.1113/jphysiol.1959.sp006308

2. Treue, S., & Martínez Trujillo, J. C. Feature-based attention influences motion processing gain in macaque visual cortex. Nature 399, 575–579 (1999). https://doi.org/10.1038/21176

3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).

4. Lindsay, G. W., & Miller, K. D. How biological attention mechanisms improve task performance in a large-scale visual system model. eLife (2018). https://doi.org/10.7554/eLife.38105

5. Byrne, A. K., Conroy, C., & Kidd, G. Individual differences in speech-on-speech masking are correlated with cognitive and visual task performance. Journal of the Acoustical Society of America 154, 2137–2153 (2023). https://doi.org/10.1121/10.0021301

6. Griffith, I. M., Hess, R. P., & McDermott, J. H. Optimized feature gains explain and predict successes and failures of human selective listening. Nature Human Behaviour (2026). https://doi.org/10.1038/s41562-026-02414-7
