SAM-Audio: A Find for Spies

TG AI News · December 17, 2025, 4:48 PM
Meta continues to expand the capabilities of SAM (Segment Anything Model), and an audio modality has now been added: you can select an object in a video and hear only the sound coming from that point. This is, of course, a find for spies — select two people talking in a video and hear only their conversation, stripped of all other noise. Other applications are easy to imagine.

The project looks quite interesting. At its core is the Perception Encoder Audiovisual (PE-AV), which acts as the ears of the system. The architecture is a flow-matching diffusion transformer that takes an audio mix and a prompt as input and generates two output tracks: the target audio and the residual (everything else).

The model can separate sound based on three types of prompts, which can be combined: text, visual (clicking on an object in the video), and span prompting (highlighting the time segment in which the sound occurs). It cannot yet isolate sources that are very similar to one another — for example, cutting one singer out of a choir. At the same time, the model runs faster than real time (real-time factor ≈ 0.7, so a 10-second clip takes about 7 seconds to process) and scales from 500M to 3B parameters. The weights and code are open source, but under a non-commercial license (CC-BY-NC 4.0).
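The post does not show the actual API, but the combined prompting it describes — text, a click on a video frame, and a time span feeding one separation query — might be modeled roughly like this. A purely hypothetical sketch; the class name, fields, and validation logic are invented for illustration and do not come from the SAM-Audio codebase:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SeparationPrompt:
    """Hypothetical container for SAM-Audio's three combinable prompt types."""
    text: Optional[str] = None                 # e.g. "the person speaking on the left"
    click: Optional[Tuple[int, int]] = None    # (x, y) pixel in a video frame
    span: Optional[Tuple[float, float]] = None # (start_s, end_s) when the sound occurs

    def is_valid(self) -> bool:
        # At least one prompt modality must be supplied.
        return any(v is not None for v in (self.text, self.click, self.span))

# All three modalities combined in a single request, as the post describes:
prompt = SeparationPrompt(
    text="speech of the person on the left",
    click=(312, 188),
    span=(4.0, 9.5),
)
print(prompt.is_valid())  # True
```

In a real pipeline a request like this would be passed, together with the mixed audio, to the separation model, which would return the target and residual tracks.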
