Latest Video & Multimodal Retrieval Papers: November 2025

by SLV Team

Hey guys! Check out the freshest research in video and multimodal retrieval as of November 10, 2025. This is your go-to spot for staying updated on the newest advancements in these exciting fields. For an even better reading experience and more papers, don't forget to visit the GitHub page. Let's dive in!

Video Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge | 2025-11-05 | |
| Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers | 2025-11-03 | |
| Multi-Focused Video Group Activities Hashing | 2025-11-03 | |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | 2025-11-02 | |
| Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum | 2025-10-31 | |
| Mitigating Semantic Collapse in Partially Relevant Video Retrieval | 2025-10-31 | Accepted to NeurIPS 2025. Code is available at https://github.com/admins97/MSC_PRVR |
| AVA: Towards Agentic Video Analytics with Vision Language Models | 2025-10-31 | Accepted to NSDI 2026, 19 pages, 12 figures, complementary evaluations and appendix |
| Learning World Models for Interactive Video Generation | 2025-10-29 | Project page: https://sites.google.com/view/vrag |
| MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence | 2025-10-24 | Accepted to NeurIPS 2025 D&B Track |
| Panorama: Fast-Track Nearest Neighbors | 2025-10-23 | |
| Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval | 2025-10-23 | |
| Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval | 2025-10-21 | 5 pages |
| VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models | 2025-10-20 | Accepted by NeurIPS 2025; Project Page: https://walkermitty.github.io/VimoRAG |
| RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba | 2025-10-18 | Extended version of ECCV 2024 paper arXiv:2407.01872. The dataset and code are released at https://github.com/KPeng9510/refAVA2 |
| Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval | 2025-10-14 | |

In the realm of video retrieval, the latest papers push forward how we search and interact with video content. One notable entry, Multi-Object Tracking Retrieval with LLaVA-Video, proposes a training-free solution to the MOT25-StAG Challenge: it applies an off-the-shelf video LLM to track and retrieve multiple objects without task-specific fine-tuning, which simplifies an otherwise complex video analysis pipeline. Another significant contribution is Vote-in-Context, which turns Vision Language Models (VLMs) into zero-shot rank fusers, combining the outputs of multiple retrievers without any fusion-specific training data. Meanwhile, Multi-Focused Video Group Activities Hashing tackles hashing of group activities in video, a capability relevant to understanding the social interactions and events captured in footage, with applications in surveillance, social media analysis, and event detection.
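
The Vote-in-Context idea of letting a VLM make the fusion decision is a departure from classical formula-based fusion. For a concrete point of reference, here is a minimal sketch of reciprocal rank fusion (RRF), the standard training-free baseline that zero-shot rank fusers are usually compared against; this is generic illustration code, not the method from the paper, and the function name and example IDs are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of video IDs into one ranking.

    Generic illustration, not the VLM-based fusion from Vote-in-Context.
    ranked_lists: list of lists, each ordered best-first by one retriever.
    k: smoothing constant from the standard RRF formula.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, video_id in enumerate(ranking, start=1):
            scores[video_id] += 1.0 / (k + rank)
    # Higher fused score = better; ties broken by ID for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))

# Example: three retrievers disagree; RRF rewards consistent top placement.
fused = reciprocal_rank_fusion([
    ["vid_3", "vid_1", "vid_7"],
    ["vid_1", "vid_3", "vid_9"],
    ["vid_3", "vid_9", "vid_1"],
])
print(fused)  # ['vid_3', 'vid_1', 'vid_9', 'vid_7']
```

By contrast, the paper delegates that fusion judgment to a VLM prompt rather than a fixed formula, which is what makes it a zero-shot "rank fuser" in the model sense.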

LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts introduces a valuable community resource: a benchmark for evaluating long video retrieval systems in complex, multimodal scenarios, which should help researchers build more robust and context-aware retrieval models. Towards Universal Video Retrieval aims to generalize video embeddings through a synthesized multimodal pyramid curriculum, working towards systems that can retrieve videos across different modalities and contexts. Another crucial problem is addressed in Mitigating Semantic Collapse in Partially Relevant Video Retrieval: when only part of a video is relevant to a query, representations can collapse and lose semantic detail, and this NeurIPS 2025 paper proposes a method to counteract that loss of semantic information. The code for this work is available on GitHub, facilitating further research and implementation.
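
Benchmarks like LoVR are ultimately consumed through rank-based metrics such as Recall@K. As a rough, benchmark-agnostic illustration (this is not LoVR's official protocol, and the diagonal ground-truth layout below is an assumption made for the toy example), Recall@K over a query-by-video similarity matrix can be computed like this:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top-k.

    Illustrative only: not any specific benchmark's evaluation protocol.
    similarity: (num_queries, num_videos) score matrix; by convention here,
    query i's ground-truth video is item i (a common benchmark layout).
    """
    # Argsort descending: best-scoring videos first for each query.
    ranking = np.argsort(-similarity, axis=1)
    ground_truth = np.arange(similarity.shape[0])[:, None]
    hits = (ranking[:, :k] == ground_truth).any(axis=1)
    return hits.mean()

# Toy example: 4 queries x 4 videos with the diagonal as ground truth.
sim = np.array([
    [0.9, 0.2, 0.1, 0.3],
    [0.4, 0.3, 0.8, 0.1],   # ground-truth video 1 only ranks 3rd here
    [0.1, 0.2, 0.7, 0.3],
    [0.2, 0.1, 0.3, 0.6],
])
print(recall_at_k(sim, 1), recall_at_k(sim, 3))  # 0.75 1.0
```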

Agentic video analytics with Vision Language Models is explored in AVA: Towards Agentic Video Analytics with Vision Language Models. Accepted to NSDI 2026, this 19-page paper with 12 figures includes complementary evaluations and an appendix, highlighting the potential of VLMs for understanding and interacting with video content. Additionally, Learning World Models for Interactive Video Generation shows how learned world models can be used to create and manipulate video content interactively; more details are available on the project page. MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark, accepted to the NeurIPS 2025 D&B Track, contributes a benchmark with multi-level visual correspondence for evaluating multi-modal untrimmed video retrieval, adding another valuable resource for researchers. Finally, Panorama: Fast-Track Nearest Neighbors speeds up nearest neighbor search, the core operation behind most embedding-based video retrieval and analysis systems.
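
Panorama's listing doesn't describe its index structure, but the operation it accelerates is easy to pin down: given a query embedding, find the most similar items in a large embedding table. A brute-force cosine-similarity reference, the exact search that fast-track indexes are benchmarked against, might look like this sketch (the matrix sizes are arbitrary placeholders):

```python
import numpy as np

def nearest_neighbors(query, index, top_k=5):
    """Brute-force cosine-similarity search.

    Exact baseline only; Panorama's actual index structure is not shown here.
    """
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q                      # cosine similarity to every item
    top = np.argpartition(-scores, top_k - 1)[:top_k]
    return top[np.argsort(-scores[top])]  # top-k ids, best first

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 512))    # stand-in for video embeddings
query = rng.normal(size=512)
print(nearest_neighbors(query, index, top_k=5))
```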

Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval focuses on refining the alignment between text and video content, enhancing the accuracy of text-video retrieval systems. Similarly, Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval explores how frame differences can guide dynamic region perception, further improving text-video retrieval. This 5-page paper provides a detailed analysis of this technique. The paper VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models, accepted by NeurIPS 2025, introduces a retrieval-augmented method for generating 3D motions from video, with a project page available for more information. RefAtomNet++: Advancing Referring Atomic Video Action Recognition extends previous work in atomic video action recognition, utilizing Semantic Retrieval based Multi-Trajectory Mamba. This extended version of an ECCV 2024 paper includes a released dataset and code, available on GitHub. Lastly, Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval presents a dual learning approach to improve video retrieval accuracy, addressing the challenges of partially relevant content. These papers collectively showcase the dynamic and evolving landscape of video retrieval research, paving the way for more intelligent and efficient video understanding systems.
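
Much of the text-video work above starts from the symmetric contrastive (CLIP-style InfoNCE) objective that aligns text and video embeddings and then rebalances or augments it. For orientation, here is a minimal NumPy sketch of that plain baseline loss; it is not the rebalanced variant from any of these papers, and the temperature value is just a conventional default:

```python
import numpy as np

def clip_style_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/video embeddings.

    Plain CLIP-style baseline, not any paper's rebalanced objective.
    text_emb, video_emb: (batch, dim) arrays where row i of each is a
    matched pair; every other row in the batch acts as a negative.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (batch, batch) similarities
    labels = np.arange(len(logits))           # matched pairs on the diagonal

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
print(clip_style_contrastive_loss(rng.normal(size=(8, 256)),
                                  rng.normal(size=(8, 256))))
```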

Multimodal Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| Toward Clinically Grounded Foundation Models in Pathology | 2025-11-06 | |
| Caption Injection for Optimization in Generative Search Engine | 2025-11-06 | |
| Evaluating Perspectival Biases in Cross-Modal Retrieval | 2025-11-03 | |
| RzenEmbed: Towards Comprehensive Multimodal Retrieval | 2025-10-31 | |
| CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning | 2025-10-31 | Accepted by SIGIR-AP 2025 |
| Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation | 2025-10-28 | https://github.com/alexmartin1722/mirage |
| Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation | 2025-10-26 | Accepted at NeurIPS 2025 UniReps Workshop |
| Open Multimodal Retrieval-Augmented Factual Image Generation | 2025-10-26 | Preprint |
| Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval | 2025-10-25 | NeurIPS 2025 |
| TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval | 2025-10-24 | |
| MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval | 2025-10-17 | |
| Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding | 2025-10-17 | |
| Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking | 2025-10-16 | |
| Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval | 2025-10-16 | 12 pages, 6 figures, submitted for review |
| Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | 2025-10-14 | NeurIPS 2025; 27 pages, 6 figures |

In the dynamic field of multimodal retrieval, the latest research papers explore innovative ways to bridge the gap between different data modalities, such as text, images, and audio, to enhance information retrieval. One significant paper, Toward Clinically Grounded Foundation Models in Pathology, investigates the application of foundation models in the medical domain, specifically in pathology. This work aims to improve diagnostic accuracy and efficiency by leveraging multimodal data. Another notable contribution is Caption Injection for Optimization in Generative Search Engine, which explores techniques to optimize generative search engines by injecting captions, thereby enhancing the relevance and quality of search results. The paper Evaluating Perspectival Biases in Cross-Modal Retrieval addresses a critical issue in multimodal retrieval: the presence of biases. This research provides insights into identifying and mitigating these biases to ensure fairness and accuracy in retrieval systems. RzenEmbed: Towards Comprehensive Multimodal Retrieval presents a novel approach to creating comprehensive multimodal embeddings, which are crucial for effectively representing and retrieving information across different modalities. This work contributes to the development of more robust and versatile retrieval systems.

CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning, accepted by SIGIR-AP 2025, explores the integration of planning capabilities in agentic multimodal retrieval augmented generation systems. This research opens up new possibilities for creating intelligent agents that can effectively retrieve and generate multimodal content. The paper titled Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation introduces a framework for evaluating multimodal retrieval augmented generation systems, providing valuable tools and metrics for assessing the performance of these systems. The associated code is available on GitHub. Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation, accepted at the NeurIPS 2025 UniReps Workshop, focuses on adaptive techniques in multimodal retrieval-augmented generation, showcasing how systems can dynamically adjust their retrieval and generation strategies. Open Multimodal Retrieval-Augmented Factual Image Generation, currently a preprint, tackles generating factual images with multimodal retrieval-augmented methods, advancing the state of the art in image generation. Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval, a NeurIPS 2025 paper, introduces a reasoning-driven framework for multimodal retrieval, leveraging Multimodal Large Language Models (MLLMs) to enhance retrieval efficiency and accuracy.
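
Across these RAG papers the outer loop is the same retrieve-then-generate skeleton, with the interesting differences living inside the retriever and the generator. The schematic below shows only that skeleton; the embed function is a random placeholder (so the retrieval here is arbitrary), and a real system would plug in a multimodal encoder and an actual generation call:

```python
import numpy as np

def embed(item):
    """Placeholder encoder: random but deterministic per item within a run.

    A real multimodal RAG system would use an image/text encoder here.
    """
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

def retrieve(query, corpus, top_k=2):
    """Rank corpus items (captions, OCR text, transcripts, ...) by similarity."""
    q = embed(query)
    scores = [float(embed(doc) @ q) for doc in corpus]
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in order]

def build_prompt(query, evidence):
    """Generation itself is out of scope; we just assemble the augmented prompt."""
    context = "\n".join(f"- {e}" for e in evidence)
    return f"Answer using only the evidence below.\n{context}\n\nQuestion: {query}"

corpus = [
    "caption: a windsock dancing in strong wind",
    "transcript: the speaker explains retrieval-augmented generation",
    "caption: a chart of monthly rainfall",
]
question = "What is moving in the wind?"
print(build_prompt(question, retrieve(question, corpus)))
```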

TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval explores the challenges of grounding time series data in context for multimodal applications, providing insights into effectively integrating temporal data with other modalities. The paper MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval addresses the importance of modality composition awareness in multimodal retrieval, aiming to create more robust systems that can handle complex multimodal queries. Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding provides a comprehensive survey of multimodal retrieval-augmented generation techniques for document understanding, offering a valuable resource for researchers in this area. Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking investigates the effectiveness of different learning paradigms, such as supervised fine-tuning and contrastive learning, for improving multimodal Large Language Model reranking. The paper Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval explores the use of modality-aware knowledge graphs in multimodal retrieval-augmented generation systems for unstructured data. This 12-page paper with 6 figures has been submitted for review. Lastly, Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval, a NeurIPS 2025 paper, introduces promptable embeddings for attribute-focused image retrieval, allowing users to retrieve images based on specific attributes. This 27-page paper with 6 figures provides a detailed analysis of this technique. These papers collectively represent the cutting-edge research in multimodal retrieval, highlighting the diverse approaches and challenges in this rapidly evolving field.
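
As a last concrete anchor, the "hybrid retrieval" mentioned for the knowledge-graph RAG paper generally means combining a dense embedding score with a sparse or graph-derived signal. The sketch below shows the simplest form of that idea, a weighted score fusion; the weighting scheme and the assumption of pre-normalised scores are illustrative choices, not the paper's actual pipeline:

```python
def hybrid_scores(dense, sparse, alpha=0.6):
    """Blend dense-embedding and keyword/graph scores per document.

    Generic weighted fusion; not the modality-aware knowledge-graph pipeline.
    dense, sparse: dicts mapping doc_id -> score from each retriever,
    assumed already normalised to comparable ranges.
    alpha: weight on the dense score; 1 - alpha goes to the sparse score.
    """
    doc_ids = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in doc_ids}
    return sorted(fused.items(), key=lambda kv: -kv[1])

dense = {"doc_a": 0.91, "doc_b": 0.40, "doc_c": 0.75}
sparse = {"doc_b": 0.95, "doc_c": 0.30, "doc_d": 0.60}
# doc_b ends up first: a decent dense score plus the best sparse score.
print(hybrid_scores(dense, sparse))
```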