Latest Video & Multimodal Retrieval Papers: November 2025

by SLV Team

Hey guys! Check out the freshest research in video and multimodal retrieval as of November 10, 2025. This is your go-to spot for staying updated on the newest advancements in these exciting fields. For an even better reading experience and more papers, don't forget to visit the GitHub page. Let's dive in!

Video Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge | 2025-11-05 | |
| Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers | 2025-11-03 | |
| Multi-Focused Video Group Activities Hashing | 2025-11-03 | |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | 2025-11-02 | |
| Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum | 2025-10-31 | |
| Mitigating Semantic Collapse in Partially Relevant Video Retrieval | 2025-10-31 | Accepted to NeurIPS 2025. Code is available at https://github.com/admins97/MSC_PRVR |
| AVA: Towards Agentic Video Analytics with Vision Language Models | 2025-10-31 | Accepted to NSDI 2026, 19 pages, 12 figures, complementary evaluations and appendix |
| Learning World Models for Interactive Video Generation | 2025-10-29 | Project page: https://sites.google.com/view/vrag |
| MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence | 2025-10-24 | Accepted to NeurIPS 2025 D&B Track |
| Panorama: Fast-Track Nearest Neighbors | 2025-10-23 | |
| Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval | 2025-10-23 | |
| Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval | 2025-10-21 | 5 pages |
| VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models | 2025-10-20 | Accepted by NeurIPS 2025; Project Page: https://walkermitty.github.io/VimoRAG |
| RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba | 2025-10-18 | Extended version of ECCV 2024 paper arXiv:2407.01872. The dataset and code are released at https://github.com/KPeng9510/refAVA2 |
| Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval | 2025-10-14 | |

In the realm of video retrieval, the latest papers push forward how we search and interact with video content. One notable entry, Multi-Object Tracking Retrieval with LLaVA-Video, proposes a training-free solution to the MOT25-StAG Challenge: it applies an off-the-shelf video LLM to track and retrieve multiple objects without task-specific fine-tuning, which simplifies an otherwise complex video analysis pipeline. Another significant contribution is Vote-in-Context, which turns Vision Language Models (VLMs) into zero-shot rank fusers, combining the outputs of multiple retrievers without any fusion-specific training data. Meanwhile, Multi-Focused Video Group Activities Hashing tackles hashing of group activities in video, a capability relevant to understanding the social interactions and events captured in footage, with applications in surveillance, social media analysis, and event detection.
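
The Vote-in-Context idea of letting a VLM make the fusion decision is a departure from classical formula-based fusion. For a concrete point of reference, here is a minimal sketch of reciprocal rank fusion (RRF), the standard training-free baseline that zero-shot rank fusers are usually compared against; this is generic illustration code, not the method from the paper, and the function name and example IDs are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of video IDs into one ranking.

    Generic illustration, not the VLM-based fusion from Vote-in-Context.
    ranked_lists: list of lists, each ordered best-first by one retriever.
    k: smoothing constant from the standard RRF formula.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, video_id in enumerate(ranking, start=1):
            scores[video_id] += 1.0 / (k + rank)
    # Higher fused score = better; ties broken by ID for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))

# Example: three retrievers disagree; RRF rewards consistent top placement.
fused = reciprocal_rank_fusion([
    ["vid_3", "vid_1", "vid_7"],
    ["vid_1", "vid_3", "vid_9"],
    ["vid_3", "vid_9", "vid_1"],
])
print(fused)  # ['vid_3', 'vid_1', 'vid_9', 'vid_7']
```

By contrast, the paper delegates that fusion judgment to a VLM prompt rather than a fixed formula, which is what makes it a zero-shot "rank fuser" in the model sense.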

LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts introduces a valuable community resource: a benchmark for evaluating long video retrieval systems in complex, multimodal scenarios, which should help researchers build more robust and context-aware retrieval models. Towards Universal Video Retrieval aims to generalize video embeddings through a synthesized multimodal pyramid curriculum, working towards systems that can retrieve videos across different modalities and contexts. Another crucial problem is addressed in Mitigating Semantic Collapse in Partially Relevant Video Retrieval: when only part of a video is relevant to a query, representations can collapse and lose semantic detail, and this NeurIPS 2025 paper proposes a method to counteract that loss of semantic information. The code for this work is available on GitHub, facilitating further research and implementation.
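
Benchmarks like LoVR are ultimately consumed through rank-based metrics such as Recall@K. As a rough, benchmark-agnostic illustration (this is not LoVR's official protocol, and the diagonal ground-truth layout below is an assumption made for the toy example), Recall@K over a query-by-video similarity matrix can be computed like this:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top-k.

    Illustrative only: not any specific benchmark's evaluation protocol.
    similarity: (num_queries, num_videos) score matrix; by convention here,
    query i's ground-truth video is item i (a common benchmark layout).
    """
    # Argsort descending: best-scoring videos first for each query.
    ranking = np.argsort(-similarity, axis=1)
    ground_truth = np.arange(similarity.shape[0])[:, None]
    hits = (ranking[:, :k] == ground_truth).any(axis=1)
    return hits.mean()

# Toy example: 4 queries x 4 videos with the diagonal as ground truth.
sim = np.array([
    [0.9, 0.2, 0.1, 0.3],
    [0.4, 0.3, 0.8, 0.1],   # ground-truth video 1 only ranks 3rd here
    [0.1, 0.2, 0.7, 0.3],
    [0.2, 0.1, 0.3, 0.6],
])
print(recall_at_k(sim, 1), recall_at_k(sim, 3))  # 0.75 1.0
```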

Agentic video analytics with Vision Language Models is explored in AVA: Towards Agentic Video Analytics with Vision Language Models. Accepted to NSDI 2026, this 19-page paper with 12 figures includes complementary evaluations and an appendix, highlighting the potential of VLMs for understanding and interacting with video content. Additionally, Learning World Models for Interactive Video Generation shows how learned world models can be used to create and manipulate video content interactively; more details are available on the project page. MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark, accepted to the NeurIPS 2025 D&B Track, contributes a benchmark with multi-level visual correspondence for evaluating multi-modal untrimmed video retrieval, adding another valuable resource for researchers. Finally, Panorama: Fast-Track Nearest Neighbors speeds up nearest neighbor search, the core operation behind most embedding-based video retrieval and analysis systems.
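
Panorama's listing doesn't describe its index structure, but the operation it accelerates is easy to pin down: given a query embedding, find the most similar items in a large embedding table. A brute-force cosine-similarity reference, the exact search that fast-track indexes are benchmarked against, might look like this sketch (the matrix sizes are arbitrary placeholders):

```python
import numpy as np

def nearest_neighbors(query, index, top_k=5):
    """Brute-force cosine-similarity search.

    Exact baseline only; Panorama's actual index structure is not shown here.
    """
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q                      # cosine similarity to every item
    top = np.argpartition(-scores, top_k - 1)[:top_k]
    return top[np.argsort(-scores[top])]  # top-k ids, best first

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 512))    # stand-in for video embeddings
query = rng.normal(size=512)
print(nearest_neighbors(query, index, top_k=5))
```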

Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval focuses on refining the alignment between text and video content, enhancing the accuracy of text-video retrieval systems. Similarly, Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval explores how frame differences can guide dynamic region perception, further improving text-video retrieval. This 5-page paper provides a detailed analysis of this technique. The paper VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models, accepted by NeurIPS 2025, introduces a retrieval-augmented method for generating 3D motions from video, with a project page available for more information. RefAtomNet++: Advancing Referring Atomic Video Action Recognition extends previous work in atomic video action recognition, utilizing Semantic Retrieval based Multi-Trajectory Mamba. This extended version of an ECCV 2024 paper includes a released dataset and code, available on GitHub. Lastly, Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval presents a dual learning approach to improve video retrieval accuracy, addressing the challenges of partially relevant content. These papers collectively showcase the dynamic and evolving landscape of video retrieval research, paving the way for more intelligent and efficient video understanding systems.
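
Much of the text-video work above starts from the symmetric contrastive (CLIP-style InfoNCE) objective that aligns text and video embeddings and then rebalances or augments it. For orientation, here is a minimal NumPy sketch of that plain baseline loss; it is not the rebalanced variant from any of these papers, and the temperature value is just a conventional default:

```python
import numpy as np

def clip_style_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/video embeddings.

    Plain CLIP-style baseline, not any paper's rebalanced objective.
    text_emb, video_emb: (batch, dim) arrays where row i of each is a
    matched pair; every other row in the batch acts as a negative.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (batch, batch) similarities
    labels = np.arange(len(logits))           # matched pairs on the diagonal

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
print(clip_style_contrastive_loss(rng.normal(size=(8, 256)),
                                  rng.normal(size=(8, 256))))
```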

Multimodal Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| Toward Clinically Grounded Foundation Models in Pathology | 2025-11-06 | |
| Caption Injection for Optimization in Generative Search Engine | 2025-11-06 | |
| Evaluating Perspectival Biases in Cross-Modal Retrieval | 2025-11-03 | |
| RzenEmbed: Towards Comprehensive Multimodal Retrieval | 2025-10-31 | |
| CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning | 2025-10-31 | Accepted by SIGIR-AP 2025 |
| Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation | 2025-10-28 | https://github.com/alexmartin1722/mirage |
| Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation | 2025-10-26 | Accepted at NeurIPS 2025 UniReps Workshop |
| Open Multimodal Retrieval-Augmented Factual Image Generation | 2025-10-26 | Preprint |
| Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval | 2025-10-25 | NeurIPS 2025 |
| TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval | 2025-10-24 | |
| MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval | 2025-10-17 | |
| Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding | 2025-10-17 | |
| Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking | 2025-10-16 | |
| Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval | 2025-10-16 | 12 pages, 6 figures, submitted for review |
| Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | 2025-10-14 | NeurIPS 2025; 27 pages, 6 figures |

In the dynamic field of multimodal retrieval, the latest research papers explore innovative ways to bridge the gap between different data modalities, such as text, images, and audio, to enhance information retrieval. One significant paper, Toward Clinically Grounded Foundation Models in Pathology, investigates the application of foundation models in the medical domain, specifically in pathology. This work aims to improve diagnostic accuracy and efficiency by leveraging multimodal data. Another notable contribution is Caption Injection for Optimization in Generative Search Engine, which explores techniques to optimize generative search engines by injecting captions, thereby enhancing the relevance and quality of search results. The paper Evaluating Perspectival Biases in Cross-Modal Retrieval addresses a critical issue in multimodal retrieval: the presence of biases. This research provides insights into identifying and mitigating these biases to ensure fairness and accuracy in retrieval systems. RzenEmbed: Towards Comprehensive Multimodal Retrieval presents a novel approach to creating comprehensive multimodal embeddings, which are crucial for effectively representing and retrieving information across different modalities. This work contributes to the development of more robust and versatile retrieval systems.

CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning, accepted by SIGIR-AP 2025, explores the integration of planning capabilities in agentic multimodal retrieval augmented generation systems. This research opens up new possibilities for creating intelligent agents that can effectively retrieve and generate multimodal content. The paper titled Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation introduces a framework for evaluating multimodal retrieval augmented generation systems, providing valuable tools and metrics for assessing the performance of these systems. The associated code is available on GitHub. Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation, accepted at the NeurIPS 2025 UniReps Workshop, focuses on adaptive techniques in multimodal retrieval-augmented generation, showcasing how systems can dynamically adjust their retrieval and generation strategies. Open Multimodal Retrieval-Augmented Factual Image Generation, currently a preprint, tackles generating factual images with multimodal retrieval-augmented methods, advancing the state of the art in image generation. Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval, a NeurIPS 2025 paper, introduces a reasoning-driven framework for multimodal retrieval, leveraging Multimodal Large Language Models (MLLMs) to enhance retrieval efficiency and accuracy.
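
Across these RAG papers the outer loop is the same retrieve-then-generate skeleton, with the interesting differences living inside the retriever and the generator. The schematic below shows only that skeleton; the embed function is a random placeholder (so the retrieval here is arbitrary), and a real system would plug in a multimodal encoder and an actual generation call:

```python
import numpy as np

def embed(item):
    """Placeholder encoder: random but deterministic per item within a run.

    A real multimodal RAG system would use an image/text encoder here.
    """
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

def retrieve(query, corpus, top_k=2):
    """Rank corpus items (captions, OCR text, transcripts, ...) by similarity."""
    q = embed(query)
    scores = [float(embed(doc) @ q) for doc in corpus]
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in order]

def build_prompt(query, evidence):
    """Generation itself is out of scope; we just assemble the augmented prompt."""
    context = "\n".join(f"- {e}" for e in evidence)
    return f"Answer using only the evidence below.\n{context}\n\nQuestion: {query}"

corpus = [
    "caption: a windsock dancing in strong wind",
    "transcript: the speaker explains retrieval-augmented generation",
    "caption: a chart of monthly rainfall",
]
question = "What is moving in the wind?"
print(build_prompt(question, retrieve(question, corpus)))
```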

TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval explores the challenges of grounding time series data in context for multimodal applications, providing insights into effectively integrating temporal data with other modalities. The paper MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval addresses the importance of modality composition awareness in multimodal retrieval, aiming to create more robust systems that can handle complex multimodal queries. Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding provides a comprehensive survey of multimodal retrieval-augmented generation techniques for document understanding, offering a valuable resource for researchers in this area. Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking investigates the effectiveness of different learning paradigms, such as supervised fine-tuning and contrastive learning, for improving multimodal Large Language Model reranking. The paper Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval explores the use of modality-aware knowledge graphs in multimodal retrieval-augmented generation systems for unstructured data. This 12-page paper with 6 figures has been submitted for review. Lastly, Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval, a NeurIPS 2025 paper, introduces promptable embeddings for attribute-focused image retrieval, allowing users to retrieve images based on specific attributes. This 27-page paper with 6 figures provides a detailed analysis of this technique. These papers collectively represent the cutting-edge research in multimodal retrieval, highlighting the diverse approaches and challenges in this rapidly evolving field.
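
As a last concrete anchor, the "hybrid retrieval" mentioned for the knowledge-graph RAG paper generally means combining a dense embedding score with a sparse or graph-derived signal. The sketch below shows the simplest form of that idea, a weighted score fusion; the weighting scheme and the assumption of pre-normalised scores are illustrative choices, not the paper's actual pipeline:

```python
def hybrid_scores(dense, sparse, alpha=0.6):
    """Blend dense-embedding and keyword/graph scores per document.

    Generic weighted fusion; not the modality-aware knowledge-graph pipeline.
    dense, sparse: dicts mapping doc_id -> score from each retriever,
    assumed already normalised to comparable ranges.
    alpha: weight on the dense score; 1 - alpha goes to the sparse score.
    """
    doc_ids = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in doc_ids}
    return sorted(fused.items(), key=lambda kv: -kv[1])

dense = {"doc_a": 0.91, "doc_b": 0.40, "doc_c": 0.75}
sparse = {"doc_b": 0.95, "doc_c": 0.30, "doc_d": 0.60}
# doc_b ends up first: a decent dense score plus the best sparse score.
print(hybrid_scores(dense, sparse))
```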