Research Papers

15 articles in this category

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel...

Mar 26, 2026 · 4 min read

arXiv

Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their...

Mar 26, 2026 · 4 min read

arXiv

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle...

Mar 26, 2026 · 4 min read

arXiv

RefAlign: Representation Alignment for Reference-to-Video Generation

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as...

Mar 26, 2026 · 4 min read

arXiv

Vega: Learning to Drive with Natural Language Instructions

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene...

Mar 26, 2026 · 4 min read

arXiv

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and...

Mar 26, 2026 · 4 min read

arXiv

PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully...

Mar 26, 2026 · 4 min read

arXiv

MegaFlow: Zero-Shot Large Displacement Optical Flow

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely...

Mar 26, 2026 · 4 min read

arXiv

How good was my shot? Quantifying Player Skill Level in Table Tennis

Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill...

Mar 26, 2026 · 4 min read

arXiv

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and...

Mar 26, 2026 · 4 min read

arXiv

Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based...

Mar 26, 2026 · 4 min read

arXiv

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal...

Mar 26, 2026 · 4 min read

arXiv

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural...

Mar 26, 2026 · 4 min read

arXiv

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during...

Mar 26, 2026 · 4 min read

arXiv

PixelSmile: Toward Fine-Grained Facial Expression Editing

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective...

Mar 26, 2026 · 4 min read