|
Shuangrui Ding (丁双睿)
I am a final-year Ph.D. student in the Multi-Media Lab at The Chinese University of Hong Kong, supervised by Prof. Dahua Lin. My research focuses on vision-language models, video grounding, and long-horizon agent evaluation.
I obtained my Master's degree in Electrical Engineering from Shanghai Jiao Tong University in 2023, where I was advised by Prof. Hongkai Xiong. Prior to that, I earned a Bachelor's degree in Computer Science from the University of Michigan, along with a dual degree in Electrical and Computer Engineering from Shanghai Jiao Tong University in 2021.
I am currently a Research Scientist Intern at Meta Superintelligence Labs, Segment Anything Team, returning for my second internship with the team and working on video grounding with Jie Lei. Previously, I worked on SAM 3 with Nicolas Carion at Meta and on multimodal LLMs with Jiaqi Wang at Shanghai AI Laboratory.
I expect to graduate in Summer 2027 and am actively seeking Research Scientist opportunities in multimodal AI, video VLMs, and agent evaluation.
Email /
CV /
Google Scholar /
GitHub /
LinkedIn
|
|
|
News
[May. 2026] Released WildClawBench, a real-world long-horizon agent benchmark with 60 human-authored multimodal tasks.
[May. 2026] SetCon is online, introducing set-level concept prediction for open-ended referring segmentation.
[Jan. 2026] Three papers have been accepted at ICLR 2026.
[Nov. 2025] We released SAM 3, a unified model for detection, segmentation, and tracking of objects in images and video using text, exemplar, and visual prompts.
[Aug. 2025] Invited talk at LSVOS workshop at ICCV 2025.
[Jun. 2025] SAM2Long is accepted at ICCV 2025.
[May. 2025] SongComposer is accepted at the ACL 2025 main conference, and SongGen is accepted at ICML 2025.
[Mar. 2025] Three papers have been accepted at CVPR 2025.
|
|
Selected Publications (* equal contribution, † project lead)
|
|
|
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Shuangrui Ding*†, Xuanlang Dai*, Long Xing*, Shengyuan Ding, Ziyu Liu, Jingyi Yang, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang
arXiv, 2026
Leaderboard /
arXiv /
code /
dataset
A real-world, long-horizon agent benchmark with 60 human-authored multimodal tasks across productivity, coding, search, social interaction, creative synthesis, and safety.
|
|
|
SAM 3: Segment Anything with Concepts
Nicolas Carion*, Laura Gustafson*, Yuan-Ting Hu*, Shoubhik Debnath*, Ronghang Hu*, Didac Suris*, Chaitanya Ryali*, Kalyan Vasudev Alwala*, Haitham Khedr*, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu°, Tsung-Han Wu°, Yu Zhou°, Liliane Momeni°, Rishi Hazra°, Shuangrui Ding°, Sagar Vaze°, Francois Porcher°, Feng Li°, Siyuan Li°, Aishwarya Kamath°, Ho Kei Cheng°, Piotr Dollar†, Nikhila Ravi†, Kate Saenko†, Pengchuan Zhang†, Christoph Feichtenhofer†
ICLR, 2026
Project Page /
paper /
blog /
demo /
code /
HF🤗
Unify detection, segmentation, and tracking of any concept in images and video using text and exemplar prompts.
|
|
|
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
Zhixiong Zhang,
Yizhuo Li,
Shuangrui Ding†,
Yuhang Zang,
Shengyuan Ding,
Long Xing,
Yibin Wang,
Qiaosheng Zhang,
Jiaqi Wang
arXiv, 2026
arXiv /
code /
HF model /
HF data
Reformulate open-ended referring segmentation as hierarchical set-level concept prediction, enabling interpretable mask-set decoding for image and video grounding.
|
|
|
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Zhixiong Zhang*,
Shuangrui Ding*,
Xiaoyi Dong,
Songxin He,
Jianfan Lin,
Junsong Tang,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
ICLR, 2026
Project Page /
arXiv /
code /
HF model /
HF benchmark
Propose a VOS framework that leverages VLM for scene-level concept modeling, along with a new benchmark.
|
|
|
SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation
Shuangrui Ding*, Zihan Liu*, Xiaoyi Dong, Pan Zhang, Rui Qian, Junhao Huang, Conghui He,
Dahua Lin,
Jiaqi Wang
ACL main, 2025
arXiv
/
code
/
invited talk
/
demo page
A language large model that understands and generates melodies and lyrics in symbolic song representations.
|
|
|
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian*,
Shuangrui Ding*,
Xiaoyi Dong,
Pan Zhang,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
CVPR, 2025
arXiv
/
code
Asynchronous operation of disentangled perception, decision, and reaction modules for online video LLMs.
|
|
|
Streaming Long Video Understanding with Large Language Models
Rui Qian,
Xiaoyi Dong,
Pan Zhang,
Yuhang Zang,
Shuangrui Ding,
Dahua Lin,
Jiaqi Wang
NeurIPS, 2024
arXiv
Long video understanding with disentangled streaming video encoding and LLM reasoning.
|
|
|
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Rui Qian,
Shuangrui Ding,
Dahua Lin,
ECCV, 2024
arXiv
Efficiently adapt image foundation models to video domain in an object-centric manner.
|
|
|
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
Shuangrui Ding*,
Rui Qian*,
Haohang Xu,
Dahua Lin,
Hongkai Xiong
ECCV, 2024
arXiv
/
code
Learn robust spatio-temporal correspondence on top of a DINO-pretrained Transformer without any annotation.
|
|
|
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Shuangrui Ding,
Peisen Zhao,
Xiaopeng Zhang,
Rui Qian,
Hongkai Xiong,
Qi Tian
ICCV, 2023
Project page
/
arXiv
/
pdf
/
code
/
poster
/
slides
Propose token pruning strategy for video Transformers to offer a competitive speed-accuracy trade-off without additional training or parameters.
|
|
|
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
Rui Qian,
Shuangrui Ding,
Xian Liu,
Dahua Lin
ICCV, 2023
arXiv
/
pdf
/
code
/
poster
Jointly utilizes high-level semantics and low-level temporal correspondence for object-centric learning in videos without any supervision.
|
|
|
Static and Dynamic Concepts for Self-supervised Video Representation Learning
Rui Qian,
Shuangrui Ding, Xian Liu,
Dahua Lin
ECCV, 2022
arXiv
/
code
/
slide
Learn static and dynamic visual concepts in videos to aggregate local patterns with similar semantics to boost unsupervised video representation.
|
|
|
Dual Contrastive Learning for Spatio-temporal Representation
Shuangrui Ding,
Rui Qian, Hongkai Xiong
ACM MM, 2022
arXiv
/
poster
/
video
/
code
Present a novel dual contrastive formulation to decouple the static/dynamic features and thus mitigate the background bias.
|
|
|
Motion-aware Contrastive Video Representation Learning via Foreground-background Merging
Shuangrui Ding,
Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, Hongkai Xiong
CVPR, 2022
Project page
/
arXiv
/
code
/
Chinese coverage
/
poster
Mitigate the background bias in self-supervised video representation learning via copy-pasting the foreground onto the other backgrounds.
|
|
|
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
Rui Qian, Yuxi Li, Huabin Liu, John See,
Shuangrui Ding, Xian Liu, Dian Li,
Weiyao Lin
ICCV, 2021
arXiv /
code
Self-supervised video representation learning from the perspective of both high-level semantics and lower-level characteristics
|
|
|
Towards More Practical Adversarial Attacks on Graph Neural Networks
Jiaqi Ma*,
Shuangrui Ding*,
Qiaozhu Mei
NeurIPS, 2020
arXiv
/
slides
/
video
/
code
Exploiting the structural inductive biases of GNNs, the restricted black-box adversarial attacks can be conducted effectively.
|
|
Awards
CUHK Vice-Chancellor's Ph.D. Scholarship (80,000 HKD), Graduate school of CUHK. 2023
Graduate National Scholarship (Top 2%), Ministry of Education of China. 2022
Shanghai Excellent Graduate (Top 5%), Shanghai Municipal Education Commission. 2021
Finalist winner (Top 0.3%), Mathematical Contest in Modeling. 2019
National Scholarship (Top 2%), Ministry of Education of China. 2018
|
|
Professional Services
Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AAAI, ACL, ACM MM.
|
|
Invited Talks
Keynote Speaker, LSVOS Workshop at ICCV 2025, Honolulu, Hawaii. Talk: From Pixels to Meaning: Towards Reliable Video Object Segmentation across Frames.
Invited Speaker, InternLM Community Open Mic, Mar. 2024. Talk: When Songwriting Meets Large Models: SongComposer as an AI Musician.
|
|
Misc
1. My favorite sports is soccer. I was the captain of UM-SJTU JI soccer team during season 2018. Besides, I am a super fan of Manchester City in Premier League.
2. I am proud to have graduated from the competition class at Hangzhou No.2 High School, where I made friends with many talented students and inspiring teachers.
3. It is worth mentioning that Rui is my best friend and has motivated me forward for over ten years as my role model. Best wishes and good luck!
|
|