Cosmos-H-Surgical

A surgical video world foundation model suite based on NVIDIA Cosmos and SurgWorld, part of the NVIDIA MedTech Open Models.

Overview

Cosmos-H-Surgical delivers high-quality video prediction and transfer for surgical scenes, including future-state simulation and control-conditioned generation across modalities. It comprises two sub-projects — Predict for image-to-video generation and Transfer for multi-modal control-based generation — both adapted from NVIDIA Cosmos 2.5 for surgical video data. For action-conditioned surgical simulation with robot kinematics, see the companion repo Cosmos-H-Surgical-Simulator.

This project was conducted by NVIDIA in collaboration with Chinese University of Hong Kong, National University of Singapore, and Shanghai Jiao Tong University.

News

[March 2026] — Released SurgΣ: a large-scale multimodal surgical dataset and foundation model suite for surgical intelligence.
[March 2026] — Released BSA: generalized recognition of basic surgical actions enabling skill assessment and VLM-based surgical planning.
[March 2026] — Released Cosmos-H-Surgical-Predict and Cosmos-H-Surgical-Transfer as part of the NVIDIA MedTech Open Models.

Model Variants

Model	Base Model	Params	Capability	Input	HuggingFace	License
Cosmos-H-Surgical-Predict	Cosmos-Predict2.5-2B	2B	Future-state video prediction (Image2World)	text + image	Weights	NVIDIA-OneWay-Noncommercial-License
Cosmos-H-Surgical-Transfer	Cosmos-Transfer2.5-2B	2B	Control-conditioned generation (depth, edge, seg, blur)	text + video + control maps	Weights	NVIDIA-OneWay-Noncommercial-License

Repository Structure

Cosmos-H-Surgical/
├── predict/                          # Cosmos-H-Surgical-Predict
│   ├── cosmos_predict2/              # Predict package (experiments, datasets)
│   ├── packages/                     # Workspace deps (cosmos-oss, cosmos-cuda, cosmos-gradio)
│   ├── docs/                         # Setup, inference, post-training
│   ├── examples/                     # Inference scripts
│   ├── assets/                       # Example inputs (JSON configs, images)
│   ├── Dockerfile
│   └── pyproject.toml
├── transfer/                         # Cosmos-H-Surgical-Transfer
│   ├── cosmos_transfer2/             # Transfer package
│   ├── packages/                     # Workspace deps
│   ├── docs/                         # Setup, inference, post-training
│   ├── examples/                     # Inference scripts
│   ├── assets/                       # Example inputs (JSON specs, control maps)
│   ├── Dockerfile
│   └── pyproject.toml
├── LICENSE                           # Apache 2.0 (source code)
└── LICENSE.weights                   # NVIDIA-OneWay-Noncommercial-License License (weights)

Quick Start

System Requirements: NVIDIA Ampere+ GPU (A100, H100, B200), Linux x86-64, CUDA 12.8+, Python 3.10+

Install

git clone git@github.com:NVIDIA-Medtech/Cosmos-H-Surgical.git
cd Cosmos-H-Surgical
git lfs pull

Each sub-project has its own environment. For example, to set up Predict:

cd predict
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install && uv sync --extra=cu128
source .venv/bin/activate

See predict/docs/setup.md or transfer/docs/setup.md for full instructions including Docker.

Inference

Predict (Image2World):

cd predict
python examples/inference.py -i assets/base/coagulation.json -o outputs/base_video2world --inference-type=video2world

Transfer (control-conditioned):

cd transfer
python examples/inference.py -i assets/coagulation_example/depth/coagulation_depth_spec.json -o outputs/depth

Documentation

Topic	Predict	Transfer
Setup	predict/docs/setup.md	transfer/docs/setup.md
Inference	predict/docs/inference.md	transfer/docs/inference.md
Post-training	predict/docs/post-training.md	transfer/docs/post-training.md
Troubleshooting	predict/docs/troubleshooting.md	transfer/docs/troubleshooting.md

Performance

Transfer Inference (Segmentation control, 720p 16FPS, 93 frames, 65.4 GB VRAM)

GPU Hardware	Generation Time	End-to-End Time
NVIDIA B200	92.25 sec	186.92 sec
NVIDIA H100 NVL	445.52 sec	895.33 sec
NVIDIA H100 PCIe	264.13 sec	533.58 sec
NVIDIA H20	683.65 sec	1370.39 sec

End-to-end time measured for 121-frame input video (two 93-frame chunk generations). Guardrails disabled.

License

Component	License
Source code	Apache 2.0
Cosmos-H-Surgical-Predict weights	NVIDIA-OneWay-Noncommercial-License
Cosmos-H-Surgical-Transfer weights	NVIDIA-OneWay-Noncommercial-License

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

Citation

@misc{he2026cosmoshsurgicallearningsurgicalrobot,
  title={Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling},
  author={Yufan He and Pengfei Guo and Mengya Xu and Zhaoshuo Li and Andriy Myronenko and Dillan Imans and Bingjie Liu and Dongren Yang and Mingxue Gu and Yongnan Ji and Yueming Jin and Ren Zhao and Baiyong Shen and Daguang Xu},
  year={2026},
  eprint={2512.23162},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.23162},
}

@misc{zeng2026surgsigma,
  title={Surg$\Sigma$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence},
  author={Zhitao Zeng and Mengya Xu and Jian Jiang and Pengfei Guo and Yunqiu Xu and Zhu Zhuo and Chang Han Low and Yufan He and Dong Yang and Chenxi Lin and Yiming Gu and Jiaxin Guo and Yutong Ban and Daguang Xu and Qi Dou and Yueming Jin},
  year={2026},
  eprint={2603.16822},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.16822},
}

@misc{xu2026generalizedrecognitionbasicsurgicalactions,
  title={Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning},
  author={Mengya Xu and Daiyun Shen and Jie Zhang and Hon Chi Yip and Yujia Gao and Cheng Chen and Dillan Imans and Yonghao Long and Yiru Ye and Yixiao Liu and Rongyun Mai and Kai Chen and Hongliang Ren and Yutong Ban and Guangsuo Wang and Francis Wong and Chi-Fai Ng and Kee Yuan Ngiam and Russell H. Taylor and Daguang Xu and Yueming Jin and Qi Dou},
  year={2026},
  eprint={2603.12787},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.12787},
}

Resources

Cosmos-H-Surgical Paper (arXiv)
Basic Surgical Actions Dataset Paper (arXiv)
HuggingFace Collection
NVIDIA Cosmos Platform
Cosmos-H-Surgical-Simulator — Sister repo (action-conditioned surgical simulation)
NVIDIA MedTech Open Models

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
predict		predict
transfer		transfer
.gitignore		.gitignore
LICENSE		LICENSE
LICENSE.weights		LICENSE.weights
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cosmos-H-Surgical

Overview

News

Model Variants

Repository Structure

Quick Start

Install

Inference

Documentation

Performance

Transfer Inference (Segmentation control, 720p 16FPS, 93 frames, 65.4 GB VRAM)

License

Citation

Resources

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cosmos-H-Surgical

Overview

News

Model Variants

Repository Structure

Quick Start

Install

Inference

Documentation

Performance

Transfer Inference (Segmentation control, 720p 16FPS, 93 frames, 65.4 GB VRAM)

License

Citation

Resources

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages