
Deep Learning-Based Videomics for Automatic Segmentation in Endoscopic Endonasal Surgery

Agosti, Edoardo;Rampinelli, Vittorio;Fiorindi, Alessandro;Panciani, Pier Paolo;Fontanella, Marco Maria
2026-01-01

Abstract

AIM: Videomics, the application of deep learning (DL) to endoscopic video, enables real-time tissue segmentation and anatomical recognition. Within endoscopic endonasal approaches, these methods may improve intraoperative visualization, tumor delineation, and surgical precision. Despite growing interest, their translation into routine clinical practice is still limited and not yet fully characterized. This systematic review aimed to synthesize current evidence on DL-based segmentation in endoscopic endonasal surgery, focusing on model architectures, segmentation targets, and reported outcomes. METHODS: This review was conducted according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. A systematic search of PubMed, Scopus, and Web of Science was performed on 12 January 2025 and updated on 5 June 2025. Studies published between 2018 and 2025 were included, as no eligible studies were available prior to 2018. Studies were included if they involved human endoscopic endonasal procedures and applied DL techniques to endoscopic video for segmentation purposes. Data extraction included sample size, image resolution, annotated datasets, DL architectures, segmentation targets, and model performance metrics. Study quality was assessed using the Newcastle-Ottawa Scale, and descriptive statistics were used to summarize findings. RESULTS: Out of 223 screened articles, 28 studies met the inclusion criteria, encompassing 154,989 patients and 1,028,440 annotated images. The most common segmentation targets included nasal polyps (25%), nasopharyngeal carcinoma (21.4%), and pituitary adenomas (7.1%). ResNet and YOLO architectures were each used in 5 studies (17.9%), while transformer-based models such as Swin Transformer, NasVLM, and NaMA-Mamba were increasingly utilized in recent years.
Performance metrics were high across studies: area under the receiver operating characteristic curve (AUC-ROC) ranged from 87.4% to 99.2%, mean intersection over union (mIoU) from 61.2% to 81.7%, and mean average precision at an IoU threshold of 0.50 (mAP@0.50) from 53.4% to 94.9%. Inference times varied from 0.14 ms to 100 ms per image. However, only 35.7% of studies reported the segmentation tools used, and dataset heterogeneity was common. CONCLUSIONS: DL-based videomics demonstrates high segmentation accuracy across various pathologies and anatomical targets in endoscopic endonasal surgery. Models such as Swin Transformer and YOLO show potential for real-time surgical support. However, translation into clinical practice remains limited by dataset heterogeneity and variability in reporting.
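For readers unfamiliar with the mIoU figures reported above, the metric is the per-case intersection over union between a predicted and a ground-truth segmentation mask, averaged over cases. The sketch below is purely illustrative (it is not drawn from any of the reviewed studies) and uses flat binary masks represented as Python lists:

```python
# Illustrative sketch of mean intersection over union (mIoU) for binary
# segmentation masks. Not from the reviewed studies; toy masks only.

def iou(pred, truth):
    """IoU between two flat binary masks (equal-length lists of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, truth))
    union = sum(p | t for p, t in zip(pred, truth))
    return inter / union if union else 1.0  # both masks empty: perfect match

def mean_iou(pairs):
    """Average IoU over (prediction, ground-truth) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)

# Example: two 2x2 masks flattened to length-4 lists.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # intersection 1, union 2 -> IoU 0.5
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # identical masks         -> IoU 1.0
]
print(mean_iou(pairs))  # 0.75
```

The mAP@0.50 figures use this same IoU quantity as a detection threshold: a predicted region counts as a true positive when its IoU with a ground-truth region is at least 0.50, and precision is then averaged over recall levels.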
File: Videomics_Automatic_Segmentation_Endonasal_Surgery.pdf (Full Text, open access, license not specified, 1.94 MB, Adobe PDF)


Use this identifier to cite or link to this document: https://hdl.handle.net/11379/642125