EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos (1602.03012v2)

Published 9 Feb 2016 in cs.CV

Abstract: Surgical workflow recognition has numerous potential medical applications, such as the automatic indexing of surgical video databases and the optimization of real-time operating room scheduling, among others. As a result, phase recognition has been studied in the context of several kinds of surgeries, such as cataract, neurological, and laparoscopic surgeries. In the literature, two types of features are typically used to perform this task: visual features and tool usage signals. However, the visual features used are mostly handcrafted. Furthermore, the tool usage signals are usually collected via a manual annotation process or by using additional equipment. In this paper, we propose a novel method for phase recognition that uses a convolutional neural network (CNN) to automatically learn features from cholecystectomy videos and that relies uniquely on visual information. In previous studies, it has been shown that the tool signals can provide valuable information in performing the phase recognition task. Thus, we present a novel CNN architecture, called EndoNet, that is designed to carry out the phase recognition and tool presence detection tasks in a multi-task manner. To the best of our knowledge, this is the first work proposing to use a CNN for multiple recognition tasks on laparoscopic videos. Extensive experimental comparisons to other methods show that EndoNet yields state-of-the-art results for both tasks.

Summary

The paper introduces EndoNet, a deep CNN that performs multi-task learning for simultaneous phase recognition and tool detection in laparoscopic videos.
It extends AlexNet with added fully-connected layers and weighted loss functions, eliminating the need for handcrafted features.
Experiments on the Cholec80 and EndoVis datasets demonstrate 81% mean AP in tool detection and robust phase recognition performance.

Introduction

This essay presents an expert overview of the paper titled "EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos" by Andru P. Twinanda et al. The paper proposes a novel deep learning approach for phase recognition and tool presence detection in laparoscopic cholecystectomy videos. The authors use a Convolutional Neural Network (CNN) architecture, named EndoNet, to learn visual features directly from surgical videos. The method relies only on visual information, so it needs neither handcrafted features nor the tool usage signals that conventional methods obtain through manual annotation or additional equipment.

Problem Statement and Motivation

Surgical workflow recognition is critical for several applications in the modern operating room (OR), including real-time monitoring, staff scheduling, and automatic indexing of surgical videos. Traditional approaches to phase recognition have relied on handcrafted visual features or tool usage signals, which are either manually annotated or obtained using external equipment. Such methods are not only labor-intensive but also prone to loss of potentially significant information during feature extraction.

Methodology

EndoNet Architecture

EndoNet is designed to perform multi-task learning, integrating both phase recognition and tool presence detection. It extends the AlexNet architecture and introduces additional fully-connected layers to carry out both tasks simultaneously. Specifically, the part of the network inherited from AlexNet comprises five convolutional layers followed by two fully-connected layers, and the added layers produce the tool and phase predictions. Because the network is trained to detect tool presence alongside the phase, it generates more discriminative features for phase recognition than a network trained on phases alone.
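The data flow of such a multi-task head can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the shared feature vector is shrunk to 16 values so the example runs instantly, the weights are random, the count of seven phases is assumed (the seven tool categories are mentioned in the results), and feeding the tool scores into the phase layer is one plausible wiring rather than a confirmed reproduction of the authors' network. The names `linear`, `init_layer`, and `multitask_head` are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Random small weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def multitask_head(shared, tool_layer, phase_layer):
    """Return (tool_scores, phase_scores) computed from one shared feature vector."""
    tool_scores = linear(shared, *tool_layer)
    # Assumption: the phase layer sees the shared features plus the tool scores.
    phase_scores = linear(shared + tool_scores, *phase_layer)
    return tool_scores, phase_scores


rng = random.Random(0)
FEATURE_DIM, NUM_TOOLS, NUM_PHASES = 16, 7, 7  # toy feature size; phase count assumed

tool_layer = init_layer(NUM_TOOLS, FEATURE_DIM, rng)
phase_layer = init_layer(NUM_PHASES, FEATURE_DIM + NUM_TOOLS, rng)

shared = [rng.random() for _ in range(FEATURE_DIM)]  # stand-in for CNN features
tool_scores, phase_scores = multitask_head(shared, tool_layer, phase_layer)
print(len(tool_scores), len(phase_scores))
```

The point of the sketch is that both outputs depend on the same shared features, so gradients from either task would shape the representation used by the other.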

Training and Loss Functions

The authors employ a fine-tuning approach on a pre-trained AlexNet model using a dataset of cholecystectomy videos from the University Hospital of Strasbourg. The training objective optimizes two loss functions: the cross-entropy loss for tool presence detection and the softmax multinomial logistic loss for phase recognition. The final loss is a weighted sum of these two losses, allowing EndoNet to effectively learn features pertinent to both tasks.
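A minimal sketch of such a weighted two-task objective follows. It assumes tool presence is treated as one independent binary decision per tool (a sigmoid cross-entropy) and that the phase is a single-label choice (a softmax cross-entropy). The function names and the `phase_weight` parameter are hypothetical, and the weight value used by the authors is not stated here.

```python
import math


def sigmoid_cross_entropy(scores, labels):
    """Sum of binary cross-entropies, one per tool, in a numerically stable form."""
    total = 0.0
    for s, y in zip(scores, labels):
        # Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(s).
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total


def softmax_cross_entropy(scores, target):
    """Negative log-probability of the target class under a softmax."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[target]


def multitask_loss(tool_scores, tool_labels, phase_scores, phase_target, phase_weight=1.0):
    """Weighted sum of the tool loss and the phase loss (weight is an assumption)."""
    tool_loss = sigmoid_cross_entropy(tool_scores, tool_labels)
    phase_loss = softmax_cross_entropy(phase_scores, phase_target)
    return tool_loss + phase_weight * phase_loss


# Uninformative scores: every tool probability is 0.5 and every phase is equally likely.
loss = multitask_loss([0.0] * 7, [1, 0, 0, 1, 0, 0, 0], [0.0] * 7, phase_target=2)
print(loss)
```

With all scores at zero, each tool contributes log 2 and the phase term contributes log 7, which gives a quick sanity check for an implementation.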

Experimental Setup

The paper utilizes a large dataset, Cholec80, consisting of 80 annotated laparoscopic cholecystectomy videos. The dataset is split into a fine-tuning subset and an evaluation subset. The authors also validate the generalizability of EndoNet using the EndoVis dataset from the MICCAI 2015 challenge, which contains seven additional cholecystectomy videos.

Experimental Results

Tool Presence Detection

EndoNet achieved a mean average precision (AP) of 81% for tool presence detection across seven tool categories, outperforming traditional Deformable Part Models (DPM) and a single-task CNN (ToolNet) architecture. Notably, the architecture performed well even for tools with limited training samples, indicating its robustness.
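For readers unfamiliar with the metric, the following sketch shows one common way to compute average precision for a single tool and then average it across tools. The exact AP protocol used in the paper may differ, and the scores and labels below are made-up toy values, not results from the paper.

```python
def average_precision(scores, labels):
    """AP for one tool: mean of precision values at the rank of each positive frame."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision is undefined without positive frames")
    return sum(precisions) / len(precisions)


def mean_average_precision(per_tool):
    """Mean of the per-tool APs; `per_tool` maps a tool name to (scores, labels)."""
    aps = [average_precision(s, y) for s, y in per_tool.values()]
    return sum(aps) / len(aps)


# Toy detections for two hypothetical tools (confidence per frame, ground truth per frame).
toy = {
    "tool_a": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    "tool_b": ([0.2, 0.9, 0.4], [0, 1, 1]),
}
mean_ap = mean_average_precision(toy)
print(mean_ap)
```

Because AP depends only on how frames are ranked, it rewards a detector that scores frames containing the tool above frames without it, regardless of any fixed decision threshold.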

Phase Recognition

Phase recognition results demonstrated the efficacy of EndoNet in both offline and online scenarios. EndoNet features, when used with a Hierarchical Hidden Markov Model (HHMM), yielded significant improvements in average precision, recall, and accuracy over handcrafted features, binary tool annotations, and features from single-task CNNs (PhaseNet). The results also indicated that incorporating tool presence detection into EndoNet facilitated the extraction of more discriminative features for phase recognition.
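To illustrate why a temporal model helps on top of per-frame predictions, here is a small sketch that swaps the HHMM for a plain hidden Markov model decoded with the Viterbi algorithm. This is a simplification, not the authors' method: the two-phase setup, the transition probabilities, and the per-frame probabilities are all invented, and the per-frame probabilities are used directly as emission scores.

```python
import math


def viterbi(frame_log_probs, log_transition, log_prior):
    """Most likely state sequence given per-frame log scores and a transition model."""
    n_states = len(log_prior)
    best = [log_prior[s] + frame_log_probs[0][s] for s in range(n_states)]
    back = []
    for obs in frame_log_probs[1:]:
        pointers, scores = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            prev = max(range(n_states), key=lambda p: best[p] + log_transition[p][s])
            pointers.append(prev)
            scores.append(best[prev] + log_transition[prev][s] + obs[s])
        back.append(pointers)
        best = scores
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(row):
    return [math.log(v) for v in row]


# Invented per-frame phase probabilities for a two-phase toy video.
frame_probs = [[0.8, 0.2], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
               [0.2, 0.8], [0.1, 0.9], [0.2, 0.8]]
transition = [[0.9, 0.1], [0.1, 0.9]]  # phases tend to persist between frames

per_frame = [max(range(2), key=lambda s: p[s]) for p in frame_probs]
smoothed = viterbi([logs(p) for p in frame_probs], [logs(r) for r in transition], logs([0.5, 0.5]))
print(per_frame)  # [0, 0, 1, 0, 1, 1, 1]
print(smoothed)   # [0, 0, 0, 0, 1, 1, 1]
```

The isolated flip at the third frame disappears after decoding because a brief switch to the other phase and back is less likely than one noisy frame. Viterbi decoding uses the whole sequence, so it corresponds to an offline setting; an online variant would have to rely only on the frames seen so far.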

Practical Implications and Future Directions

The authors highlight two primary applications of EndoNet: automatic surgical video database indexing and detection of potential complications. The performance metrics for phase boundary detection indicate that EndoNet can significantly reduce the manual effort required for surgical video annotation. Further, the tool presence detection capability of EndoNet, particularly for critical tools like the clipper and bipolar, showcases its potential in preemptively identifying phases and alerting clinicians to possible complications.

EndoNet sets a strong foundation for future developments in AI-powered surgical workflow analysis. Potential advancements could involve integrating Long Short-Term Memory (LSTM) networks to capture temporal dependencies directly within the network architecture, thus eliminating the need for a separate temporal model such as the HHMM.

Conclusion

The comprehensive experiments and robust results presented in this paper underscore the potential of deep learning techniques, particularly CNNs, in enhancing surgical phase recognition and tool presence detection. EndoNet not only addresses the limitations of handcrafted features and manual annotations but also establishes a scalable and generalizable approach to automated surgical workflow analysis. This work represents a significant step toward the development of intelligent OR systems that can offer real-time assistance, improve surgical efficiency, and enhance patient outcomes.
