NIE Yingwang, WANG Lei, MEI Chenyang, et al. FR-PVT: A feature-refined pyramid vision transformer for accurate image segmentation[J]. JOURNAL OF WENZHOU MEDICAL UNIVERSITY, 2024, 54(8): 631-640.
Abstract: Objective: To accurately extract target regions in medical images used for morphological assessment and clinical disease monitoring, a hybrid network combining a convolutional neural network (CNN) and a Transformer was explored to simultaneously learn local and global information in images. Methods: ①A novel feature-refined segmentation network (FR-PVT) was developed by introducing a CNN-based decoder and integrating it with the pyramid vision transformer (PVT). The decoder, consisting of a feature refinement module (FRM), a context attention module (CAM), and a similarity aggregation module (SAM), was used to refine the multiscale global features captured by the PVT. ②To validate FR-PVT, it was used to segment polyps in five public colonoscopy image datasets (ClinicDB, ColonDB, EndoScene, ETIS, and KvasirSEG) and palpebral fissures in frame images from the eye videography dataset provided by the Eye Hospital of Wenzhou Medical University. ③The performance of FR-PVT was evaluated with four metrics: the Dice coefficient, intersection over union (IoU), Matthews correlation coefficient (MCC), and Hausdorff distance (Hdf). On the same segmentation tasks, FR-PVT was compared with existing networks (Polyp-PVT, U-Net, and several U-Net variants). Results: ①FR-PVT was able to handle colonoscopy images acquired under various imaging conditions and achieved average Dice coefficients of 0.937, 0.819, 0.892, 0.800, and 0.909 on the testing subsets of the ClinicDB, ColonDB, EndoScene, ETIS, and KvasirSEG datasets, respectively. ②On frame images from the eye videography dataset, FR-PVT obtained an average Dice, IoU, MCC, and Hdf of 0.966, 0.943, 0.957, and 4.706, respectively. ③Across the five polyp datasets, FR-PVT obtained an average Dice and IoU of 0.840 and 0.764, outperforming Polyp-PVT (0.834 and 0.760), U-Net (0.561 and 0.493), U-Net++ (0.546 and 0.476), SFA (0.476 and 0.367), and PraNet (0.741 and 0.675).
Conclusion: FR-PVT achieves better segmentation performance than Polyp-PVT and several existing CNN-based networks (such as U-Net and its variants).
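The overlap metrics reported above (Dice and IoU) can be computed directly from binary masks. The following is a minimal illustrative sketch, not the authors' implementation; the function name and the epsilon smoothing term are assumptions for illustration.

```python
import numpy as np

def dice_iou(pred, target, eps=1e-7):
    """Compute the Dice coefficient and IoU of two binary segmentation
    masks (illustrative sketch, not the paper's code)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Dice = 2|A∩B| / (|A| + |B|); IoU = |A∩B| / |A∪B|
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

# Toy example: two 2x2 masks overlapping in one pixel
p = np.array([[1, 1], [0, 0]])
t = np.array([[1, 0], [1, 0]])
d, i = dice_iou(p, t)  # dice = 0.5, iou = 1/3
```

The MCC and Hausdorff distance used in the paper can likewise be obtained from standard libraries (e.g. `sklearn.metrics.matthews_corrcoef` and `scipy.spatial.distance.directed_hausdorff`).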