Deep ensembles and data augmentation for semantic segmentation
Loreggia A.;
2023-01-01
Abstract
The task of classifying each pixel in an image is known as semantic segmentation in the context of computer vision, and it is critical for image analysis in many domains. Semantic segmentation, for example, is required in clinical practice to improve accuracy in identifying potential pathologies, such as polyp segmentation, which provides critical information for detecting colorectal cancer in its early stages. Autoencoder architectures that learn low-level semantic descriptions of an image are commonly used for semantic segmentation. This architecture is made up of an encoder module that generates low-level data representations, which are then used by a second module (the decoder) that learns to reconstruct the original input. We tackle the semantic segmentation problem in this chapter by constructing a novel ensemble of convolutional neural networks (CNNs) and transformers. An ensemble is a machine learning method that trains different models to make predictions on a given input and then aggregates these predictions to compute a final decision. We enforce ensemble diversity by experimenting with various loss functions and data augmentation approaches. We combine DeepLabV3+, the HarDNet-MSEG CNN, and Pyramid Vision Transformers to create the proposed ensemble. We present a thorough empirical analysis of our system on three semantic segmentation problems: polyp detection, skin detection, and leukocyte recognition. Experiments show that our method produces state-of-the-art results.
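As a rough illustrative sketch (not code from the chapter itself), the aggregation step the abstract describes, combining the predictions of several segmentation models into one final decision, can be implemented by averaging the per-pixel probability maps and thresholding the mean. The function name, the simple averaging rule, and the 0.5 threshold are all assumptions for illustration:

```python
import numpy as np

def ensemble_masks(prob_maps, threshold=0.5):
    """Average per-model probability maps (each H x W, values in [0, 1])
    and threshold the mean to obtain a binary segmentation mask."""
    mean_probs = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_probs > threshold).astype(np.uint8)

# Toy example: three 2x2 "probability maps" standing in for the outputs
# of DeepLabV3+, HarDNet-MSEG, and a Pyramid Vision Transformer.
preds = [
    np.array([[0.9, 0.2], [0.4, 0.8]]),
    np.array([[0.8, 0.1], [0.6, 0.7]]),
    np.array([[0.7, 0.3], [0.2, 0.9]]),
]
mask = ensemble_masks(preds)  # a pixel is foreground iff its mean prob > 0.5
```

Averaging probabilities (rather than majority-voting hard masks) lets a confident model outvote two uncertain ones, which is one common way such ensembles exploit model diversity.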