Computer vision methods under rapid evolution for pathology image tasks

Nigel Maher; Richard A Scolyer; Sidong Liu

doi:10.1111/his.15352

Abstract

There have been rapid advances in the artificial intelligence (AI) pathology field within the last decade, evidenced by the exponential rise in publications in this field from 2016 [1], bolstered by the Food and Drug Administration (FDA) approval in 2017 of the first digital whole-slide scanner (https://www.fda.gov/news-events/press-announcements/fda-allows-marketing-first-whole-slide-imaging-system-digital-pathology, accessed 13 June 2024). While there is immense anticipation for the potential roles and benefits of using AI in pathology, there is also a healthy amount of apprehension concerning the potential for errors, which was well discussed by Evans and Snead in their recent Histopathology paper [2]. We believe that recent methodological advances in computer vision relevant to histopathology will help to minimise these AI-related errors by making pathologist involvement more feasible, facilitating the use of large data sets, reducing annotation labour and providing generalisable foundation models that can be readily adapted to specific tasks. It is therefore important that pathologists are not only aware of these leading methods, but also understand the concepts behind them.

Until recently, the majority of deep-learning classifiers for histopathology tasks have been convolutional neural networks (CNNs) trained on slide annotations ('patch labelling') performed by pathologists (i.e. fully supervised learning). CNNs have demonstrated excellent performance for image segmentation (i.e. delineation of structures within images) and classification tasks [3, 4]. The basic architecture of a CNN is (i) a convolutional layer, (ii) a pooling layer and (iii) a fully connected layer (Figure 1). Convolutional and pooling layers are typically repeated in tandem. The convolutional layer extracts features from the image (e.g.
edges and shapes) by passing a filter matrix over the image and multiplying it with the underlying pixel values. This generates a new grid of numbers, referred to as a feature map, on the basis that colours in images can be represented by numbers. The pooling layer formulaically reduces the dimensionality (the number of outputs) of the convolutional layer, while the fully connected layer integrates all the outputs from the previous layer, before a final layer makes the classification prediction based on these distilled features extracted from the original image.

CNN architectures (e.g. ResNet, Inception, VGG) have typically been pretrained on large, domain-agnostic image data sets (e.g. animals and other miscellaneous objects in the ImageNet data set [5]) before being trained for histopathology tasks. This pretraining helps CNNs to efficiently extract relevant features from histopathology images in the convolutional layers. Nonetheless, there is usually still a need to use hundreds of annotated pathology images to train a task-specific classifier with this approach. While expertly annotated histopathology data are critical as ground truth in training and testing, they are also prone to bias and error [2] and require significant amounts of labour time.

To reduce the dependence on large volumes of annotations, and to provide better out-of-the-box performance for a variety of histopathology image analysis tasks, significant recent efforts have been made to pretrain deep learning models using large volumes of histopathology images [6-9]. To achieve pretraining on large-volume histopathology data sets that may exceed a million whole slide images (WSIs) [10], where pathologist annotation is not realistically feasible, self-supervised learning techniques have been adopted to learn features from the images.
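To make the convolution and pooling steps of a CNN described earlier more concrete, the following is a minimal sketch in plain Python. The function names `convolve2d` and `max_pool2d`, the toy 5 x 5 image and the hand-written edge filter are illustrative assumptions; in a real CNN the filter values are learned during training, and many filters and layers are stacked.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    return the resulting grid of numbers, i.e. the feature map.
    As in most deep learning libraries, the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size window,
    reducing the number of outputs passed to the next layer."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy greyscale image: dark on the left (0), bright on the right (9).
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]

# Hypothetical hand-written filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)  # 4 x 4 grid, large only at the edge
pooled = max_pool2d(feature_map)         # 2 x 2 grid, edge response retained
```

In this toy case the feature map is non-zero only in the column where the dark and bright regions meet, and pooling shrinks the 4 x 4 map to 2 x 2 while retaining that response.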
Self-supervised learning does not require pathologist involvement; instead, such models generate supervisory signals from the image itself, for example by contrastive learning (learning features that distinguish like from unlike patch pairs within the image) or by reconstructive learning (masking out patches within the image to use as the ground truth, while the model learns to fill in the masked patches appropriately) [11]. Self-supervised learning has been augmented in some cases by multimodal data (such as pathology reports) to bolster supervision and potentially enhance downstream applications, such as image-language/language-image conversions [8, 9, 12].

More recent self-supervised learning approaches have tended to adopt vision transformer (ViT) models (e.g. using DINOv2) in preference to exclusively CNN-based models [13]. Akin to how popular natural language processing models understand the importance of words and word order within a sentence, ViTs break up images into patches (the words), flatten them out into one dimension (akin to forming a sentence) and provide them with positional information (denoting the order of words within the sentence) (Figure 2) [14]. ViTs utilise a transformer architecture with attention mechanisms to understand how the image patches relate to each other within the global context (i.e. in the ‘sentence’ and ‘paragraph’) [14], in contrast to CNNs, which hierarchically extract features from images without taking into account spatially distant image features (Figure 1).

With the relatively newfound ability to successfully pretrain models on large volumes of WSIs from a variety of organ systems, computational pathology foundation models have emerged.
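The patch-to-sequence step of a ViT described above can be sketched as follows. This is a simplified illustration only: the function name `image_to_patch_sequence` and the toy 4 x 4 image are assumptions, and the position is stored here as a plain integer index, whereas real ViTs project each flattened patch through a learned layer and add a positional embedding before the attention layers.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D image into non-overlapping square patches, flatten each
    patch into one dimension and attach its position in the sequence."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    sequence = []
    position = 0
    # Read patches left to right, top to bottom, like words in a sentence.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat_patch = [image[top + r][left + c]
                          for r in range(patch_size)
                          for c in range(patch_size)]
            sequence.append((position, flat_patch))
            position += 1
    return sequence


# Toy 4 x 4 "image" whose pixel values make the patch layout easy to follow.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

tokens = image_to_patch_sequence(image, patch_size=2)
for position, flat_patch in tokens:
    print(position, flat_patch)
```

Each `(position, flat_patch)` pair plays the role of one 'word' with its place in the 'sentence'; the attention mechanism then relates every such token to every other token, regardless of how far apart the patches are in the image.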
Foundation models represent a game-changing leap forward: because these pathology domain-specific, task-agnostic models are able to extract meaningful features from histopathology images, they need fewer task-specific training examples and can be more readily adapted for task-specific functions [7]. This is appealing to pathologists for numerous reasons, including reduced training labour, overcoming scarcity of quality data and being readily able to adapt models to changing clinical needs and treatment advances.

In 2024, two WSI foundation models were announced (Prov-GigaPath [9] and PRISM [8]), in addition to advanced pretrained frameworks for building other WSI foundation models (TANGLE [15] and PANTHER [16]). WSI foundation models have been trained to extract features at the WSI level (i.e. compact representations of the WSI), which is of value for providing slide-level diagnoses, as opposed to earlier pathology foundation models that only extract features from patches/regions of interest on the slide. Enticingly, PRISM without task-specific training slightly outperformed a fully supervised, non-pretrained model for differentiating invasive lobular carcinoma from invasive ductal carcinoma of the breast, as well as ductal carcinoma in situ from invasive ductal carcinoma [8]. Helpfully, the code for Prov-GigaPath [9], TANGLE [15], PANTHER [16] and PRISM [8] is publicly available via GitHub or Hugging Face repositories, facilitating community evaluation and usage.

Additionally, multiple-instance learning (MIL) frameworks have gained rapid popularity over the last few years for analysing weakly supervised digital pathology data (e.g.
slides labelled only at the WSI level without annotation) [17]. With MIL, various groups of patches (instances/focal regions within the image) are selected into different bags from each slide; features are then extracted from the patches (typically using a CNN, and more recently a pathology foundation model) and pooled together to provide a bag-level representation, which is then used to assign a probability of that bag corresponding to the WSI label (Figure 3) [17]. A variety of MIL frameworks have been developed; they differ in how patches are selected from the WSI, how patch features are represented in the bags, and the techniques used for pooling and feature extraction [17]. Non-WSI pathology foundation models can be used within MIL frameworks to perform WSI-level classification tasks, and they boost the performance of MIL models compared with feature extractors pretrained on non-pathology images [7]. Importantly, by not requiring image annotation, MIL facilitates the use of larger data sets, an essential feature for building clinically reliable AI classifiers. The main disadvantages of weakly supervised MIL-based approaches compared with a fully supervised approach are the need for much more training data [18] and the limited explainability of their predictions (MIL heat-maps generally show where the model placed its attention within the image, rather than the tumour/feature probabilities for areas within the image that are generated by models trained on annotated slide data).

As well as MIL and foundation models, innovations in annotation techniques to reduce the labour burden on pathologists include prototype learning to extrapolate annotations automatically based on an initial annotation set, multiplex immunohistochemistry for cell phenotyping and spatially resolved genomic assays.
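The MIL pooling step described earlier (patch features pooled into a bag-level representation that is then scored against the slide label) can be illustrated with a minimal sketch. All names and numbers below are hypothetical: the two-dimensional patch features stand in for the output of a feature extractor, the weights are hand-picked rather than learned, and the attention scoring is a simplified linear form rather than that of any specific published framework.

```python
import math


def mean_pool(bag):
    """Average the patch feature vectors in a bag (simplest pooling)."""
    n, dim = len(bag), len(bag[0])
    return [sum(patch[d] for patch in bag) / n for d in range(dim)]


def attention_pool(bag, attention_vector):
    """Weight each patch by a softmax over its attention score, so that
    informative patches dominate the bag-level representation."""
    scores = [sum(x * w for x, w in zip(patch, attention_vector)) for patch in bag]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(bag[0])
    pooled = [sum(w * patch[d] for w, patch in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


def bag_probability(pooled, classifier_weights, bias):
    """Logistic classifier on the bag-level representation."""
    logit = sum(x * w for x, w in zip(pooled, classifier_weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# One bag of three patches; only the second has strong (tumour-like) features.
bag = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]

# Hand-picked, hypothetical parameters; in practice these are learned
# from slides labelled only at the WSI level.
attention_vector = [2.0, 2.0]
classifier_weights = [3.0, 3.0]
bias = -2.0

attn_pooled, weights = attention_pool(bag, attention_vector)
p_attention = bag_probability(attn_pooled, classifier_weights, bias)
p_mean = bag_probability(mean_pool(bag), classifier_weights, bias)
print("attention weights per patch:", [round(w, 3) for w in weights])
print("bag probability (attention / mean):", round(p_attention, 3), round(p_mean, 3))
```

The per-patch attention weights are what MIL heat-maps typically display: they indicate which patches the model relied on, not a calibrated tumour probability for each patch.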
Multiplex immunohistochemistry and spatially resolved genomic assays are also objective labelling techniques that may offer high-precision labelling within the image, but unfortunately they still require intensive scientific labour and expensive equipment, and are impractical to apply to large numbers of WSIs.

Finally, pathologist engagement with AI-based pathology tools is also set to change. As well as commercial solutions, freely available, more user-friendly software is being developed that enables pathologists to engage more easily with training AI classifiers, such as end-to-end packages (e.g. that include stain colour calibration, artefact processing, framework selection and explainability outputs) [19], and graphical user interfaces that reduce reliance upon computer programming expertise. Furthermore, PathChat, the first pathology AI copilot, was published in June 2024; it enables open-ended question-and-answer in text or visual format [20]. This represents a domain-specific evolution from agnostic AI copilot models such as ChatGPT, which was recently evaluated in Histopathology by Oon et al. [21].

Going forward, we expect that the methodology used to develop AI-based histopathology classifiers will look dramatically different, with reduced barriers for participation by pathologists. It will be interesting to observe the real-world performance and requirements of AI-based classifiers trained using pathology foundation models across geographic populations and tumour/disease types. Adherence to appropriate reporting guidelines for AI studies will be important for this purpose [22]. Of course, many questions remain and new questions arise. For example, does the number of WSIs and relative contribution from different organ systems or pathologies used for foundation model development affect performance when applied to specific tasks?
Recent research benchmarking pathology foundation models offers some insight into this already [13, 23].

We are indeed in an era of significant change for histopathology, and keeping updated with the progress of AI methods for histopathology tasks will not only identify opportunities for new research projects and greater involvement from pathologists, but also allow for greater participation from pathologists in its regulation and in the minimisation of AI-related errors.

N.M. is a recipient of a Postgraduate Research Fellowship 2022 from the Royal College of Pathologists of Australasia Foundation, a Research Training Program stipend scholarship from the University of Sydney/Australian Federal Government, a Melanoma Institute Australia Postgraduate Research scholarship and an Australian Melanoma Research Foundation research grant. R.A.S. is supported by an NHMRC Investigator Grant. S.L. is supported by an NHMRC Ideas Grant. We thank Dr Ismael Vergara for his proofreading of this manuscript. Support from colleagues at Melanoma Institute Australia, Royal Prince Alfred Hospital and the Australian Institute of Health Innovation is also gratefully acknowledged.

R.A.S. has received fees for professional services from F. Hoffmann-La Roche Ltd, Evaxion, Provectus Biopharmaceuticals Australia, Qbiotics, Novartis, MSD Sharp & Dohme, NeraCare, AMGEN Inc., Bristol-Myers Squibb, Myriad Genetics, GlaxoSmithKline, SkylineDx BV, IO Biotech ApS and MetaOptima Technology Inc. N.M. and S.L.
declare no financial or non-financial competing interests.

Data sharing is not applicable to this article as no data sets were generated or analysed during the current study.
