On the Behaviour of BERT’s Attention for the Classification of Medical Reports

Putelli L.; Gerevini A. E.; Mehmood T.; Serina I.
2022-01-01

Abstract

Since BERT and other Transformer-based models have proved successful in many NLP tasks, several studies have been conducted to understand why these complex deep learning architectures reach such remarkable results. Such studies have focused on visualising and analysing the behaviour of each self-attention mechanism, and are often conducted on large, generic, annotated English-language datasets, using supervised probing tasks to test specific linguistic capabilities. However, several practical contexts present difficulties: probing tasks may not be available, the documents may use a strictly technical lexicon, and the datasets can be noisy. In this work we analyse the behaviour of BERT in a specific context, i.e., the classification of radiology reports collected from an Italian hospital. We propose (i) a simplified way to classify head patterns without relying on probing tasks or manual observations, and (ii) an algorithm for extracting the most relevant relations among words captured by each self-attention head. Combining these techniques with manual observations, we present several examples of linguistic information that can be extracted from BERT in our application.
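
To make contribution (ii) concrete, the sketch below shows how the most strongly attended word pairs can be read off a single self-attention head. This is not the authors' algorithm: it is a minimal illustration that assumes a HuggingFace-style BERT exposing attention weights via output_attentions=True, and it uses the generic bert-base-uncased checkpoint as a stand-in for the fine-tuned Italian clinical model described in the paper.

import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoint: the paper's fine-tuned Italian clinical model is not assumed public.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def top_relations(text, layer, head, k=5):
    """Return the k strongest (from-token, to-token, weight) relations
    captured by one self-attention head."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer
    attn = out.attentions[layer][0, head]  # (seq, seq) attention matrix for the chosen head
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    seq_len = attn.size(0)
    top = torch.topk(attn.flatten(), k)
    return [(tokens[i // seq_len], tokens[i % seq_len], round(w, 3))
            for i, w in zip(top.indices.tolist(), top.values.tolist())]

# Example: inspect one head on a report-like sentence
print(top_relations("No pleural effusion or pneumothorax is observed.", layer=2, head=3))

In practice, heads that attend mostly to [CLS], [SEP], or punctuation would dominate such a ranking, which is one reason the paper combines automatic extraction with manual observation; filtering special tokens before ranking is a common refinement.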
Files for this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11379/579369

Citations
  • PMC: not available
  • Scopus: 5
  • Web of Science: not available