A Multi-task model (MTM) learns specific features using shared and task specific layers among different tasks, an approach that turned out to be effective in those tasks where limited data is available to train the model. In this research, we utilize this characteristic of MTM using knowledge distillation to enhance the performance of a single task model (STM). STMs have difficulties in learning complex feature representations from a limited amount of annotated data. Distilling knowledge from MTM will help STM to learn more complex feature representations during the training phase. We use feature representations from different layers of a MTM to teach the student model during its training. Our approach shows distinguishable improvements in terms of F1-score with respect to STM. We further performed a statistical analysis to investigate the effect of different teacher models on different student models. We found that a Softmax-based teacher model is more effective for token level knowledge distillation than a CRF-based teacher model.
Knowledge Distillation with Teacher Multi-task Model for Biomedical Named Entity Recognition
Mehmood T.;Serina I.;Gerevini A.
2021-01-01
Abstract
A Multi-task model (MTM) learns specific features using shared and task specific layers among different tasks, an approach that turned out to be effective in those tasks where limited data is available to train the model. In this research, we utilize this characteristic of MTM using knowledge distillation to enhance the performance of a single task model (STM). STMs have difficulties in learning complex feature representations from a limited amount of annotated data. Distilling knowledge from MTM will help STM to learn more complex feature representations during the training phase. We use feature representations from different layers of a MTM to teach the student model during its training. Our approach shows distinguishable improvements in terms of F1-score with respect to STM. We further performed a statistical analysis to investigate the effect of different teacher models on different student models. We found that a Softmax-based teacher model is more effective for token level knowledge distillation than a CRF-based teacher model.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.