SenseBERT: Driving some sense into BERT
We enrich the self-supervision strategy employed in BERT by applying self-supervision on the Word-Sense level. Our model, SenseBERT, achieves significantly improved lexical disambiguation abilities, setting state-of-the-art results on the Word in Context (WiC) task.
Self-supervision for Pre-trained Language Models
The use of self-supervision has allowed neural language models to advance the frontier in Natural Language Understanding. Specifically, the self-supervision strategy employed in BERT, which involves masking some of the words in an input sentence and training the model to predict them given their context, results in a versatile language model suited for many tasks. Nevertheless, existing self-supervision methods operate on the word-form level, and as such, they are inherently limited. Take the word ‘bass’ as an example: in one context it can refer to a fish, in a second to a guitar, in the third to a type of singer, and so on. The word itself is merely a surrogate of its actual meaning in a given context - referred to as its sense.
Word-Sense predictions made by SenseBERT on a raw input text.
The word "Bass" maps to its different supersenses according to context.
SenseBERT - Pre-training on Word Senses
We propose a method to employ self-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words (as in BERT) but also their WordNet supersenses. BERT’s current word-form level objective leads to what is commonly referred to as a language model: the model predicts words for the given masked position in a sentence. Our added task allows SenseBERT to gather word-sense level statistics and thereby we effectively train a semantic-level language model that predicts the missing word’s meaning jointly with the standard word-form level prediction.
Self-supervision allows neural language models to learn from massive amounts of unannotated text. In order to enjoy this advantage at the lexical-semantic level, we make use of WordNet - an expert-constructed ontology that provides an inventory of word senses. We focus on a coarse-grained variant of word senses, referred to as supersenses, divided into 45 types that semantically categorize the entire language.
Labeling of words with a single supersense, for example the word ‘sword’ which has only the supersense ‘artifact’, is straightforward: We train the network to predict this supersense given the masked word’s context. As for words with multiple supersenses (like the word ‘bass’ we discussed earlier), we train the model to predict any of these senses, leading to a simple yet effective labeling scheme.
Other than our additional pre-training task and the architecture changes that support it, we followed the architecture and training methods published in the BERT paper. Specifically, we trained models in two sizes, similar to BERT model sizes - SenseBERTBASE and SenseBERTLARGE. Our results below show significantly enhanced word meaning understanding abilities of SenseBERT, achieved without human annotation.
Word-Sense predictions made by SenseBERT on the masked word in the sentence.
In order to test our model’s lexical semantic abilities, we compared it with BERT on two tasks:
1. SemEval Word Superense Disambiguation
We constructed a supersense variant of the SemEval-based WSD task, where the goal is to predict a word’s supersense given context.
We trained and tested in two different schemes:
A “Frozen” model with a linear classifier on top (to explore the sense information captured in the model’s pre-trained embeddings).
A “Fine-Tuned” model, where the network weights have been modified during training.
SenseBERTBASE outscores both BERTBASE and BERTLARGE models by a large margin in both cases, and SenseBERTLARGE yields even higher results.
2. Word in Context (WiC)
The Word in Context (WiC) task, from the SuperGLUE benchmark, is the task of predicting whether a word has the same meaning in two given sentences. This task heavily depends on word-supersense awareness.
SenseBERTBASE surpasses BERTLARGE, and a single SenseBERTLARGE model achieves state-of-the-art performance on this task with a score of 72.14.