Discovering Useful Sentence Representations from Large Pre-Trained Language Models
AI/ML Research at Scale AI:
Scale AI is dedicated to accelerating the development of AI applications. As
part of this mission, Scale AI’s growing AI/ML Team aims to meaningfully
contribute to the AI ecosystem with research and resources that can bring
significant and practical benefits to our customers and the broader research
community.
Research and experimentation happen throughout Scale. In some cases, our
contribution is in areas where the bar has already been set high. Some
examples include the open-source datasets we have released with various
partners such as PandaSet,
nuScenes, and
CoNLL Balanced. As our AI/ML team continues to grow, we aim to contribute in entirely new
spaces by publishing cutting-edge research papers. In this blog post, we
spotlight a research paper we recently submitted.
The research paper was a collaboration with
Arrive. Arrive links multi-omics and
AI cell characterization to discover hidden patterns in data and improve
outcomes in precision medicine.
Conditioning Unconditional Language Models to Recover Sentences without
Fine-Tuning:
Pretrained language models are used as encoders with widespread success for a
wide variety of NLP tasks. These models learn useful representations that can
be directly applied to certain tasks. Despite the success of these models as
encoders, there has been little research into whether large pretrained
language models can also be used as universal decoders. To be considered
"universal," a decoder must have an implicit representation for any sentence,
such that it can recover that sentence exactly when conditioned on its
representation.
Task-specific decoders are typically trained from scratch for each task. In
this research, we propose a method for learning useful representations of
sentences using large, pretrained, transformer-based decoders. Our work
investigates whether a universal decoder is even possible. If we had access to
such a decoder, we could use it for a wide variety of sequence generation
tasks without having to retrain or rebuild the decoder on new data. This could
open up a new paradigm for low-resource machine translation, summarization,
image captioning, and dialog generation.
The Sentence Space: How Language Models Represent Sentences
Pretrained language models like ELMo, BERT, and GPT-2 have replaced word
vectors as off-the-shelf, general-purpose encoders for natural language
understanding. These models are trained on large amounts of text and
represent a sentence as the sequence of hidden states produced by the model's
final layer. Representations in this sentence space depend on sequence
length, which makes comparisons between sentences of different lengths
inequitable and makes it impossible to measure how well an unconditional
language model works as a universal decoder.
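To make the length dependence concrete, here is a minimal sketch using the Hugging Face transformers library (our choice of tooling for illustration, not the paper's code): the final-layer hidden states contain one vector per token, so two sentences of different lengths live in differently sized spaces.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

for sentence in ["A short sentence.",
                 "A noticeably longer sentence with many more tokens in it."]:
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(input_ids).last_hidden_state  # one 768-d vector per token
    print(tuple(hidden.shape), "<-", sentence)       # (1, seq_len, 768)
```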
To resolve these issues and make analysis easier, we propose to
reparameterize the original sentence space into a lower-dimensional,
sentence-length-agnostic vector space. We do this by adding a bias term to
the fixed language model and finding the representation that minimizes the
cross-entropy loss of the sentence. This reparameterization gives us the
ability to project sentence tokens to representations (sentence encoding) and
to recover sentences from their representations (sentence recovery) via the
fixed language model. We develop three representation injection mechanisms
and inject biases at three different locations.
We add a bias Z’ to the embedding, the transformer layers, and before the language modeling head in GPT-2. “SA” refers to self-attention, “LN” to layer normalization, “FFN” to a fully connected layer, and “LM Head” to the last fully connected layer.
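As a rough illustration of the optimization behind this reparameterization, here is a hedged sketch using Hugging Face's GPT-2 with only the embedding-level injection (the variable names, learning rate, and number of steps are illustrative, not the authors' released implementation): the language model stays frozen, and a single bias vector z is trained to minimize the cross-entropy of the target sentence.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():                 # the language model stays frozen
    p.requires_grad_(False)

sentence = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(tokenizer.bos_token + sentence, return_tensors="pt").input_ids

# A single learnable bias vector, broadcast across all token positions.
z = torch.zeros(1, 1, model.config.n_embd, requires_grad=True)
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(500):
    token_embeds = model.transformer.wte(input_ids)   # (1, seq_len, 768)
    out = model(inputs_embeds=token_embeds + z,       # inject z at the embedding
                labels=input_ids)                     # cross-entropy of the sentence
    out.loss.backward()                               # gradients flow only into z
    optimizer.step()
    optimizer.zero_grad()
```

The same idea extends to the other two injection locations shown in the figure; only the point at which z enters the frozen network changes.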
Experiments:
Since we are testing the efficacy of a large pretrained language model as a
universal decoder, we measure how well a sentence can be recovered when its
representation is injected into the fixed language model. We measure
performance across sentences from four different genres: books, news
articles, Wikipedia, and
movie dialogs. We conduct controlled experiments, varying the representation
injection mechanism, representation injection location, initialization, and
dimensionality of the representation.
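To give a sense of what sentence recovery looks like in practice, here is a hedged continuation of the sketch above (it reuses model, tokenizer, z, sentence, and input_ids, and is not the paper's evaluation code): decode greedily from the frozen model with z injected, then score the reconstruction against the original sentence with BLEU.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu

generated = [tokenizer.bos_token_id]
with torch.no_grad():
    for _ in range(input_ids.shape[1] - 1):           # decode up to the original length
        embeds = model.transformer.wte(torch.tensor([generated])) + z
        next_id = int(model(inputs_embeds=embeds).logits[0, -1].argmax())
        generated.append(next_id)

recovered = tokenizer.decode(generated[1:])
bleu = sentence_bleu([sentence.split()], recovered.split())
print(f"recovered: {recovered!r}  BLEU: {bleu:.3f}")
```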
Results:
We find that our representations recover sentences nearly perfectly across
all four genres, achieving BLEU scores above 98.
BLEU score performance stratified by genre for different dimensionalities of
Z.
In addition, we observe that the intrinsic dimension of the representation
space is the model’s latent dimension, which indicates that the language
model uses its entire capacity to represent sentences.
Plot of sentence length vs. BLEU score on the dataset.
Additionally, our interpolation experiments reveal that the representation
space has some human-understandable meaning. Our learned representations
seem to have some synonym awareness. In the first sentence pair example
below, the word “tale” transforms to the word “story” and the word “long”
transforms to “long-running” when referring to a war. In the second example,
we observe some syntactic awareness at the 0.7 mixture level: the syntax of
the first sentence is retained, while most of the words come from the second
sentence.
Two linear interpolations between perfectly recovered pairs of
representations. Pink indicates token overlap with the first sentence, while
blue indicates token overlap with the second sentence.
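A hedged sketch of how such an interpolation probe might look (again not the paper's code; it assumes z_a and z_b are two representations learned with the optimization sketch above, and reuses the frozen model and tokenizer):

```python
import torch

def decode(z, max_len=30):
    """Greedily decode from the frozen GPT-2 with the bias z injected."""
    generated = [tokenizer.bos_token_id]
    with torch.no_grad():
        for _ in range(max_len):
            embeds = model.transformer.wte(torch.tensor([generated])) + z
            next_id = int(model(inputs_embeds=embeds).logits[0, -1].argmax())
            if next_id == tokenizer.eos_token_id:
                break
            generated.append(next_id)
    return tokenizer.decode(generated[1:])

# Linear mixtures between the two learned representations.
for alpha in (0.0, 0.3, 0.5, 0.7, 1.0):
    z_mix = (1 - alpha) * z_a + alpha * z_b
    print(f"{alpha:.1f}: {decode(z_mix)}")
```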
Broader Impact:
This work shows that we can discover meaningful, sentence-length-agnostic
sentence representations, hinting at the possibility of a
“universal” decoder. Such a decoder would improve low-resource sequence
generation task performance and allow for considerable parameter sharing in
memory and data-limited environments.
Conclusion:
For further details on the experimental setup, results, and analysis, you can
find the full paper on
arXiv. If you’re interested
in joining our growing machine learning team, you can find all of the open
positions on our careers page.