Discovering Useful Sentence Representations from Large Pre-Trained Language Models

AI/ML Research at Scale AI:

 

Scale AI is dedicated to accelerating the development of AI applications. As

part of this mission, Scale AI’s growing AI/ML Team aims to meaningfully

contribute to the AI ecosystem with research and resources that can bring

significant and practical benefits to our customers and the broader research

community.

 

Research and experimentation happen throughout Scale. In some cases, our

contribution is in areas where the bar has already been set high. Some

examples include the open-source datasets we have released with various

partners such as PandaSet,

nuScenes, and

CoNLL Balanced. As our AI/ML team continues to grow, we aim to contribute in entirely new

spaces by publishing cutting-edge research papers. In this blog post, we

spotlight a research paper we recently submitted.

 

The research paper was a collaboration with

Nivedita Suresh of

Arrive. Arrive links multi-omics and

AI cell characterization to discover hidden patterns in data and improve

outcomes in precision medicine.

 

Conditioning Unconditional Language Models to Recover Sentences without

Fine-Tuning:

 

Pretrained language models are used as encoders with widespread success for a

wide variety of NLP tasks. These models learn useful representations that can

be directly applied to certain tasks. Despite the success of these models as

encoders, little research has examined whether these large

pretrained language models can be used as universal decoders. To be considered

"universal," a decoder must have an implicit representation for any sentence,

such that it can recover that sentence exactly when conditioned on its

representation.

 

Task-specific decoders are typically trained from scratch for each task. In

this research, we propose a method for learning useful representations of

sentences using large, pretrained, transformer-based decoders. Our work

investigates whether a universal decoder is even possible. If we had access to

such a decoder, we could use it for a wide variety of sequence generation

tasks without having to retrain or rebuild the decoder on new data. This could

open up a new paradigm for low-resource machine translation, summarization,

image captioning, and dialog generation.

 

The Sentence Space: How Language Models Represent Sentences

 

Pretrained language models like ELMo, BERT, and GPT-2 have replaced

word vectors as off-the-shelf general-purpose encoders for natural language

understanding. These models are trained on large amounts of text data and

represent a sentence as a sequence of hidden states that come from the final

layer of the model. Representations in this sentence space are

sequence-length dependent, which makes comparisons between sentences of

different lengths unfair and makes it impossible to measure how well an

unconditional language model works as a universal decoder.

 

To resolve these issues and to make analysis easier, we propose to

reparameterize the original sentence space into a lower-dimensional and

sentence-length-agnostic vector space. We do this by adding a bias term to the

fixed language model and finding the representation that minimizes the cross

entropy loss of the sentence. This reparameterization now gives us the ability

to project sentence tokens to representations (sentence encoding) and recover

sentences from the representation (sentence recovery) via the fixed language

model. We develop three representation injection mechanisms and inject biases

at three different locations.

We add a bias Z’ to the embedding, the transformer layers, and before the language modeling head in GPT-2. “SA” refers to self-attention, “LN” to layer normalization, “FFN” to a fully connected layer, and “LM Head” to the last fully connected layer.
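
To make the encoding and recovery steps concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the frozen "language model" is a tiny random bigram model rather than GPT-2, the bias is injected only at the embedding, and every name in it (`E`, `W`, `BOS`, `encode`, `recover`) is a hypothetical stand-in. It only illustrates the idea of optimizing a single vector `z` against the cross-entropy of one sentence while the model weights stay fixed.

```python
import math
import random

random.seed(0)
VOCAB, DIM = 6, 8  # toy sizes chosen for illustration, not GPT-2's

# Frozen toy "language model": the embedding table E and output head W never change.
E = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
BOS = 0  # hypothetical start-of-sentence token id


def next_token_probs(prev_token, z):
    """Softmax over the vocabulary given the previous token and the bias z."""
    hidden = [e + b for e, b in zip(E[prev_token], z)]  # bias added at the embedding
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]


def loss_and_grad(sentence, z):
    """Mean cross-entropy of the sentence and its gradient with respect to z."""
    loss, grad = 0.0, [0.0] * DIM
    prev = BOS
    for token in sentence:
        probs = next_token_probs(prev, z)
        loss -= math.log(probs[token])
        for v in range(VOCAB):
            err = probs[v] - (1.0 if v == token else 0.0)
            for d in range(DIM):
                grad[d] += err * W[v][d]
        prev = token
    n = len(sentence)
    return loss / n, [g / n for g in grad]


def encode(sentence, steps=200, lr=0.1):
    """Sentence encoding: gradient descent on z only; E and W stay fixed."""
    z = [0.0] * DIM
    for _ in range(steps):
        _, grad = loss_and_grad(sentence, z)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


def recover(z, length):
    """Sentence recovery: greedy decoding from the frozen model conditioned on z."""
    out, prev = [], BOS
    for _ in range(length):
        probs = next_token_probs(prev, z)
        prev = max(range(VOCAB), key=probs.__getitem__)
        out.append(prev)
    return out


sentence = [3, 1, 4, 2, 5]
initial_loss, _ = loss_and_grad(sentence, [0.0] * DIM)
z = encode(sentence)
final_loss, _ = loss_and_grad(sentence, z)
decoded = recover(z, len(sentence))
```

The representation `z` has a fixed size no matter how long the sentence is, which is the length-agnostic property described above. A toy model this small should not be expected to recover the sentence exactly; the sketch only shows that the loss on the target sentence goes down as `z` is optimized.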

 

 

Experiments:

 

Since we are testing the efficacy of a large pretrained language model as a

universal decoder, we measure how well a representation can recover a

sentence when injected into a language model. We measure performance across

sentences from four different genres: books, news articles, Wikipedia, and

movie dialogs. We conduct controlled experiments, varying the representation

injection mechanism, representation injection location, initialization, and

dimensionality of the representation.

 

Results:

 

We find that our representations recover sentences nearly perfectly even

across genres, achieving a BLEU score of over 98.
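
For readers unfamiliar with the metric, the sketch below shows how a BLEU-style score compares a recovered sentence with the original. It is a simplified, unsmoothed, single-reference version written for illustration; the post does not say which BLEU implementation or settings were used, so the function name `sentence_bleu` and the 0 to 100 scaling are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed single-reference BLEU on a 0-100 scale (illustrative only)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty n-gram level zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


original = "the cat sat on a mat".split()
exact = sentence_bleu(original, "the cat sat on a mat".split())
near = sentence_bleu(original, "the cat sat on the mat".split())
```

An exact recovery scores 100, and a recovery with a single wrong token scores lower, so a score above 98 means the recovered sentences match the originals almost token for token.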

 

 

BLEU Scores

BLEU score performance stratified by genre for different dimensionalities of

Z.

 

 

 

In addition, we observe that the intrinsic dimension of the representation

space is the model’s latent dimension, which indicates that the language

model uses its entire capacity to represent sentences.

 

 

Sentence length vs. BLEU score

Plot of sentence length vs. BLEU score on the dataset.

 

 

 

Additionally, our interpolation experiments reveal that the representation

space has some human-understandable meaning. Our learned representations

seem to have some synonym awareness. In the first sentence pair example

below, the word “tale” transforms to the word “story” and the word “long”

transforms to “long-running” when referring to a war. In the second example,

we observe some syntactic awareness at the 0.7 mixture level. The syntax of

the first sentence is retained with mostly words from the second sentence.
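
The interpolation itself is a simple linear mix of two representations. Here is a minimal sketch, assuming the mixture level is the weight placed on the second sentence's representation (the post does not state the direction) and using hypothetical three-dimensional vectors in place of real learned representations. In the actual experiments each mixed vector would then be decoded by the fixed language model.

```python
def interpolate(z_first, z_second, alpha):
    """Linear mix: alpha = 0 returns z_first, alpha = 1 returns z_second."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_first, z_second)]


# Hypothetical representations; real ones would come from the sentence encoding step.
z_first = [1.0, 0.0, -2.0]
z_second = [3.0, 4.0, 2.0]

# Sweep the mixture levels 0.0, 0.1, ..., 1.0.
path = [interpolate(z_first, z_second, step / 10) for step in range(11)]
midpoint = path[5]
```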

 

 

Linear interpolations between perfectly recovered pairs.

Two linear interpolations between perfectly recovered pairs of

representations. Pink indicates token overlap with the first sentence, while

blue indicates token overlap with the second sentence.

 

 

Broader Impact:

 

This work shows that we can discover meaningful, sentence-length-agnostic

sentence representations, hinting at the possibility of a

“universal” decoder. Such a decoder would improve performance on low-resource

sequence generation tasks and allow for considerable parameter sharing in

memory- and data-limited environments.

 

Conclusion:

 

For further details on how we set up the experiment, results, and analysis,

you can find the full paper on

arXiv. If you’re interested

in joining our growing machine learning team, you can find all of the open

positions on our

careers page.

 

