Introduction
The ML Team at Scale hosts a weekly reading group where members choose papers
from the broad AI/ML community and discuss them, with topics ranging from
Computer Vision to NLP to Active Learning. Here, we briefly summarize
some of the insights we gained from various papers and describe how we aim to
use that knowledge in future research projects and applications to Scale
AI’s business.
Discovering Useful Sentence Representations from Large Pretrained Language
Models
Presenter: Nishant Subramani
This is Scale’s first research paper, and we wrote a
blog post summarizing it. The paper asks whether pre-trained language models
can be used off-the-shelf as universal decoders. To be considered “universal,” a decoder
must have an implicit representation for any target sentence such that it can
recover that sentence exactly when conditioned on its representation. We
investigated whether such representations exist and whether they can be easily
discovered. Experiments show that not only do these representations exist for
sentences from a variety of different genres, but also that our methods can
recover these sentences almost perfectly without fine-tuning the underlying
language model at all.
Scalable Nearest Neighbor Search for Optimal Transport
Presenter: Anastasiia Alokhina
The authors of this paper look at the problem of nearest neighbor search
with respect to Wasserstein distance. Wasserstein distance has become
popular in data-rich environments and domains (images, text, etc.), but
exact search under it is prohibitively slow, so approximation methods are
important. The authors introduce a Quadtree variant called Flowtree, prove
that it has better asymptotic accuracy, and show empirically that it beats
existing methods on a variety of real-world datasets in either
running time or accuracy.
Flowtree is a fast nearest neighbor search algorithm for Optimal Transport
(Wasserstein-1 distance) that has a linear running time and is up to 7x
faster than SOTA methods. It achieves this by evaluating the optimal tree
flow in the original metric space.
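To make the search problem concrete, here is a minimal sketch of the exact baseline that methods like Flowtree try to avoid: a linear scan that computes Wasserstein-1 against every dataset entry. This is not Flowtree. It assumes 1D samples of equal size with uniform weights, where the optimal transport plan reduces to matching sorted values; the function names are ours.

```python
from typing import List, Sequence


def wasserstein1_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Exact Wasserstein-1 distance between two equal-size 1D samples.

    With uniform weights and equal sizes, the optimal plan in 1D matches the
    i-th smallest point of xs to the i-th smallest point of ys.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("this sketch assumes non-empty, equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def nearest_neighbor(query: Sequence[float], dataset: List[Sequence[float]]) -> int:
    """Index of the dataset entry closest to `query` under Wasserstein-1."""
    # Linear scan: one exact distance computation per dataset entry.
    return min(range(len(dataset)), key=lambda i: wasserstein1_1d(query, dataset[i]))
```

In higher dimensions each exact distance requires solving a transport problem, so this scan becomes the bottleneck that approximate methods target.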
At Scale AI, we work with lots of high-dimensional data, such as documents
and images, for a variety of tasks. We often need to find good
representation spaces for these data points so that we can run fast and
highly accurate similarity search. This lets us find outliers in datasets,
catch labeling errors, and pick examples to label to improve our
models using active learning. We think the Flowtree method could be
invaluable in letting us use a different distance for this purpose,
and it is something we will explore going forward.
Structure-from-Motion Revisited
Presenter: Hongbo Tian
Structure-from-Motion (SfM) is a family of computer vision techniques that
aim to reconstruct full 3-dimensional scenes from only a series of images.
Accurate SfM algorithms can produce colored point clouds similar to LiDAR.
The authors of this paper along with
introduced a robust computational pipeline for accurate dense 3D
reconstruction. The authors also introduced
COLMAP, open-source,
CUDA-accelerated software for reproducing the results.
At Scale AI, our labeling pipeline is not subject to the same online
computational constraints as our customers are. This allows us to
pursue batch methods to enrich our 2D datasets, enabling our human labelers
to capture 3D context from only images.
Representation Learning for Information Extraction from Form-like
Documents
Presenter:
The authors aim to extract structured information from form-like documents.
They observe that forms have fields that often correspond to well-understood
types and are often associated with a specific key phrase that bears a
visual relationship with it. They also observe that key phrases in documents
are drawn from a small vocabulary. Due to these observations, the authors
build a system with two parts: candidate generation and scoring &
assignment.
In the candidate generation phase, each field type is associated with a
candidate generator based on a cloud-based entity extraction service, which
detects spans of that type. For example, an invoice may contain multiple
dates, and every date in the invoice becomes a candidate for every date
field in the target schema. This process is repeated for every target field.
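The candidate generation step can be sketched roughly as follows. The regular expressions stand in for the entity extraction service, and the schema, field names, and patterns are hypothetical choices made for illustration only.

```python
import re
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    field: str   # target schema field this span is a candidate for
    text: str    # matched span text
    start: int   # character offset of the span in the document


# Hypothetical stand-in for the entity extraction service: one regex per type.
TYPE_PATTERNS: Dict[str, str] = {
    "date": r"\b\d{4}-\d{2}-\d{2}\b",
    "amount": r"\$\d+(?:\.\d{2})?",
}

# Hypothetical target schema mapping each field to its type.
SCHEMA: Dict[str, str] = {
    "invoice_date": "date",
    "due_date": "date",
    "total": "amount",
}


def generate_candidates(document: str) -> List[Candidate]:
    """Every span of a given type becomes a candidate for every field of that type."""
    candidates = []
    for field, field_type in SCHEMA.items():
        for match in re.finditer(TYPE_PATTERNS[field_type], document):
            candidates.append(Candidate(field, match.group(), match.start()))
    return candidates
```

With two dates in a document, both `invoice_date` and `due_date` receive two candidates each, which is why a separate scoring step is needed.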
In the scoring and assignment phase, the goal is to find the correct
extraction candidate for each field. To do this, a score is computed for
each candidate independently using a neural model. Next, the
highest-scoring candidate is assigned to each field. The model learns a
representation for each candidate based only on that candidate’s
neighborhood, entirely independent of other candidates or fields.
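A rough sketch of the assignment logic, with the neural model replaced by a toy scorer: `keyword_score` and its key phrases are hypothetical and only illustrate that each score depends on the candidate's own neighboring words and nothing else.

```python
from typing import Callable, Dict, List, Tuple

# (field, span_text, neighboring_words); the layout is illustrative.
FieldCandidate = Tuple[str, str, List[str]]


def keyword_score(candidate: FieldCandidate) -> float:
    """Toy stand-in for the neural scorer: uses only the candidate's neighborhood."""
    field, _text, neighbors = candidate
    key_phrases = {"invoice_date": {"invoice", "date"}, "due_date": {"due"}}
    hits = sum(1 for word in neighbors if word.lower() in key_phrases.get(field, set()))
    return hits / max(len(neighbors), 1)


def assign(candidates: List[FieldCandidate],
           score: Callable[[FieldCandidate], float] = keyword_score) -> Dict[str, str]:
    """Score each candidate independently, then keep the best one per field."""
    best: Dict[str, Tuple[float, str]] = {}
    for cand in candidates:
        s = score(cand)
        field, text, _ = cand
        if field not in best or s > best[field][0]:
            best[field] = (s, text)
    return {field: text for field, (_, text) in best.items()}
```

Because scores are computed independently, the same date can in principle win two fields; this sketch does nothing to prevent that.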
The neural scoring model is below.
Our major takeaways are that structured extraction is not especially well
studied by the academic community, but the approach is relatively simple. The extraction system
presented has promising accuracy and generalizes well to different domains.
The representations have some interpretability as well. Ultimately some of
the learnings from this paper could influence some of our work on
Big Transfer (BiT): General Visual Representation Learning
Presenter:
This paper focuses on how to transfer pre-trained representations to other
tasks in the visual space. They propose a recipe called Big Transfer (BiT)
and achieve very strong performance on ILSVRC-2012 (87.5% top-1 accuracy),
CIFAR-10 (99.4%), and the Visual Task Adaptation Benchmark (VTAB; 76.3%).
BiT has two phases: upstream and downstream. In upstream pre-training, the
authors first investigate the scale of the computational budget and how it
affects performance. Second, they look at group normalization and weight
standardization, which help significantly with small-batch and large-batch
training alike. In downstream transfer, the authors propose a cheap
fine-tuning methodology that uses the BiT-HyperRule to set the most
important hyperparameters as a function of the task’s intrinsic
image resolution and number of datapoints. The hyperparameters they found
important were training schedule length, resolution, and whether to use
MixUp regularization.
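The shape of such a rule can be sketched as a small function from dataset size and image resolution to the three hyperparameters. The specific thresholds and values below are illustrative assumptions based on our recollection of the paper and should be checked against it before use; `FineTuneConfig` and `hyperrule` are names we made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    schedule_steps: int   # length of the fine-tuning schedule
    resize_to: int        # side length images are resized to
    crop_to: int          # side length of the training crop
    use_mixup: bool       # whether MixUp regularization is enabled


def hyperrule(num_examples: int, image_side: int) -> FineTuneConfig:
    """Pick the three hyperparameters from dataset size and image resolution.

    All thresholds are illustrative assumptions, not values stated in this post.
    """
    if num_examples < 20_000:        # small task: short schedule, no MixUp
        steps, mixup = 500, False
    elif num_examples < 500_000:     # medium task
        steps, mixup = 10_000, True
    else:                            # large task
        steps, mixup = 20_000, True

    if image_side < 96:              # low-resolution images
        resize_to, crop_to = 160, 128
    else:
        resize_to, crop_to = 448, 384
    return FineTuneConfig(steps, resize_to, crop_to, mixup)
```

The appeal of this design is that nothing is searched per task: two cheap dataset statistics determine the whole fine-tuning setup.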
There are a few takeaways from this method we found interesting. There’s a
lot of value in balancing the computational budget and simplifying the
experimental pipeline when needed. The BiT-HyperRule focusing on only a few
hyperparameters was illuminating. We were interested in the dynamics of how
large batches, group normalization, and weight standardization interplayed
and were surprised at how poorly batch normalization performed relative to
group normalization and weight standardization for large batches. The
empirical strength of group normalization and weight standardization
surprised us as well and we will start exploring these techniques in some of
our computer vision models with lots of data available. However, the most
impressive part was the empirical few-shot learning performance relative to
the baseline. We are trying to build models that quickly adapt to new
customers and domains and often struggle with fine-tuning, so these ideas
are very top of mind.
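For readers unfamiliar with the two normalization techniques mentioned above, here is a simplified pure-Python sketch. It assumes a single example with one activation per channel (no spatial dimensions) and omits the learnable scale and shift, so it shows the statistics only, not a full layer.

```python
import math
from typing import List


def _standardize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def weight_standardization(weights: List[List[float]]) -> List[List[float]]:
    """Standardize each output unit's weights (one row per output channel)."""
    return [_standardize(row) for row in weights]


def group_norm(channels: List[float], num_groups: int) -> List[float]:
    """Normalize one example's channel activations within each channel group.

    Statistics come from the example itself, so they do not depend on batch size.
    """
    if len(channels) % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(channels) // num_groups
    out: List[float] = []
    for g in range(num_groups):
        out.extend(_standardize(channels[g * size:(g + 1) * size]))
    return out
```

Neither function looks at other examples in the batch, which is the property that distinguishes them from batch normalization.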
Learning Loss for Active Learning
Presenter:
The authors present an active learning method that is task-agnostic: they
attach a loss prediction module to a target network and train it to
predict the target losses of unlabeled inputs. The module can then
flag data that the model is likely to predict incorrectly, and the
overall model improves by choosing just those examples to label.
Their general process is as follows. For every labeled example, the model
predicts both the target and the loss of that prediction. Next, the unlabeled
pool is passed through the model to get predicted losses. Lastly, the
top-k data points by predicted loss are annotated and added to the labeled
training set. The loss prediction module is trained with a ranking loss.
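A minimal sketch of the two pieces described above: a margin-based ranking loss that compares pairs of examples, and top-k selection on the unlabeled pool. Pairing consecutive examples and the default margin of 1.0 are simplifying assumptions on our part.

```python
from typing import List, Sequence


def pairwise_ranking_loss(pred: Sequence[float], target: Sequence[float],
                          margin: float = 1.0) -> float:
    """Margin ranking loss over consecutive pairs (0, 1), (2, 3), ...

    A pair is penalized unless the predicted losses are ordered like the true
    losses with a gap of at least `margin`.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need matching sequences with at least one pair")
    total, pairs = 0.0, 0
    for i in range(0, len(pred) - 1, 2):
        sign = 1.0 if target[i] > target[i + 1] else -1.0
        total += max(0.0, -sign * (pred[i] - pred[i + 1]) + margin)
        pairs += 1
    return total / pairs


def select_top_k(predicted_losses: Sequence[float], k: int) -> List[int]:
    """Indices of the k unlabeled examples with the highest predicted loss."""
    order = sorted(range(len(predicted_losses)),
                   key=lambda i: predicted_losses[i], reverse=True)
    return order[:k]
```

A ranking objective only asks the module to order examples correctly, so it stays usable as the scale of the true loss shrinks during training.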
As a data labeling company, we are very interested in active learning and
how we can leverage our own strengths to better build ML models. We really
liked the approach because it is widely applicable, but were concerned that
there could be many degenerate cases. We weren’t convinced that the loss
prediction module would be strong enough to offer significant improvements
over uncertainty- or coverage-based methods such as entropy,
query-by-committee, and coreset-based methods.
LayoutLM: Pre-training of Text and Layout for Document Image
Understanding
Presenter:
The authors try to solve the problem of automated understanding of
scanned business documents. They combine computer vision and natural
language processing techniques, using textual and visual layout information
to pre-train their system. Their system achieves SOTA on a set of layout
analysis, receipt understanding, and document image classification tasks.
They attribute their empirical performance to their novel loss
functions: a masked visual-language model (MVLM) loss and a multi-label
document classification (MDC) loss.
MVLM extends the masked language model loss from BERT: some words are chosen
at random and masked, but each masked token’s 2D positional
embedding is kept, and the model must predict the word’s text. This forces the
model to understand language context and 2D relationships between words in
the document. MDC assumes that a document may have several labels and is
a standard cross-entropy loss. It is optional and seems to help on the
document image classification tasks.
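The MVLM masking step can be sketched on plain tokens, without any model. The `Token` layout, the `[MASK]` string, and the `mask_prob` parameter are assumptions for illustration; the point is only that the text is hidden while the bounding box is left untouched.

```python
import random
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) position on the page


MASK = "[MASK]"


def mask_tokens(tokens: List[Token], mask_prob: float,
                rng: random.Random) -> Tuple[List[Token], List[Optional[str]]]:
    """Hide the text of randomly chosen tokens while keeping their 2D positions.

    Returns the masked inputs and per-token labels: the original text where a
    token was masked, None elsewhere.
    """
    inputs: List[Token] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(Token(MASK, tok.bbox))  # bounding box is kept as-is
            labels.append(tok.text)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

A model trained on these pairs sees where the hidden word sits on the page, so layout becomes a usable signal for predicting it.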
We found this paper interesting, but computationally demanding: alongside
their loss functions, a key contribution of this paper is their pre-training
routine. Without significant pre-training, these models do not seem to work
well. This work requires off-the-shelf OCR and treats that as ground truth,
which simplifies the work, but makes it less practical for document analysis
practitioners. For the business document tasks that our ML team takes on, we
often like to have greater control over the OCR task. It's sometimes
advantageous to be able to adjust OCR to provide better downstream
task-dependent results. We do a lot of document analysis at Scale, but can’t
obfuscate or explain away the OCR component, which does produce some errors
(e.g. with handwriting). Although this approach needs word-level
annotations, Scale is in a unique position to get large amounts of these
labor-intensive labels using our labeling platform.
Simplifying Models with Unlabeled Output Data
Presenter:
This paper focuses on tasks whose outputs have to obey some constraints.
One example is pseudo-code to code translation, where what we
ultimately want is code that compiles and runs without error. The usual
way to address these tasks is to build a single end-to-end model and hope
that it will learn the input-output mapping and understand the output
constraints. The authors provide a new way to tackle these tasks by
replacing the original model with two sub-models: a base predictor and a
denoiser. The goal of the base predictor is to learn the mapping between the
input and the output. The goal of the denoiser is to learn the constraints
of the output space, in order to fix the base predictor’s output.
There is a reason why this decomposition works well. Because the denoiser
simplifies the job of the base predictor, it becomes possible to increase the
regularization of the base predictor without hurting overall model
performance. This in turn leads to solutions that generalize better to
unseen data. The big win of this architecture is that the denoiser can be
trained on unlabeled output data, which is often abundant (e.g. there is a
huge amount of compilable code freely available on GitHub).
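The key property, that the denoiser needs only outputs and no inputs, can be sketched as below. The character-deletion noise model in `corrupt` is a hypothetical choice of ours, and the two sub-models are passed in as plain functions rather than trained networks.

```python
import random
from typing import Callable, List, Tuple


def corrupt(output: str, rng: random.Random) -> str:
    """Hypothetical noise model: delete one random character from a valid output."""
    if not output:
        return output
    i = rng.randrange(len(output))
    return output[:i] + output[i + 1:]


def make_denoiser_pairs(valid_outputs: List[str],
                        rng: random.Random) -> List[Tuple[str, str]]:
    """Build (noisy, clean) training pairs from unlabeled outputs alone."""
    return [(corrupt(y, rng), y) for y in valid_outputs]


def compose(base_predictor: Callable[[str], str],
            denoiser: Callable[[str], str]) -> Callable[[str], str]:
    """Final model: the denoiser post-processes the base predictor's output."""
    return lambda x: denoiser(base_predictor(x))
```

No input appears anywhere in `make_denoiser_pairs`, which is why freely available valid outputs are enough to train the denoiser.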
A large number of tasks impose constraints on the output (such as
molecule generation, or translation from one language to another). At Scale,
our ML team also tackles such problems. This paper provides a very general
way to improve our downstream scores on them.
Join our Team
If you’re interested in joining our growing engineering team, take a look at
our careers page for open positions.