Scale AI Machine Learning Digest - Q3 2020

Published on October 7, 2020


The ML team at Scale hosts a weekly reading group in which members choose papers

from the broad AI/ML community and discuss them. Topics range from computer

vision to NLP to active learning. Here, we briefly summarize some of the

insights we gained from these papers and describe how we aim to use that

knowledge in future research projects and in applications to Scale AI’s

business.

Discovering Useful Sentence Representations from Large Pretrained Language Models


Presenter: Nishant Subramani

This is Scale’s first research paper, and we wrote a

blog post summarizing it. The paper asks whether we can use pre-trained language models

off-the-shelf as universal decoders. To be considered “universal,” a decoder

must have an implicit representation for any target sentence such that it can

recover that sentence exactly when conditioned on its representation. We

investigated whether such representations exist and whether they can be easily

discovered. Experiments show that not only do these representations exist for

sentences from a variety of different genres, but also that our methods can

recover these sentences almost perfectly without fine-tuning the underlying

language model at all.


Scalable Nearest Neighbor Search for Optimal Transport

Presenter: Anastasiia Alokhina

The authors of this paper study the problem of nearest neighbor search

with respect to the Wasserstein distance. The Wasserstein distance has become

popular in data-rich domains (images, text, etc.), but exact search under it

becomes prohibitively slow on large datasets. As a result, approximation methods are

important. The authors introduce a Quadtree variant called Flowtree,

show formally that it has better asymptotic accuracy, and show empirically that it beats

existing methods on a variety of real-world datasets in either

running time or accuracy.

Flowtree is a fast nearest neighbor search algorithm for Optimal Transport

(Wasserstein-1 distance). It has a linear running time and is up to 7x faster

than prior state-of-the-art methods. It does this by computing the optimal flow

on the tree and then evaluating the cost of that flow in

the original metric space.
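
To make the cost of exact search concrete, here is a minimal sketch of the brute-force baseline that methods like Flowtree aim to approximate. It is not the Flowtree algorithm. It assumes the simplest possible setting (equal-size, uniformly weighted point sets on the real line), where the Wasserstein-1 distance has a closed form; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact Wasserstein-1 distance between two equal-size, uniformly
    weighted point sets on the real line (an assumption of this sketch)."""
    if len(a) != len(b) or not a:
        raise ValueError("point sets must be non-empty and the same size")
    # In 1D, the optimal transport plan matches points in sorted order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def nearest_neighbor(
    query: Sequence[float], dataset: List[Sequence[float]]
) -> Tuple[int, float]:
    """Linear scan: one exact distance computation per dataset entry."""
    distances = [wasserstein_1d(query, item) for item in dataset]
    best = min(range(len(dataset)), key=distances.__getitem__)
    return best, distances[best]


dataset = [[0.0, 1.0, 2.0], [5.0, 6.0, 7.0], [0.5, 1.5, 2.5]]
index, distance = nearest_neighbor([0.4, 1.4, 2.4], dataset)
print(index, round(distance, 3))
```

In higher dimensions each exact distance requires solving an optimal transport problem, so this scan is what becomes prohibitively slow and what tree-based approximations are meant to replace.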


At Scale AI, we work with lots of high-dimensional data, such as documents

and images, for a variety of tasks. We often need to find good

representation spaces for these data points so that we can run fast and

highly accurate similarity search. We use similarity search to find outliers in datasets,

to find errors in labeling, and to find examples to label to improve our

models using active learning. We think the Flowtree method could be

invaluable in giving us the ability to use a different distance for this purpose,

and it is something we will explore going forward.

Structure-from-Motion Revisited

Presenter: Hongbo Tian

Structure-from-Motion (SfM) is a family of computer vision techniques that

aim to reconstruct full 3-dimensional scenes from only a series of images.

An accurate SfM algorithm can produce colored point clouds similar to those from LiDAR.

The authors of this paper, along with

Schönberger’s thesis,

introduced a robust computational pipeline for accurate dense 3D

reconstruction. The authors also released

COLMAP, open-source,

CUDA-accelerated software that reproduces the results.


At Scale AI, our labeling pipeline is not subject to the same online

computational constraints as our customers’ systems. This allows us to

pursue batch methods to enrich our 2D datasets, enabling our human labelers

to capture 3D context from images alone.

Representation Learning for Information Extraction from Form-like Documents



Presenter: Nishant Subramani

The authors aim to extract structured information from form-like documents.

They observe that forms have fields that often correspond to well-understood

types and are often associated with a specific key phrase that bears a

visual relationship with it. They also observe that key phrases in documents

are drawn from a small vocabulary. Due to these observations, the authors

build a system with two parts: candidate generation, and scoring and assignment.


In the candidate generation phase, each field type is associated with a

candidate generator based on a cloud-based entity extraction service, which

detects spans of that type. For example, an invoice often contains multiple dates, and every date in the

invoice becomes a candidate for every date field in the target schema. This

process is repeated for every target field.

In the scoring and assignment phase, the goal is to find the correct

extraction candidate for each field. To do this, a score is computed for

each candidate independently using a neural model. Next, the highest-scoring

candidate is assigned to each field. The model learns a representation for

each candidate based only on that candidate’s

neighborhood, entirely independent of other candidates and fields.

The neural scoring model is below.

[Figure: the neural scoring model]
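
The two-phase structure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' system: the `Candidate` type, the schema, and especially `score` (a keyword-overlap heuristic standing in for the neural scorer) are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    text: str                  # the detected span
    kind: str                  # entity type, e.g. "date" or "amount"
    neighbors: Tuple[str, ...]  # tokens near the span on the page


def generate_candidates(
    spans: List[Candidate], schema: Dict[str, str]
) -> Dict[str, List[Candidate]]:
    """Every span of a type becomes a candidate for every field of that type."""
    return {
        field: [s for s in spans if s.kind == kind]
        for field, kind in schema.items()
    }


def score(field: str, candidate: Candidate) -> float:
    """Hypothetical stand-in for the neural scorer. It looks only at the
    candidate's own neighborhood, never at other candidates or fields."""
    key_tokens = set(field.split("_"))
    return float(sum(tok.lower() in key_tokens for tok in candidate.neighbors))


def assign(
    candidates: Dict[str, List[Candidate]],
    scorer: Callable[[str, Candidate], float],
) -> Dict[str, Candidate]:
    """Pick the highest-scoring candidate for each field independently."""
    return {
        field: max(cands, key=lambda c: scorer(field, c))
        for field, cands in candidates.items()
        if cands
    }


spans = [
    Candidate("2020-07-01", "date", ("Invoice", "Date")),
    Candidate("2020-07-31", "date", ("Due", "Date")),
    Candidate("$120.00", "amount", ("Total", "Due")),
]
schema = {"invoice_date": "date", "due_date": "date", "total_amount": "amount"}
extracted = assign(generate_candidates(spans, schema), score)
```

Because each candidate is scored on its own, both dates are candidates for both date fields, and the scorer alone decides which one each field receives.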

Our major takeaways are that structured extraction is not well studied

by the academic community, but is relatively simple. The extraction system

presented has promising accuracy and generalizes well to different domains.

The representations have some interpretability as well. Ultimately, some of

the lessons from this paper could influence our work on

Scale Document.

Big Transfer (BiT): General Visual Representation Learning


Presenter: Felix Lau

This paper focuses on how to transfer pre-trained representations to other

tasks in the visual space. They propose a recipe called Big Transfer (BiT)

and achieve very strong performance on ILSVRC-2012 (87.5% top-1 accuracy),

CIFAR-10 (99.4%), and the Visual Task Adaptation Benchmark (VTAB; 76.3%).


BiT has two phases: upstream and downstream. In upstream pre-training, the

authors first investigate the scale of the computational budget and how it

affects performance. Second, they look at group normalization and weight

standardization, which help significantly with small-batch and large-batch

training alike. In downstream transfer, the authors propose a cheap

fine-tuning methodology by using the BiT-HyperRule to select the most

important hyperparameters for tuning as a function of the task’s intrinsic

image resolution and number of datapoints. The hyperparameters they identify

as important are training schedule length, resolution, and whether to use

MixUp regularization.
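
As a rough illustration of a rule of this kind, the sketch below maps dataset size and image resolution to the three hyperparameters named above. The thresholds and values follow our recollection of the BiT paper and should be treated as assumptions to check against the paper and its released code; `hyperrule_sketch` and `FineTuneConfig` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class FineTuneConfig(NamedTuple):
    steps: int                    # training schedule length
    resize: int                   # side length images are resized to
    crop: int                     # side length of the random crop
    mixup_alpha: Optional[float]  # None means MixUp is disabled


def hyperrule_sketch(num_examples: int, image_size: Tuple[int, int]) -> FineTuneConfig:
    """Pick fine-tuning hyperparameters from dataset size and resolution.

    All numeric thresholds here are assumptions recalled from the paper.
    """
    # Larger datasets get a longer schedule.
    if num_examples < 20_000:
        steps = 500
    elif num_examples < 500_000:
        steps = 10_000
    else:
        steps = 20_000

    # Low-resolution tasks are trained at a smaller input size.
    height, width = image_size
    if height * width < 96 * 96:
        resize, crop = 160, 128
    else:
        resize, crop = 448, 384

    # MixUp is only used once there is enough data.
    mixup_alpha = 0.1 if num_examples >= 20_000 else None
    return FineTuneConfig(steps, resize, crop, mixup_alpha)


# Example: a CIFAR-10-sized task (50,000 training images at 32x32).
config = hyperrule_sketch(50_000, (32, 32))
```

The point of the design is that nothing is tuned per task: two cheap dataset statistics determine the whole fine-tuning configuration.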

There are a few takeaways from this method we found interesting. There’s a

lot of value in balancing the computational budget and simplifying the

experimental pipeline when needed. The BiT-HyperRule focusing on only a few

hyperparameters was illuminating. We were interested in the dynamics of how

large batches, group normalization, and weight standardization interplayed

and were surprised at how poorly batch normalization performed relative to

group normalization and weight standardization for large batches. The

empirical strength of group normalization and weight standardization

surprised us as well and we will start exploring these techniques in some of

our computer vision models with lots of data available. However, the most

impressive part was the empirical few-shot learning performance relative to

the baseline. We are trying to build models that quickly adapt to new

customers and domains and often struggle with fine-tuning, so these ideas

are very top of mind.

Learning Loss for Active Learning


Presenter: Rishab Goyal

The authors present a task-agnostic active learning method: they

attach a loss prediction module to a target network and train it to

predict the target losses of unlabeled inputs. The module can then

flag data that the model is likely to predict incorrectly, and labeling

just those examples improves the overall model.


Their general process is as follows. For every labeled example, the model predicts both the

target and the loss of that prediction. Next, the unlabeled

pool is passed through the model to get predicted losses. Lastly, the

top-k data points by predicted loss are annotated and added to the labeled training set. The

loss prediction module is trained with a ranking loss.
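
A minimal sketch of the two pieces described above: a pairwise margin ranking loss for the loss prediction module, and top-k selection over the unlabeled pool. The pairing scheme (consecutive items in a batch) and the margin value are assumptions for illustration; in practice this would operate on framework tensors rather than Python lists.

```python
from typing import List, Sequence


def pairwise_ranking_loss(
    pred_losses: Sequence[float],
    true_losses: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Margin ranking loss over consecutive pairs (0, 1), (2, 3), ...

    A pair costs nothing when the predicted losses are ordered the same
    way as the true losses and separated by at least `margin`.
    """
    if len(pred_losses) != len(true_losses) or len(pred_losses) < 2:
        raise ValueError("need two equal-length sequences with at least one pair")
    num_pairs = len(pred_losses) // 2
    total = 0.0
    for p in range(num_pairs):
        i, j = 2 * p, 2 * p + 1
        sign = 1.0 if true_losses[i] > true_losses[j] else -1.0
        total += max(0.0, -sign * (pred_losses[i] - pred_losses[j]) + margin)
    return total / num_pairs


def select_for_labeling(pred_losses: Sequence[float], k: int) -> List[int]:
    """Indices of the k unlabeled points with the highest predicted loss."""
    ranked = sorted(range(len(pred_losses)), key=lambda i: pred_losses[i], reverse=True)
    return ranked[:k]


correct_order = pairwise_ranking_loss([2.0, 0.5], [1.0, 0.2])
wrong_order = pairwise_ranking_loss([0.5, 2.0], [1.0, 0.2])
to_label = select_for_labeling([0.1, 0.9, 0.4, 0.7], k=2)
```

A ranking loss only asks the module to order examples correctly, which is all the top-k selection step needs; the predicted values do not have to match the true loss scale.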

As a data labeling company, we are very interested in active learning and

how we can leverage our own strengths to better build ML models. We really

liked the approach because it is widely applicable, but were concerned that

there could be many degenerate cases. We weren’t convinced that the loss

prediction module would be strong enough to offer significant improvements

over uncertainty- or coverage-based methods such as entropy,

query-by-committee, and coreset-based methods.

LayoutLM: Pre-training of Text and Layout for Document Image Understanding



Presenter: Malcolm Greaves

The authors try to solve the problem of automated understanding of

scanned business documents. They use both computer vision and natural

language processing techniques via textual and visual layout information for

pre-training their system. Their system achieves SOTA on a set of layout

analysis, receipt understanding, and document image classification tasks.

They attribute their empirical performance to their novel loss

functions: a masked visual-language model (MVLM) loss and a multi-label

document classification (MDC) loss.


MVLM extends the masked language model loss from BERT. It randomly chooses

some words to mask, keeps each masked token’s 2D positional

embedding, and forces the model to predict the word’s text. This pushes the

model to understand language context and 2D relationships between words in

the document. MDC assumes that a document may have several labels and uses

a standard cross-entropy loss. It is optional and seems to help on the

document image classification tasks.
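
The masking step of MVLM can be sketched as follows. This is a simplified illustration rather than the LayoutLM implementation: the `Token` type, the `[MASK]` string, and the 15% masking rate are assumptions, and the model that consumes the masked sequence is omitted.

```python
import random
from typing import Dict, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) position on the page


MASK = "[MASK]"


def mask_for_mvlm(
    tokens: List[Token], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[Token], Dict[int, str]]:
    """Hide the text of randomly chosen tokens but keep their 2D positions.

    Returns the masked sequence and a map from masked index to the
    original text, which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    masked: List[Token] = []
    targets: Dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(Token(MASK, tok.bbox))  # text hidden, bbox kept
            targets[i] = tok.text
        else:
            masked.append(tok)
    return masked, targets


page = [
    Token("Invoice", (10, 10, 80, 24)),
    Token("Date:", (90, 10, 130, 24)),
    Token("2020-07-01", (140, 10, 220, 24)),
]
masked_page, targets = mask_for_mvlm(page, mask_prob=0.5, seed=0)
```

Since the bounding box survives masking, the model can use where a word sits on the page, as well as the surrounding text, to recover it.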

We found this paper interesting, but computationally demanding: alongside

their loss functions, a key contribution of this paper is their pre-training

routine. Without significant pre-training, these models do not seem to work

well. This work requires off-the-shelf OCR and treats that as ground truth,

which simplifies the work, but makes it less practical for document analysis

practitioners. For the business document tasks that our ML team takes on, we

often like to have greater control over the OCR task. It’s sometimes

advantageous to be able to adjust OCR to provide better downstream

task-dependent results. We do a lot of document analysis at Scale, but can’t

obfuscate or explain away the OCR component, which does produce some errors

(e.g. with handwriting). Although this approach needs word-level

annotations, Scale is in a unique position to get large amounts of these

labor-intensive labels using our labeling platform.

Simplifying Models with Unlabeled Output Data


Presenter: Alexandre Matton

This paper focuses on tasks whose outputs have to obey some constraints.

This happens in pseudocode-to-code transcription, for example. What we

ultimately want is code that compiles and runs without error. The usual

way to address these tasks is to build a single end-to-end model and hope

that it will learn the input-output mapping and understand the output

constraints. The authors provide a new way to tackle these tasks by

replacing the original model with two sub-models: a base predictor and a

denoiser. The goal of the base predictor is to learn the mapping between the

input and the output. The goal of the denoiser is to learn the constraints

of the output space, in order to fix the base predictor’s output.


There is a reason why this decomposition works well. Because the denoiser

simplifies the job of the base predictor, it is now possible to increase the

regularization of the base predictor without affecting the overall model

performance. This in turn leads to solutions that generalize better on

unseen data. The big win of this architecture is that the denoiser can be

trained on unlabeled output data, which is often abundant (e.g., a

huge amount of code that compiles is freely available on GitHub).
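
The sketch below illustrates the two ideas in this section: building denoiser training pairs from unlabeled valid outputs alone, and composing a base predictor with a denoiser at inference time. Everything here is a toy assumption. The corruption function, the parenthesis-balancing "denoiser", and the fixed-output "base predictor" stand in for learned models and for whatever noise process a real system would use.

```python
import random
from typing import Callable, List, Tuple


def corrupt(code: str, rng: random.Random) -> str:
    """Hypothetical noise function: delete one random character."""
    if not code:
        return code
    i = rng.randrange(len(code))
    return code[:i] + code[i + 1:]


def make_denoising_pairs(valid_outputs: List[str], seed: int = 0) -> List[Tuple[str, str]]:
    """Build (noisy, clean) training pairs without any input-output labels."""
    rng = random.Random(seed)
    return [(corrupt(y, rng), y) for y in valid_outputs]


def compose(
    base_predictor: Callable[[str], str], denoiser: Callable[[str], str]
) -> Callable[[str], str]:
    """Final model: the denoiser repairs whatever the base predictor emits."""
    return lambda x: denoiser(base_predictor(x))


def balance_parens(code: str) -> str:
    """Toy denoiser enforcing one constraint: close any unclosed parentheses."""
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return code + ")" * depth


pairs = make_denoising_pairs(["print(len(xs))", "total = sum(xs)"])
# Toy base predictor that gets the mapping right but breaks the constraint.
model = compose(lambda pseudocode: "print(len(xs)", balance_parens)
fixed = model("print the length of xs")
```

Note that `make_denoising_pairs` never sees an input: it needs only examples of valid outputs, which is why the denoiser can be trained on unlabeled output data.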

A large number of tasks actually impose constraints on the output (such as

molecule generation, or translation from one language to another). At Scale,

our ML team also tackles such problems. This paper provides a very general

way to increase our downstream scores on them.

Join our Team

If you’re interested in joining our growing engineering team, take a look at

our careers page for open positions.
