Scale AI Machine Learning Digest - Q1 2020

April 6, 2020

Machine learning is a field of study with tremendous strides being made

through active academic research. The machine learning team at Scale AI

regularly hosts reading groups to discuss papers we find interesting, giving

us context in a rapidly evolving, research-heavy field. In this blog post, we

go over the papers the team has read throughout the quarter and provide

insights on how each paper influenced our own work here at Scale AI, where applicable.


The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Authors: Jonathan Frankle, Michael Carbin

Presenter: Hongbo Tian

Pruned neural network architectures created by removing unnecessary weights

from trained dense neural networks can attain various benefits without

sacrificing accuracy. For example, smaller architectures have lower storage

requirements, and certain hardware architectures can take advantage of

sparsity, leading to massive computational performance gains at inference

time. However, training the pruned network from scratch (as opposed to

training the full network and pruning) has historically achieved poor results.


This paper presents the hypothesis that dense neural networks will contain

a subnetwork that can match the performance of the original network

when trained in isolation, and the authors cover some techniques to uncover

the “winning subnetwork.” Larger network architectures have typically been

considered superior, but this paper demonstrates that such a belief is not

necessarily true. The lottery ticket conjecture matters when developing

machine learning models here at Scale AI, since we know that we can build

networks that tackle ambitious challenges without being blocked by

unreasonable monetary or compute requirements.
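As a rough illustration of the idea, the sketch below prunes the smallest-magnitude weights of a trained layer and then resets the surviving weights to their original initialization, which is the general recipe the paper describes for finding a candidate subnetwork. The function names and the toy weight values are hypothetical, and a real experiment would retrain the masked network and usually repeat the process over several rounds.

```python
def magnitude_mask(trained, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude trained weights."""
    n_prune = int(len(trained) * prune_fraction)
    # Indices ordered by |weight|, smallest first.
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]


def winning_ticket(initial, trained, prune_fraction):
    """Apply the mask to the ORIGINAL initialization, not the trained weights."""
    mask = magnitude_mask(trained, prune_fraction)
    return [w0 * m for w0, m in zip(initial, mask)], mask


# Hypothetical toy values for a single four-weight layer.
initial = [0.5, -0.1, 0.3, 0.05]    # weights at initialization
trained = [0.9, -0.02, -0.6, 0.01]  # weights after training
ticket, mask = winning_ticket(initial, trained, prune_fraction=0.5)
```

The subnetwork defined by `mask` would then be trained from `ticket` in isolation and compared against the dense network.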

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas

Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward

Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver

Presenter: Hongbo Tian

MuZero’s performance, compared to the state of the art. After sufficient

training, MuZero’s Elo (denoted by the blue lines) surpasses that of

AlphaZero in board games and R2D2 in Atari.

Games like chess and Go tend to have deterministic and well-defined rules.

Such games aren’t representative of real world problems because most realistic

problems occur in an environment with complex or unknown rules. This paper

presents MuZero, a successor to AlphaZero that matches AlphaZero’s performance

in Go and can also learn to play Atari games, all without any prior knowledge of

the game rules. MuZero builds on AlphaZero’s search algorithms, but it

also uses a learned environment model when training to capture aspects of the

future that are relevant for planning.

Although MuZero doesn’t know the game rules or environment dynamics of the

challenges it faces, it matches AlphaZero in logically complex games and beats

the state-of-the-art model-free reinforcement learning algorithms in visually

complex games (e.g. Atari games). Such a feat makes MuZero a pioneer in

developing AI to solve realistic problems where there’s no perfect simulator

and demonstrates the feasibility of applying learning methods to more

ambiguous problems.
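To make "planning with a learned model" concrete, here is a minimal sketch of the three functions the paper describes: a representation function, a dynamics function, and a prediction function. Everything inside them is a hypothetical toy stand-in (hand-written arithmetic rather than trained networks), and the exhaustive enumeration of two-step plans is a deliberate simplification that replaces MuZero's Monte Carlo tree search.

```python
def represent(observation):
    # h: observation -> hidden state (toy stand-in: sum of the features)
    return float(sum(observation))


def dynamics(state, action):
    # g: (hidden state, action) -> (predicted reward, next hidden state)
    next_state = state + action
    reward = -abs(next_state)  # toy stand-in: states near zero are rewarded
    return reward, next_state


def predict(state):
    # f: hidden state -> (policy prior, value estimate)
    return {-1: 0.5, 1: 0.5}, -abs(state)


def evaluate_plan(observation, actions, discount=1.0):
    """Unroll the learned model along `actions` without touching the real game."""
    state = represent(observation)
    total, scale = 0.0, 1.0
    for action in actions:
        reward, state = dynamics(state, action)
        total += scale * reward
        scale *= discount
    _, value = predict(state)
    return total + scale * value  # bootstrap with the predicted value


plans = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
best = max(plans, key=lambda plan: evaluate_plan([1.0, 1.0], plan))
```

The point of the sketch is that `evaluate_plan` never calls a game simulator: every reward and state it uses comes from the model itself.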

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence


Authors: Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao

Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, Colin Raffel

Presenter: Felix Lau

The general architecture of FixMatch. The model is trained to make

predictions on a strongly-augmented version of an unlabeled image, based on

the prediction results of a weakly-augmented version of the same image.

The authors combine two common semi-supervised learning methods, consistency

regularization and pseudo-labeling, to create a new algorithm that achieves

state-of-the-art performance on standard semi-supervised learning benchmarks.

For each unlabeled image, the algorithm applies both a weak augmentation and a

strong augmentation. If the model’s confidence on the weakly augmented image

exceeds a threshold, its prediction becomes a pseudo-label, and the algorithm

applies a cross-entropy loss between that pseudo-label and the prediction on

the strongly augmented image.
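The unlabeled-data loss for a single image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the logits stand in for model outputs on the two augmented views, and the 0.95 threshold is an assumed default.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy on the strong view against the weak view's pseudo-label.

    Returns 0.0 when the weak-view prediction is not confident enough,
    so that uncertain images contribute nothing to the loss.
    """
    weak_probs = softmax(weak_logits)
    confidence = max(weak_probs)
    if confidence < threshold:
        return 0.0
    pseudo_label = weak_probs.index(confidence)
    strong_probs = softmax(strong_logits)
    return -math.log(strong_probs[pseudo_label])


confident_loss = fixmatch_unlabeled_loss([10.0, 0.0, 0.0], [2.0, 1.0, 0.0])
ignored_loss = fixmatch_unlabeled_loss([1.0, 0.9, 0.8], [2.0, 1.0, 0.0])
```

In training, this term would be averaged over a batch of unlabeled images and added to the ordinary supervised loss on the labeled images.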

Reformer: The Efficient Transformer

Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Presenter: Hongbo Tian

A visual depiction of attention with locality sensitive hashing. By hashing

and then bucketing keys, attention doesn’t need to handle all pair-wise elements.


Training good transformers can be prohibitively expensive: the largest version

of GPT-2 contains 1.5 billion parameters, while NVIDIA’s Megatron-LM has up to

8.3 billion parameters. The authors of Reformer present a model that performs

on par with preceding transformers while being more memory-efficient and

faster to train. Due to the decreased resource requirements, Reformer makes

transformers more practical to use while also being able to

handle longer sequences in general.

Rather than using the standard dot-product attention, Reformer uses a less

expressive attention mechanism combined with locality-sensitive hashing, which

reduces the complexity of attention. The classical self-attention mechanism

attends to all pair-wise elements in the input sequence, giving it quadratic

algorithmic complexity. By reducing the expressiveness of self-attention,

Reformer can leverage locality-sensitive hashing via bucketing similar

elements together, thus simplifying self-attention to linearithmic complexity.

This mechanism ultimately enables Reformer to process the entirety of Crime

and Punishment in a single training example.
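The bucketing step can be sketched with a random-projection hash: each vector is projected, the projections are concatenated with their negations, and the index of the largest entry is the bucket. This is a simplified, single-round sketch under stated assumptions; the helper names are hypothetical, and the sorting, chunking, and multiple hash rounds a full model would need are left out.

```python
import random
from collections import defaultdict


def make_projection(dim, n_buckets, seed=0):
    """Random Gaussian matrix of shape [dim, n_buckets // 2]."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
            for _ in range(dim)]


def lsh_bucket(vector, projection):
    """Bucket id = argmax over the projections and their negations."""
    half = len(projection[0])
    rotated = [sum(vector[d] * projection[d][j] for d in range(len(vector)))
               for j in range(half)]
    scores = rotated + [-r for r in rotated]
    return scores.index(max(scores))


def bucketize(vectors, projection):
    """Group sequence positions by bucket; attention then stays within a bucket."""
    buckets = defaultdict(list)
    for position, vector in enumerate(vectors):
        buckets[lsh_bucket(vector, projection)].append(position)
    return dict(buckets)


projection = make_projection(dim=4, n_buckets=8)
vectors = [[1.0, 0.5, -0.2, 0.3], [2.0, 1.0, -0.4, 0.6], [-1.0, -0.5, 0.2, -0.3]]
buckets = bucketize(vectors, projection)
```

Vectors that point in the same direction land in the same bucket regardless of their length, which is why nearby keys end up attending to each other.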

The Case for Learned Index Structures

Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis

Presenter: Hongbo Tian

A comparison of a traditional and learned hash map. Learned data structures

replace the hashing function with a model that achieves better performance.

Many classic data structures (such as B-Trees, hash maps, and Bloom filters)

rely on some index structure to map a key to a position of a record. However,

index structures can be generalized as a model that takes in a key and infers

information about a data record. The authors explore replacing traditional

index structures with deep-learning models, and their results demonstrate that

machine learning can greatly increase the speed of classic data structures.

This paper is novel in the sense that it exposes opportunities for large

performance improvements in data structures that have already been optimized

for decades. It also serves as a reminder to our team at Scale AI that

performance can exceed our initial expectations and that machine learning can

have very non-obvious use cases.
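A toy version of the idea, assuming a sorted, non-empty list of numeric keys: a model predicts where a key should sit, and a bounded local search corrects the prediction. The class below is a hypothetical sketch that uses a single least-squares line in place of the deep-learning models the paper explores.

```python
import bisect


class LearnedIndex:
    """Predict a key's position with a linear model, then search nearby."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_key = sum(sorted_keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in sorted_keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(sorted_keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_error = max(abs(self._predict(k) - p)
                             for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        guess = round(self.slope * key + self.intercept)
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of `key`, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_error, 0)
        hi = min(guess + self.max_error + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None


keys = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
index = LearnedIndex(keys)
position = index.lookup(11)
```

The better the model fits the key distribution, the smaller `max_error` becomes and the less searching each lookup needs.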


PointRend: Image Segmentation as Rendering

Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick

Presenter: Felix Lau

Inference results on a scene with and without PointRend. PointRend is able

to handle the edges, such that it properly segments the side mirrors of a vehicle.


Image segmentation can suffer in accuracy around the borders of various

classes and difficult regions. As a result, the authors of this paper create a

neural network module called PointRend that adaptively runs point-based

segmentation in higher resolutions at selected locations, leading to crisp

boundaries compared to previous segmentation techniques. PointRend can be

applied on top of existing instance and semantic segmentation tasks. At Scale

AI, we’ve found that getting the edges between classes right is most

critical when it comes to producing high quality labels for 2D segmentation.
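The "selected locations" are the points where the coarse prediction is least certain, which for a binary mask means probabilities closest to 0.5. The sketch below covers only that selection step on a hypothetical coarse probability map; the function name is an assumption, and the upsampling and per-point re-prediction that PointRend performs afterwards are not shown.

```python
def most_uncertain_points(prob_map, n_points):
    """Return the n (row, col) locations whose probability is closest to 0.5.

    prob_map is a 2D list of foreground probabilities from a coarse mask.
    """
    scored = [(abs(p - 0.5), r, c)
              for r, row in enumerate(prob_map)
              for c, p in enumerate(row)]
    scored.sort()  # most uncertain first
    return [(r, c) for _, r, c in scored[:n_points]]


# Hypothetical coarse mask: the object boundary runs along the diagonal.
coarse = [[0.02, 0.10, 0.55],
          [0.05, 0.48, 0.97],
          [0.30, 0.90, 0.99]]
points = most_uncertain_points(coarse, 2)
```

Spending the extra computation only on these points is what lets the boundaries become crisp without re-predicting every pixel at high resolution.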

RangeNet++: Fast and Accurate LiDAR Semantic Segmentation

Authors: Andres Milioto, Ignacio Vizzo, Jens Behley, Cyrill Stachniss

Presenter: Daniel Havir

Diagram of the various states of data representation during LiDAR

segmentation. Each letter corresponds to a step in the approach that

RangeNet++ takes.

High-level semantic perception tasks in autonomous vehicles are primarily

solved through high-resolution cameras, even though there is a wider variety

of sensor modalities available for autonomous vehicles. At Scale AI, we are

familiar with the problems defined in this paper. The paper details an

approach to improving LiDAR-only semantic segmentation. The authors use a

4-step approach:

  1. Reduce the LiDAR point cloud to two-dimensional range images that encode the
     Euclidean distance along with the original coordinates and LiDAR remission.
  2. Run the range image representation through a 2D semantic segmentation model.
  3. Project the model output back to the original point cloud format.
  4. Post-process the output using a novel KNN approach to account for occlusion
     errors introduced with the 2D projection.
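The first step can be sketched as a spherical projection: each point's yaw and pitch select a pixel, and the pixel stores the point's range. The image size and vertical field of view below are assumed values typical of a 64-beam sensor, the function name is hypothetical, and keeping only the closest point per pixel is a simplification. A full range image would also carry the coordinates and remission channels.

```python
import math


def project_to_range_image(points, width=2048, height=64,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map (x, y, z) points to {(row, col): (range, point_index)}."""
    fov_down = abs(math.radians(fov_down_deg))
    fov = abs(math.radians(fov_up_deg)) + fov_down
    image = {}
    for index, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        u = 0.5 * (1.0 - yaw / math.pi) * width   # horizontal pixel
        v = (1.0 - (pitch + fov_down) / fov) * height  # vertical pixel
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        # Several points can fall into one pixel; keep the closest one.
        if (row, col) not in image or r < image[(row, col)][0]:
            image[(row, col)] = (r, index)
    return image


# Two points along the same ray: the farther one is occluded in the image.
range_image = project_to_range_image([(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

Because points that share a pixel are collapsed, labels projected back to the point cloud can leak onto occluded points, which is the error the final post-processing step is meant to correct.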

Fast Differentiable Sorting and Ranking

Authors: Mathieu Blondel, Olivier Teboul, Quentin Berthet, Josip Djolonga

Presenter: Chiao-Lun Cheng

Illustration of the soft sorting and ranking operators.

Sorting and ranking are basic and common operations in computer science and

machine learning. However, sorting is piecewise linear and ranking is piecewise

constant, so neither is differentiable everywhere, and the derivatives of

ranking are zero wherever they exist. The authors propose the first

differentiable sorting and ranking operators with O(n log n) time and O(n)

space complexity. To make sorting and ranking differentiable, the authors

construct the operators as projections onto the permutahedron, the convex hull

of permutations, and reduce those projections to isotonic optimization.
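The obstacle the paper addresses is easy to see numerically. The sketch below shows the problem rather than the authors' solution: a hard ranking function does not change when an input is nudged slightly, so a finite-difference estimate of its gradient is zero. The function names and the descending-rank convention are assumptions made for illustration.

```python
def hard_rank(values):
    """Rank 1 goes to the largest value (descending order)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position + 1
    return ranks


def finite_difference(values, coordinate, eps=1e-3):
    """Estimate d(ranks)/d(values[coordinate]) by nudging one input."""
    bumped = list(values)
    bumped[coordinate] += eps
    before, after = hard_rank(values), hard_rank(bumped)
    return [(b - a) / eps for a, b in zip(before, after)]


values = [2.9, 0.1, 1.2]
ranks = hard_rank(values)
gradient = finite_difference(values, coordinate=2)
```

A gradient of all zeros gives an optimizer nothing to follow, which is why a smooth relaxation of these operators is needed before they can sit inside a model trained by backpropagation.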


If you’re interested in joining our growing engineering team, take a look at

our careers page for open positions.
