Machine learning is a field in which tremendous strides are being made through active academic research. The machine learning team at Scale AI regularly hosts reading groups to discuss papers we find interesting, giving us context in a rapidly evolving, research-heavy field. In this blog post, we go over the papers the team read throughout the quarter and, where relevant, share how a paper influenced our own work here at Scale AI.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Authors: Jonathan Frankle, Michael Carbin
Presenter: Hongbo Tian
Pruned neural network architectures created by removing unnecessary weights
from trained dense neural networks can attain various benefits without
sacrificing accuracy. For example, smaller architectures have lower storage
requirements, and certain hardware architectures can take advantage of
sparsity, leading to massive computational performance gains at inference
time. However, training the pruned network from scratch (as opposed to
training the full network and pruning) has historically achieved poor
accuracy.
This paper presents the hypothesis that a dense neural network contains some subnetwork that can match the performance of the original network when trained in isolation, and the authors cover techniques to uncover this “winning subnetwork.” Larger network architectures have typically been considered superior, but this paper demonstrates that such a belief is not necessarily true. The lottery ticket hypothesis matters when we develop machine learning models here at Scale AI: it suggests we can build networks that tackle ambitious challenges without being blocked by unreasonable monetary or compute requirements.
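To make the pruning procedure concrete, here is a minimal sketch of the iterative magnitude pruning loop the paper uses to find winning subnetworks. The `train_fn` hook, pruning fraction, and round count are illustrative assumptions rather than the paper’s exact recipe:

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Sketch: train, prune the smallest-magnitude weights, rewind, repeat."""
    init_state = copy.deepcopy(model.state_dict())  # save the original init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if "weight" in n}  # prune weight matrices only, not biases
    for _ in range(rounds):
        train_fn(model, masks)  # assumed to zero out masked weights each step
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            alive = p.abs()[masks[n].bool()]  # magnitudes of surviving weights
            k = max(1, int(prune_frac * alive.numel()))
            threshold = alive.kthvalue(k).values  # k-th smallest magnitude
            masks[n] *= (p.abs() > threshold).float()
        model.load_state_dict(init_state)  # rewind survivors to their init
    return masks  # the mask defines the candidate "winning subnetwork"
```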
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas
Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward
Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver
Presenter: Hongbo Tian
Figure: MuZero’s performance compared to the state of the art. After sufficient training, MuZero’s Elo (denoted by the blue lines) surpasses that of AlphaZero for board games and R2D2 in Atari.
Games like chess and Go have deterministic, well-defined rules. Such games aren’t representative of real-world problems, where the environment’s rules are typically complex or unknown. This paper presents MuZero, a successor to AlphaZero that matches AlphaZero’s performance in Go but can also learn to play Atari games, all without any prior knowledge of the game rules. MuZero builds on AlphaZero’s search algorithms, but it plans with a learned environment model trained to capture only the aspects of the future that are relevant for planning.
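As a rough illustration of the idea, the sketch below wires up MuZero’s three learned functions (representation, dynamics, and prediction) with toy linear layers and hypothetical sizes. The real system uses deep networks and plans with Monte Carlo tree search over these latent states rather than evaluating a fixed action sequence:

```python
import torch
import torch.nn as nn

LATENT, N_ACTIONS, OBS_DIM = 32, 4, 64  # toy sizes for illustration

representation = nn.Linear(OBS_DIM, LATENT)           # h: observation -> latent state
dynamics = nn.Linear(LATENT + N_ACTIONS, LATENT + 1)  # g: (state, action) -> (state, reward)
prediction = nn.Linear(LATENT, N_ACTIONS + 1)         # f: state -> (policy logits, value)

def evaluate_plan(observation, actions):
    """Unroll the learned model along a candidate action sequence."""
    state = representation(observation)
    total_reward = torch.tensor(0.0)
    for a in actions:
        out = dynamics(torch.cat([state, torch.eye(N_ACTIONS)[a]]))
        state, reward = out[:LATENT], out[LATENT]
        total_reward = total_reward + reward
    value = prediction(state)[N_ACTIONS]  # f also yields policy logits in [:N_ACTIONS]
    return total_reward + value  # estimated return of this plan, usable for search

print(evaluate_plan(torch.randn(OBS_DIM), [0, 2, 1]))
```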
Although MuZero doesn’t know the rules or environment dynamics of the challenges it faces, it matches AlphaZero in logically complex games and beats state-of-the-art model-free reinforcement learning algorithms in visually complex games (e.g., Atari). Such a feat makes MuZero a pioneer in developing AI to solve realistic problems where there’s no perfect simulator, and it demonstrates the feasibility of applying learning methods to more ambiguous problems.
FixMatch: Simplifying Semi-Supervised Learning with Consistency and
Confidence
Authors: Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao
Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, Colin Raffel
Presenter: Felix Lau
Figure: The general architecture of FixMatch. The model is trained to make predictions on a strongly augmented version of an unlabeled image, based on the prediction for a weakly augmented version of the same image.
The authors combine two common semi-supervised learning methods, consistency regularization and pseudo-labeling, into a new algorithm that achieves state-of-the-art performance on standard semi-supervised learning benchmarks. For each unlabeled image, the algorithm applies both a weak and a strong augmentation. If the model’s confidence on the weakly augmented image exceeds a threshold, its prediction becomes a pseudo-label, and a cross-entropy loss is applied between that pseudo-label and the prediction on the strongly augmented image.
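The unlabeled-data loss is simple enough to sketch directly. Below is a minimal PyTorch version, with batch shapes and the confidence threshold assumed for illustration:

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_images, strong_images, threshold=0.95):
    """Pseudo-label from the weak view; supervise the strong view with it."""
    with torch.no_grad():
        probs = F.softmax(model(weak_images), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = (confidence >= threshold).float()  # keep only confident labels
    strong_logits = model(strong_images)
    per_example = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (per_example * mask).mean()  # low-confidence examples contribute zero
```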
Reformer: The Efficient Transformer
Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Presenter:
Figure: A visual depiction of attention with locality-sensitive hashing. By hashing and then bucketing keys, attention doesn’t need to consider every pair of keys.
Training large Transformers can be prohibitively expensive: the largest version of GPT-2 contains 1.5 billion parameters, while NVIDIA’s Megatron-LM has up to 8.3 billion. The authors of Reformer present a model that performs on par with preceding Transformers while being more memory-efficient and faster to train. Thanks to its decreased resource requirements, Reformer makes Transformers more practical to use while also handling much longer sequences.
Rather than using the standard dot-product attention, Reformer uses a less expressive attention mechanism combined with locality-sensitive hashing, which reduces the complexity of attention. Classical self-attention attends to all pairs of elements in the input sequence, giving it quadratic complexity in sequence length. By reducing the expressiveness of self-attention, Reformer can use locality-sensitive hashing to bucket similar elements together and attend only within buckets, bringing self-attention down to linearithmic, O(n log n), complexity. This mechanism ultimately enables Reformer to process the entirety of Crime and Punishment in a single training example.
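The bucketing step can be sketched in a few lines. The version below implements the angular LSH scheme Reformer builds on (random projections whose argmax, over the projections and their negations, assigns a bucket); shapes and bucket counts are assumptions, and the full model adds multiple hash rounds, chunking, and causal masking:

```python
import torch

def lsh_bucket_ids(vectors, n_buckets=16, seed=0):
    """Assign each vector a bucket so that similar vectors tend to collide."""
    torch.manual_seed(seed)
    d = vectors.shape[-1]
    projections = vectors @ torch.randn(d, n_buckets // 2)  # random rotations
    # argmax over projections and their negations yields one of n_buckets ids
    return torch.cat([projections, -projections], dim=-1).argmax(dim=-1)

# Attention is then restricted to positions sharing a bucket id, so each
# query compares against a small group instead of the whole sequence.
buckets = lsh_bucket_ids(torch.randn(1024, 64))
```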
The Case for Learned Index Structures
Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis
Presenter: Hongbo Tian
Figure: A comparison of a traditional and a learned hash map. Learned data structures replace the hash function with a model that achieves better performance.
Many classic data structures (such as B-Trees, hash maps, and Bloom filters) rely on an index structure to map a key to the position of a record. An index structure, however, can be generalized as a model that takes in a key and infers information about a data record. The authors explore replacing traditional index structures with deep-learning models, and their results demonstrate that machine learning can substantially speed up classic data structures that have been optimized over decades. The paper is novel in exposing opportunities for large performance improvements in such heavily optimized structures, and it serves as a reminder to our team at Scale AI that machine learning can have very non-obvious use cases and that results can exceed our initial expectations.
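For intuition, here is a toy learned index over sorted keys. A linear fit stands in for the paper’s staged neural models: the model predicts a key’s position, and the recorded worst-case error bounds a final local search. All names and parameters here are illustrative assumptions:

```python
import numpy as np

class ToyLearnedIndex:
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        positions = np.arange(len(self.keys))
        # fit position ~ a * key + b; the paper uses hierarchies of models
        self.a, self.b = np.polyfit(self.keys, positions, deg=1)
        guesses = np.clip(np.rint(self.a * self.keys + self.b),
                          0, len(self.keys) - 1).astype(int)
        self.max_err = int(np.abs(guesses - positions).max())  # error bound

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = lo + np.searchsorted(self.keys[lo:hi], key)  # search the window only
        return i if i < len(self.keys) and self.keys[i] == key else None

index = ToyLearnedIndex(np.random.uniform(0, 1e6, size=100_000))
```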
PointRend: Image Segmentation as Rendering
Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick
Presenter: Felix Lau
Figure: Inference results on a scene with and without PointRend. PointRend handles edges well, properly segmenting the side mirrors of a car.
Image segmentation accuracy can suffer around the borders between classes and in other difficult regions. To address this, the authors of this paper create a neural network module called PointRend that adaptively runs point-based segmentation at higher resolution in selected locations, leading to crisp boundaries compared to previous segmentation techniques. PointRend can be applied on top of existing instance and semantic segmentation models. At Scale AI, we’ve found that getting the edges between classes right is one of the most critical factors in producing high-quality labels for 2D segmentation.
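The core of the idea (pick the points whose labels are least certain, then re-predict only those at higher resolution) can be sketched as follows. This is a simplified, hypothetical version of the point-selection step for a binary mask; the real module also mixes in random points during training and re-predicts with a small MLP head:

```python
import torch

def most_uncertain_points(coarse_logits, k):
    """Return the k pixel coordinates closest to the 0.5 decision boundary."""
    probs = torch.sigmoid(coarse_logits)    # (H, W) foreground probabilities
    uncertainty = -(probs - 0.5).abs()      # largest near the decision boundary
    idx = uncertainty.flatten().topk(k).indices
    w = coarse_logits.shape[1]
    rows = torch.div(idx, w, rounding_mode="floor")
    return torch.stack([rows, idx % w], dim=-1)  # (k, 2) row/col coordinates

points = most_uncertain_points(torch.randn(56, 56), k=100)
```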
RangeNet++: Fast and Accurate LiDAR Semantic Segmentation
Authors: Andres Milioto, Ignacio Vizzo, Jens Behley, Cyrill Stachniss
Presenter: Daniel Havir
Figure: Diagram of the various stages of data representation during LiDAR segmentation. Each letter corresponds to a step in RangeNet++’s approach.
High-level semantic perception tasks in autonomous vehicles are primarily solved with high-resolution cameras, even though a wide variety of sensor modalities is available. At Scale AI, we are familiar with the problems described in this paper. The paper details an approach to improving LiDAR-only semantic segmentation. The authors use a four-step approach (sketched in code after the list):
- Reduce the LiDAR point cloud to a two-dimensional range image that encodes the Euclidean distance along with the original coordinates and LiDAR remission.
- Run the range image representation through a 2D semantic segmentation model.
- Project the model output back to the original point cloud format.
- Post-process the output using a novel KNN approach to account for occlusion errors introduced by the 2D projection.
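A minimal sketch of the first step, the spherical projection from a point cloud to a range image, is below. The image size and vertical field of view are assumptions chosen to resemble a typical 64-beam sensor:

```python
import numpy as np

def range_image(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud into an H x W range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw, pitch = np.arctan2(y, x), np.arcsin(z / r)
    fov_down = np.radians(fov_down_deg)
    fov = np.radians(fov_up_deg) - fov_down
    u = (0.5 * (1.0 - yaw / np.pi) * W).astype(int) % W             # azimuth bin
    v = np.clip((1.0 - (pitch - fov_down) / fov) * H, 0, H - 1).astype(int)
    image = np.zeros((H, W))
    image[v, u] = r  # real pipelines also store x, y, z, and remission channels
    return image

image = range_image(np.random.randn(120_000, 3) * 20 + 1e-6)
```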
Fast Differentiable Sorting and Ranking
Authors: Mathieu Blondel, Olivier Teboul, Quentin Berthet, Josip Djolonga
Presenter: Chiao-Lun Cheng
Figure: Illustration of the soft sorting and ranking operators.
Sorting and ranking are basic and common operations in computer science and machine learning. However, ranking is a piecewise-constant function of its input (its derivatives are zero or undefined), and sorting is non-differentiable wherever the ordering changes, so neither can be used directly in gradient-based learning. The authors propose the first differentiable sorting and ranking operators that preserve the O(n log n) time and O(n) space complexity of their hard counterparts. To get there, they cast sorting and ranking as linear programs over the permutohedron (the convex hull of all permutations of a vector), add regularization to make the solutions differentiable, and reduce the resulting projections to isotonic optimization problems.
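To give a feel for the construction, here is a NumPy sketch of the soft ranking operator’s forward pass following that reduction: sort, run isotonic regression via pool-adjacent-violators, and undo the sort. This is our own simplified reading of the paper, computes values only (the paper additionally derives the Jacobian so the operator can be used in backpropagation), and treats `eps` as the regularization strength:

```python
import numpy as np

def isotonic_decreasing(y):
    """Project y onto decreasing sequences via pool-adjacent-violators."""
    vals, sizes = [], []
    for t in y:
        vals.append(float(t)); sizes.append(1)
        while len(vals) > 1 and vals[-2] < vals[-1]:  # merge violating blocks
            v, s = vals.pop(), sizes.pop()
            vals[-1] = (vals[-1] * sizes[-1] + v * s) / (sizes[-1] + s)
            sizes[-1] += s
    return np.repeat(vals, sizes)

def soft_rank(z, eps=1.0):
    """Soft ranks as a regularized projection onto the permutohedron."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    u = -z / eps                  # small eps -> close to hard ranks
    order = np.argsort(-u)        # indices sorting u in descending order
    s = u[order]
    w = np.arange(n, 0, -1.0)     # vertex (n, n-1, ..., 1) of the permutohedron
    projected = s - isotonic_decreasing(s - w)
    ranks = np.empty(n)
    ranks[order] = projected      # undo the sort
    return ranks

print(soft_rank([0.5, 2.0, 1.0], eps=0.1))  # ~[3, 1, 2]: near-hard ranks
print(soft_rank([0.5, 2.0, 1.0], eps=10.))  # ranks shrink toward the mean, 2
```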
If you’re interested in joining our growing engineering team, take a look at
our careers page for open positions.