Scale AI Machine Learning Digest - Q1 2020

by Andrew Liu on April 6th, 2020

Machine learning is advancing rapidly through active academic research. The machine learning team at Scale AI regularly hosts reading groups to discuss papers we find interesting, which keeps us current in a fast-moving, research-heavy field. In this blog post, we go over the papers the team read this quarter and, where relevant, describe how a paper influenced our own work at Scale AI.

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Authors: Jonathan Frankle, Michael Carbin Presenter: Hongbo Tian

Pruned neural network architectures created by removing unnecessary weights from trained dense neural networks can attain various benefits without sacrificing accuracy. For example, smaller architectures have lower storage requirements, and certain hardware architectures can take advantage of sparsity, leading to massive computational performance gains at inference time. However, training the pruned network from scratch (as opposed to training the full network and pruning) has historically achieved poor accuracy.

This paper presents the hypothesis that a dense neural network contains some subnetwork that can match the performance of the original network when trained in isolation, and the authors cover some techniques to uncover this “winning subnetwork.” Larger network architectures have typically been considered superior, but this paper demonstrates that such a belief is not necessarily true. The lottery ticket hypothesis matters when developing machine learning models here at Scale AI, since it suggests that we can build networks that tackle ambitious challenges without being blocked by unreasonable monetary or compute requirements.
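To make the idea concrete, here is a minimal sketch of one pruning round as we understand the paper's procedure: train the dense network, drop the smallest-magnitude weights, and reset the surviving weights to their original initialization. The function names and the toy weight values are our own illustrations, and a flat list of weights stands in for a real network.

```python
def magnitude_mask(trained, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude trained weights."""
    n_prune = int(len(trained) * prune_fraction)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]


def winning_ticket(initial, trained, prune_fraction):
    """Apply the mask to the ORIGINAL initialization, not the trained weights."""
    mask = magnitude_mask(trained, prune_fraction)
    return [w * m for w, m in zip(initial, mask)], mask


initial = [0.10, -0.20, 0.05, 0.30]   # hypothetical weights at initialization
trained = [0.90, -0.01, 0.02, -0.70]  # hypothetical weights after training
ticket, mask = winning_ticket(initial, trained, prune_fraction=0.5)
```

The masked initialization in `ticket` is what would then be retrained in isolation. Repeating the train, prune, and reset cycle several times gives an iterative version of the same idea.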

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver Presenter: Hongbo Tian

MuZero’s performance compared to the state of the art. After sufficient training, MuZero’s Elo (denoted by the blue lines) surpasses that of AlphaZero for board games and R2D2 in Atari.

Games like chess and Go tend to have deterministic and well-defined rules. Such games aren’t representative of real-world problems, because most realistic problems occur in an environment with complex or unknown rules. This paper presents MuZero, a successor to AlphaZero that matches AlphaZero’s performance in Go and can also learn to play Atari games, all without any prior knowledge of the game rules. MuZero builds on AlphaZero’s search algorithms, but it also uses a learned environment model during training to capture the aspects of the future that are relevant for planning.

Although MuZero doesn’t know the game rules or environmental dynamics of the challenges it faces, it matches AlphaZero in logically complex games and beats state-of-the-art model-free reinforcement learning algorithms in visually complex games (e.g. Atari games). Such a feat makes MuZero a pioneer in developing AI to solve realistic problems where there’s no perfect simulator, and it demonstrates the feasibility of applying learning methods to more ambiguous problems.

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Authors: Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, Colin Raffel Presenter: Felix Lau

The general architecture of FixMatch. The model is trained to make predictions on a strongly-augmented version of an unlabeled image, based off the prediction results of a weakly-augmented version of the same image.

The authors combine two common semi-supervised learning methods, consistency regularization and pseudo-labeling, to create a new algorithm that achieves state-of-the-art performance on standard semi-supervised learning benchmarks. For each unlabeled image, the algorithm produces both a weakly augmented and a strongly augmented version. If the model’s confidence in its prediction on the weakly augmented image is higher than a certain threshold, that prediction becomes a pseudo-label, and the algorithm applies a cross-entropy loss between the pseudo-label and the model’s prediction on the strongly augmented image.
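The unlabeled part of the loss can be sketched in a few lines of plain Python. This is an illustration under our own assumptions rather than the authors' code: the function name is hypothetical, the inputs are class probabilities the model has already produced for the two augmented views, and the 0.95 threshold is the value we recall the paper using by default.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Average cross-entropy between confident pseudo-labels and strong-view predictions."""
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence < threshold:
            continue  # prediction too uncertain: this example contributes no loss
        pseudo_label = p_weak.index(confidence)
        total += -math.log(p_strong[pseudo_label])
    # Average over ALL unlabeled examples, including the ones that were skipped.
    return total / len(weak_probs)


# Two hypothetical unlabeled images with three classes each.
weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]
strong = [[0.60, 0.30, 0.10], [0.20, 0.70, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

Only the first image passes the confidence threshold here, so the second image is ignored for this step.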

Reformer: The Efficient Transformer

Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya Presenter: Hongbo Tian

A visual depiction of attention with locality sensitive hashing. By hashing and then bucketing keys, attention doesn’t need to handle all pair-wise keys.

Training good Transformers can be prohibitively expensive: the largest version of GPT-2 contains 1.5 billion parameters, while NVIDIA’s Megatron-LM has up to 8.3 billion parameters. The authors of Reformer present a model that performs on par with preceding Transformers while being more memory-efficient and faster to train. Because of its lower resource requirements, Reformer makes Transformers more practical to use and lets them handle longer sequences.

Rather than using the standard dot-product attention, Reformer uses a less expressive attention mechanism combined with locality-sensitive hashing, which reduces the complexity of attention. The classical self-attention mechanism attends to all pairs of elements in the input sequence, giving it quadratic complexity in the sequence length. By reducing the expressiveness of self-attention, Reformer can use locality-sensitive hashing to bucket similar elements together and attend only within buckets, which brings self-attention down to linearithmic complexity. This mechanism ultimately enables Reformer to process the entirety of Crime and Punishment in a single training example.
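The bucketing step can be illustrated with a small angular hashing sketch. It follows our understanding of the scheme (project each vector with a random matrix, then take the index of the largest entry among the projections and their negations), but the helper names, dimensions, and seed are hypothetical, and the sorting, chunking, and multi-round hashing of the full model are left out.

```python
import random


def make_projections(dim, n_buckets, seed=0):
    """Random Gaussian projection rows; n_buckets // 2 rows yield n_buckets buckets."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_buckets // 2)]


def lsh_bucket(vector, projections):
    """Hash a vector to the index of the largest entry of [xR ; -xR]."""
    rotated = [sum(v * r for v, r in zip(vector, row)) for row in projections]
    scores = rotated + [-s for s in rotated]
    return scores.index(max(scores))


def bucket_positions(vectors, projections):
    """Group sequence positions by bucket; attention would run within each group."""
    buckets = {}
    for pos, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(pos)
    return buckets


projections = make_projections(dim=4, n_buckets=8)
vectors = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [-1.0, -2.0, -3.0, -4.0]]
buckets = bucket_positions(vectors, projections)
```

Vectors that point in the same direction always share a bucket, while opposite vectors never do. Vectors separated by a small angle share a bucket with high probability, which is what lets attention skip most pairs.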

The Case for Learned Index Structures

Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis Presenter: Hongbo Tian

A comparison of a traditional and learned hash map. Learned data structures replace the hashing function with a model that achieves better performance.

Many classic data structures (such as B-Trees, hash maps, and Bloom filters) rely on some index structure to map a key to the position of a record. However, an index structure can be generalized as a model that takes in a key and infers information about a data record. The authors explore replacing traditional index structures with deep-learning models, and their results demonstrate that machine learning can greatly increase the speed of classic data structures. This paper is notable because it exposes opportunities for large performance improvements in data structures that have already been optimized for decades. It also serves as a reminder to our team at Scale AI that results can exceed our initial expectations and that machine learning can have very non-obvious use cases.
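As a toy illustration of "index as a model," the sketch below fits a straight line from key to position over a sorted array, records the worst prediction error, and then searches only within that error window. The `LearnedIndex` class is our own simplification: a single linear regression stands in for the more elaborate models explored in the paper.

```python
import bisect


class LearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a sorted list."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Least-squares fit of position ~= slope * key + intercept.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(sorted_keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst-case prediction error bounds the search window at lookup time.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if the key is absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None


index = LearnedIndex([3, 8, 15, 21, 42, 57, 64, 90])
```

Because the model's error is measured on the stored keys, every stored key is guaranteed to fall inside its search window. The better the model fits the key distribution, the smaller that window becomes.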

PointRend: Image Segmentation as Rendering

Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick Presenter: Felix Lau

Inference results on a scene with and without PointRend. PointRend is able to handle the edges, such that it properly segments the side mirrors of a car.

Image segmentation can suffer in accuracy around the borders between classes and in other difficult regions. To address this, the authors of this paper create a neural network module called PointRend that adaptively runs point-based segmentation at higher resolution at selected locations, leading to crisper boundaries than previous segmentation techniques. PointRend can be applied on top of existing instance and semantic segmentation models. At Scale AI, we’ve found that getting the edges between classes right is most critical when it comes to producing high-quality labels for 2D segmentation.
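One way to picture the "selected locations" is to pick the pixels where a coarse mask is least certain. The sketch below does this for a binary mask by ranking pixels by how close their foreground probability is to 0.5. The function name and the toy probability map are our own, and the real module goes on to re-predict those points from fine-grained features, which is not shown here.

```python
def most_uncertain_points(prob_map, k):
    """Return the (row, col) of the k pixels whose probability is closest to 0.5."""
    scored = [
        (abs(p - 0.5), (r, c))
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0])
    return [point for _, point in scored[:k]]


# Hypothetical coarse foreground probabilities: object on the left, background on the right.
coarse = [
    [0.99, 0.90, 0.02],
    [0.98, 0.55, 0.01],
    [0.97, 0.45, 0.03],
]
points = most_uncertain_points(coarse, k=2)
```

The two selected points sit on the boundary column, which is where extra computation is most likely to change the final label.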

RangeNet++: Fast and Accurate LiDAR Semantic Segmentation

Authors: Andres Milioto, Ignacio Vizzo, Jens Behley, Cyrill Stachniss Presenter: Daniel Havir

Diagram of the various states of data representation during LiDAR segmentation. Each letter corresponds to a step in the approach that RangeNet++ takes.

High-level semantic perception tasks in autonomous vehicles are primarily solved with high-resolution cameras, even though a wider variety of sensor modalities is available. At Scale AI, we are familiar with the problems defined in this paper, which details an approach to improving LiDAR-only semantic segmentation. The authors use a 4-step approach:

  1. Reduce the LiDAR point cloud to two-dimensional range images that encode the Euclidean distance along with the original coordinates and LiDAR remission.
  2. Run the range image representation through a 2D semantic segmentation model.
  3. Project the model output back to the original point cloud format.
  4. Post-process the output using a novel KNN approach to account for occlusion errors introduced with the 2D projection.
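
Step 1 above can be sketched as a spherical projection. This is a minimal illustration under our own assumptions: the image size and vertical field of view are made-up values (a real sensor would dictate them), the function name is hypothetical, and when several points land on the same pixel we simply keep the closest one.

```python
import math


def project_to_range_image(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z, remission) points onto a height x width range image."""
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    image = [[None] * width for _ in range(height)]
    for x, y, z, remission in points:
        rng = math.sqrt(x * x + y * y + z * z)  # Euclidean distance to the sensor
        if rng == 0.0:
            continue
        yaw = math.atan2(y, x)
        pitch = math.asin(z / rng)
        # Map yaw to a column and pitch to a row, then clamp to the image bounds.
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        current = image[row][col]
        if current is None or rng < current[0]:
            # Each pixel stores range, original coordinates, and remission.
            image[row][col] = (rng, x, y, z, remission)
    return image


cloud = [(1, 0, 0, 0.5), (2, 0, 0, 0.9), (-1, 0, 0, 0.3)]  # hypothetical points
image = project_to_range_image(cloud)
```

The second point is hidden behind the first and gets dropped from the image. Losing points this way is one reason the final step needs to propagate labels back to every point in the original cloud rather than only to the ones that survived projection.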

Fast Differentiable Sorting and Ranking

Authors: Mathieu Blondel, Olivier Teboul, Quentin Berthet, Josip Djolonga Presenter: Chiao-Lun Cheng

Illustration of the soft sorting and ranking operators.

Sorting and ranking are basic and common operations in computer science and machine learning. However, they do not provide useful gradients: ranks, for example, are piecewise constant in the input, so their derivatives are zero or undefined. The authors propose the first differentiable sorting and ranking operators that retain O(n log n) time and O(n) space complexity. To do so, they construct the operators as projections onto a permutohedron (the convex hull of all permutations of a vector) and reduce each projection to an isotonic optimization problem.
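The reduction to isotonic optimization can be sketched for the soft ranking operator. The code below follows our reading of the paper's quadratically regularized variant: project -θ/ε onto the permutohedron of (n, n-1, ..., 1) by sorting, solving a decreasing isotonic regression with pool-adjacent-violators, and undoing the sort. The function names are our own, and this plain-Python version computes only the forward pass, not the gradients.

```python
def isotonic_decreasing(y):
    """Least-squares fit to y subject to v[0] >= v[1] >= ... (pool adjacent violators)."""
    blocks = []  # each block holds [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Merge while the previous block's mean is smaller than the latest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def soft_rank(theta, eps=1.0):
    """Soft ranks of theta; the largest value gets a rank near 1 when eps is small."""
    n = len(theta)
    z = [-t / eps for t in theta]
    rho = list(range(n, 0, -1))  # (n, n-1, ..., 1)
    order = sorted(range(n), key=lambda i: z[i], reverse=True)  # sorts z descending
    s = [z[i] for i in order]
    v = isotonic_decreasing([si - wi for si, wi in zip(s, rho)])
    ranks = [0.0] * n
    for pos, i in enumerate(order):
        ranks[i] = z[i] - v[pos]  # undo the sort while subtracting the fitted values
    return ranks


hard_like = soft_rank([2.0, 0.0, 1.0], eps=0.01)
smooth = soft_rank([2.0, 0.0, 1.0], eps=100.0)
```

With a small `eps` the output approaches the usual integer ranks, while a large `eps` pulls every rank toward the average rank. In both cases the soft ranks sum to n(n+1)/2, since the projection stays on the permutohedron.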

If you’re interested in joining our growing engineering team, take a look at our careers page for open positions.