The ML team at Scale hosts a weekly reading group where members choose papers from the broad AI/ML community and discuss them, with topics ranging from Computer Vision to NLP to Active Learning. Here, we briefly summarize some of the insights we gained from various papers and how we aim to use that knowledge in future research projects and applications to Scale AI’s business.
This is Scale’s first research paper and we wrote a blog post summarizing the paper. This paper focuses on whether we can adapt pre-trained language models off-the-shelf as universal decoders. To be considered “universal,” a decoder must have an implicit representation for any target sentence such that it can recover that sentence exactly when conditioned on its representation. We investigated whether such representations exist and whether they can be easily discovered. Experiments show that not only do these representations exist for sentences from a variety of different genres, but also that our methods can recover these sentences almost perfectly without fine-tuning the underlying language model at all.
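One way to write down the search for such a representation is sketched below. The notation is our own assumption for illustration, not necessarily the paper's: $p_\theta$ is the frozen pre-trained language model, $x = (x_1, \dots, x_T)$ is the target sentence, and $z$ is the representation being searched for.

```latex
z^{*} = \arg\max_{z} \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}, z\right),
\qquad \theta \text{ held fixed}

\hat{x}_t = \arg\max_{w} \; p_{\theta}\left(w \mid \hat{x}_{<t}, z^{*}\right),
\qquad \text{exact recovery means } \hat{x} = x
```

Under this reading, only $z$ is optimized, which matches the claim that the underlying language model is not fine-tuned.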
The authors of this paper look at the problem of nearest neighbor search with respect to Wasserstein distance. As Wasserstein distance has become popular in high-resource data environments and domains (images, text, etc.), exact search has become prohibitively slow, so approximation methods are important. The authors introduce a Quadtree variant called Flowtree, show formally that it has better asymptotic accuracy, and show empirically that it beats existing methods on a variety of real-world datasets on either running time or accuracy.
Flowtree is a fast nearest neighbor search algorithm for Optimal Transport (Wasserstein-1 distance). It has a linear running time and is up to 7x faster than the previous SOTA. It achieves this by computing the optimal flow on the tree and then evaluating the cost of that flow in the original metric space.
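The idea of "optimal tree flow, evaluated in the original metric" can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the authors' implementation: points live in the unit square, the quadtree has a fixed depth with no random shift, and the name `flowtree_estimate` is hypothetical.

```python
import math
from collections import defaultdict

EPS = 1e-12  # tolerance for treating leftover mass as zero


def flowtree_estimate(mu, nu, depth=8):
    """Estimate Wasserstein-1 between two distributions on [0, 1)^2.

    mu, nu: dicts mapping (x, y) points to non-negative mass, with equal totals.
    """
    # Mass that has not been matched yet, stored as mutable [point, mass] pairs.
    rest_mu = [[p, m] for p, m in mu.items()]
    rest_nu = [[p, m] for p, m in nu.items()]
    cost = 0.0
    # Walk the quadtree from the finest cells (leaves) up to the root.
    for level in range(depth, -1, -1):
        cells = 2 ** level
        buckets = defaultdict(lambda: ([], []))
        for side, rest in enumerate((rest_mu, rest_nu)):
            for entry in rest:
                x, y = entry[0]
                key = (min(int(x * cells), cells - 1),
                       min(int(y * cells), cells - 1))
                buckets[key][side].append(entry)
        # Inside each cell, greedily match as much mass as possible.
        for mu_cell, nu_cell in buckets.values():
            i = j = 0
            while i < len(mu_cell) and j < len(nu_cell):
                a, b = mu_cell[i], nu_cell[j]
                moved = min(a[1], b[1])
                # Cost uses the true Euclidean distance, not the tree distance.
                cost += moved * math.dist(a[0], b[0])
                a[1] -= moved
                b[1] -= moved
                if a[1] <= EPS:
                    i += 1
                if b[1] <= EPS:
                    j += 1
        # Whatever is still unmatched moves up to the parent cell.
        rest_mu = [e for e in rest_mu if e[1] > EPS]
        rest_nu = [e for e in rest_nu if e[1] > EPS]
    return cost
```

Because the matching found on the tree is a feasible flow between the two distributions, pricing it with Euclidean distances gives an upper bound on the exact Wasserstein-1 distance.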
At Scale AI, we work with lots of high-dimensional data, such as documents and images, for a variety of tasks. We often need to find good representation spaces for these data points so that we can run fast and highly accurate similarity search. That search lets us find outliers in datasets, catch labeling errors, and pick examples to label so that we can improve our models using active learning. We think the Flowtree method could be invaluable in letting us use a different distance for this purpose, and it is something we will explore going forward.
Structure-from-Motion (SfM) is a family of computer vision techniques that aim to reconstruct full 3-dimensional scenes from only a series of images. An accurate SfM algorithm can produce colored point clouds similar to those from LiDAR. The authors of this paper, along with Schönberger’s thesis, introduced a robust computational pipeline for accurate dense 3D reconstruction. The authors also introduced COLMAP, an open-source, CUDA-accelerated software package for reproducing the results.
At Scale AI, our labeling pipeline is not subject to the same online computational constraints as our customers’ systems. This allows us to pursue batch methods to enrich our 2D datasets, enabling our human labelers to capture 3D context from images alone.
The authors aim to extract structured information from form-like documents. They observe that form fields often correspond to well-understood types, and that each field is often associated with a specific key phrase that bears a visual relationship to it. They also observe that key phrases in documents are drawn from a small vocabulary. Based on these observations, the authors build a system with two parts: candidate generation and scoring & assignment.
In the candidate generation phase, each field type is associated with a candidate generator built on a cloud-based entity extraction service, which detects spans of that type. For example, an invoice may contain multiple dates; every date in the invoice becomes a candidate for every date field in the target schema. This process is repeated for every target field.
In the scoring and assignment phase, the goal is to find the correct extraction candidate for each field. To do this, a score is computed for each candidate independently using a neural model. Next, the highest-scoring candidate is assigned to each field. This process yields a learned representation for each candidate based only on that candidate’s neighborhood, entirely independent of other candidates or fields.
[Figure: the neural scoring model]
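The two phases above can be sketched as a toy pipeline. Everything here is an illustrative assumption rather than the paper's system: a regular expression stands in for the cloud-based entity extraction service, and a keyword-overlap count stands in for the neural scoring model. The names `generate_date_candidates`, `score`, and `assign` are hypothetical.

```python
import re

# Stand-in candidate generator: treat ISO-formatted tokens as date candidates.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def generate_date_candidates(tokens):
    """Return the indices of tokens that look like dates."""
    return [i for i, tok in enumerate(tokens) if DATE_RE.fullmatch(tok)]


def score(tokens, idx, key_phrases, window=3):
    """Toy score: count key-phrase words among the tokens just before idx.

    Like the model described above, it looks only at the candidate's
    neighborhood and never at other candidates or fields.
    """
    start = max(0, idx - window)
    neighbors = {t.lower().strip(":") for t in tokens[start:idx]}
    return len(neighbors & key_phrases)


def assign(tokens, schema):
    """Score every candidate for every field; keep the best one per field."""
    candidates = generate_date_candidates(tokens)
    result = {}
    for field, key_phrases in schema.items():
        if candidates:
            best = max(candidates, key=lambda i: score(tokens, i, key_phrases))
            result[field] = tokens[best]
    return result


tokens = "Invoice Date: 2020-03-01 Amount 120.00 Due Date: 2020-03-31".split()
schema = {"invoice_date": {"invoice"}, "due_date": {"due"}}
extracted = assign(tokens, schema)
```

Both dates are candidates for both fields; the neighborhood score is what separates the invoice date from the due date.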
Our major takeaways are that structured extraction is not super well-studied by the academic community, but is relatively simple. The extraction system presented has promising accuracy and generalizes well to different domains. The representations have some interpretability as well. Ultimately some of the learnings from this paper could influence some of our work on Scale Document.
This paper focuses on how to transfer pre-trained representations to other tasks in the visual space. They propose a recipe called Big Transfer (BiT) and achieve very strong performance on ILSVRC-2012 (87.5% top-1 accuracy), CIFAR-10 (99.4%), and the Visual Task Adaptation Benchmark (VTAB; 76.3%).
BiT has two phases: upstream and downstream. In upstream pre-training, the authors first investigate the scale of the computational budget and how it affects performance. Second, they look at group normalization and weight standardization, which help significantly with small-batch and large-batch training alike. In downstream transfer, the authors propose a cheap fine-tuning methodology that uses the BiT-HyperRule to select the most important hyperparameters for tuning as a function of the task’s intrinsic image resolution and number of datapoints. The hyperparameters they considered important were training schedule length, resolution, and whether to use MixUp regularization.
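For readers unfamiliar with weight standardization, the core operation is small enough to sketch in plain Python. This is a minimal illustration, assuming each row holds the flattened weights of one output channel; the function name and the `eps` value are our own choices, not taken from the BiT paper.

```python
import math


def standardize_weights(weights, eps=1e-5):
    """Rescale each output filter (row) to zero mean and unit variance.

    weights: list of rows, one row of flattened weights per output channel.
    """
    standardized = []
    for row in weights:
        mean = sum(row) / len(row)
        var = sum((w - mean) ** 2 for w in row) / len(row)
        scale = math.sqrt(var + eps)  # eps guards against division by zero
        standardized.append([(w - mean) / scale for w in row])
    return standardized


kernel = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
ws_kernel = standardize_weights(kernel)
```

The standardization is applied to the weights rather than the activations, so it does not depend on batch statistics.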
There are a few takeaways from this method we found interesting. There’s a lot of value in balancing the computational budget and simplifying the experimental pipeline when needed. The BiT-HyperRule focusing on only a few hyperparameters was illuminating. We were interested in the dynamics of how large batches, group normalization, and weight standardization interplayed and were surprised at how poorly batch normalization performed relative to group normalization and weight standardization for large batches. The empirical strength of group normalization and weight standardization surprised us as well and we will start exploring these techniques in some of our computer vision models with lots of data available. However, the most impressive part was the empirical few-shot learning performance relative to the baseline. We are trying to build models that quickly adapt to new customers and domains and often struggle with fine-tuning, so these ideas are very top of mind.
The authors present a task-agnostic active learning method: they attach a loss prediction module to a target network and train it to predict the target losses of unlabeled inputs. The module can then flag data that the model is likely to predict incorrectly, and the overall model improves by choosing just those examples to label.
Their general process is as follows. For every labeled example, the target network makes its prediction and the loss prediction module predicts the loss of that prediction. Next, you pass the unlabeled pool through the model to get predicted losses. Lastly, you annotate the top-k data points by predicted loss and add them to the labeled training set. The loss prediction module uses a ranking loss.
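The ranking loss and the selection step can be sketched as follows. This is one common pairwise margin form, shown under our own assumptions; the function names and the default `margin` are hypothetical, and a real implementation would operate on batches of tensors.

```python
def pairwise_ranking_loss(pred_i, pred_j, true_i, true_j, margin=1.0):
    """Margin ranking loss for one pair of predicted losses.

    It is zero when the predicted losses are ordered the same way as the
    true losses and are at least `margin` apart.
    """
    sign = 1.0 if true_i > true_j else -1.0
    return max(0.0, -sign * (pred_i - pred_j) + margin)


def select_top_k(predicted_losses, k):
    """Return indices of the k unlabeled examples with the highest predicted loss."""
    ranked = sorted(
        range(len(predicted_losses)),
        key=lambda i: predicted_losses[i],
        reverse=True,
    )
    return ranked[:k]
```

A ranking loss only asks the module to order examples correctly, which is all the top-k selection needs; the absolute scale of the predicted loss does not matter.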
As a data labeling company, we are very interested in active learning and how we can leverage our own strengths to better build ML models. We really liked the approach because it is widely applicable, but were concerned that there could be many degenerate cases. We weren’t convinced that the loss prediction module would be strong enough to offer significant improvements over uncertainty or coverage based methods such as entropy, query-by-committee, and coreset based methods.
The authors try to solve the problem of automated understanding of scanned business documents. They combine computer vision and natural language processing techniques, using both textual and visual layout information to pre-train their system. Their system achieves SOTA on a set of layout analysis, receipt understanding, and document image classification tasks. They attribute their empirical performance to their novel loss functions: a masked visual-language model (MVLM) loss and a multi-label document classification (MDC) loss.
MVLM extends the masked language model loss from BERT by randomly choosing some words to mask. The authors keep each masked token’s 2D positional embedding and force the model to predict the word’s text. This pushes the model to understand both language context and the 2D relationships between words in the document. MDC assumes that a document may have several labels and is a standard cross-entropy loss. It is optional and seems to help on the document image classification tasks.
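The masking step of MVLM can be illustrated with a short sketch. It is a simplification under our own assumptions: the `[MASK]` string, the `mask_prob` default, and the function name `mask_words` are illustrative, and real pre-training works on token IDs and embeddings rather than raw strings.

```python
import random

MASK = "[MASK]"


def mask_words(words, boxes, mask_prob=0.15, seed=0):
    """Mask word text at random while leaving every bounding box untouched.

    Returns (masked_words, boxes, targets), where targets maps each masked
    position to the original word the model would have to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK)  # hide the text...
            targets[i] = word
        else:
            masked.append(word)
    return masked, list(boxes), targets  # ...but keep the 2D positions


words = ["Total", "Due", "$120.00"]
boxes = [(50, 700, 110, 720), (115, 700, 150, 720), (400, 700, 470, 720)]
masked_words, kept_boxes, targets = mask_words(words, boxes, mask_prob=1.0)
```

Since the boxes survive masking, the model can use where a hidden word sits on the page, in addition to its neighbors' text, to guess it.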
We found this paper interesting, but computationally demanding: alongside their loss functions, a key contribution of this paper is their pre-training routine. Without significant pre-training, these models do not seem to work well. This work requires off-the-shelf OCR and treats its output as ground truth, which simplifies the work but makes it less practical for document analysis practitioners. For the business document tasks that our ML team takes on, we often like to have greater control over the OCR task. It's sometimes advantageous to be able to adjust OCR to provide better downstream, task-dependent results. We do a lot of document analysis at Scale, but can’t obfuscate or explain away the OCR component, which does produce some errors (e.g. with handwriting). Although this approach needs word-level annotations, Scale is in a unique position to get large amounts of these labor-intensive labels using our labeling platform.
This paper focuses on tasks whose outputs have to obey some constraints. This happens, for example, in pseudo-code to code transcription, where what we ultimately want is code that compiles and runs without error. The usual way to address these tasks is to build a single end-to-end model and hope that it will learn the input-output mapping and understand the output constraints. The authors provide a new way to tackle these tasks by replacing the original model with two sub-models: a base predictor and a denoiser. The goal of the base predictor is to learn the mapping between the input and the output. The goal of the denoiser is to learn the constraints of the output space, in order to fix the base predictor’s output.
There is a reason why this decomposition works well. Because the denoiser simplifies the job of the base predictor, it becomes possible to increase the regularization of the base predictor without affecting the overall model performance. This in turn leads to solutions that generalize better on unseen data. The big win of this architecture is that the denoiser can be trained on unlabeled output data, which is often abundant (e.g. there is a huge amount of compilable code freely available on GitHub).
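The composition itself is simple to picture. The toy sketch below is not the paper's method: a lookup table stands in for the learned base predictor, and a nearest-match search over known-valid outputs stands in for the learned denoiser. All names and data are hypothetical.

```python
import difflib

# Toy stand-in for a learned base predictor: it gets the mapping roughly
# right but may violate output constraints (note the missing parenthesis).
BASE_OUTPUTS = {"print x": "print(x", "set x to 1": "x = 1"}

# Unlabeled outputs known to be valid. The denoiser only ever sees outputs,
# never input-output pairs.
VALID_OUTPUTS = ["print(x)", "x = 1", "return x"]


def base_predictor(pseudo_code):
    """Map an input to a (possibly invalid) output."""
    return BASE_OUTPUTS[pseudo_code]


def denoiser(noisy_output):
    """Project a noisy output onto the closest known-valid output."""
    matches = difflib.get_close_matches(noisy_output, VALID_OUTPUTS, n=1, cutoff=0.0)
    return matches[0]


def composed(pseudo_code):
    """Full model: the denoiser repairs whatever the base predictor emits."""
    return denoiser(base_predictor(pseudo_code))
```

Here `composed("print x")` repairs the unbalanced parenthesis, while an already valid output such as `x = 1` passes through unchanged.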
A large number of tasks impose constraints on the output (such as molecule generation, or translation from one language to another). At Scale, our ML team also tackles such problems. This paper provides a very general way to increase our downstream scores on them.