One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled data sets used to train modern ML models. Creating these data sets requires significant investments of time and expense, as well as annotators with the right expertise. Moreover, as real-world applications evolve, labeled data sets often need to be thrown out or relabeled.
In collaboration with Stanford University and Brown University, Google presents "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale," which explores how existing knowledge in an organization can be used as noisier, higher-level supervision (often termed weak supervision) to quickly label large training data sets. This study uses an experimental internal system, Snorkel DryBell, which adapts the open-source Snorkel framework to use diverse organizational knowledge resources, such as internal models, ontologies, legacy rules, knowledge graphs, and more, to generate training data for ML models at web scale. This approach can match the efficacy of hand-labeling tens of thousands of data points, and it reveals some core lessons about how training data sets for modern ML models can be created in practice.
Rather than requiring hand-labeled training data, Snorkel DryBell enables users to write labeling functions that label training data programmatically. This work explored how these labeling functions can capture engineers' knowledge about how to use existing resources as heuristics for weak supervision. As an example, suppose the goal is to identify content related to celebrities. One can leverage an existing named-entity recognition (NER) model for this task by labeling any content that does not mention a person as not related to celebrities. This illustrates how existing knowledge resources (in this case, a trained model) can be combined with simple programmatic logic to label training data for a new model. Note also, importantly, that this labeling function returns None (i.e., abstains) in many cases and thus labels only a small part of the data; the overall goal is to use these labels to train a modern ML model that can generalize to new data.
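As a concrete illustration, the following is a minimal Python sketch of such a labeling function. The helper contains_person and the label constant NOT_CELEBRITY are hypothetical stand-ins for an organization's existing NER model and label scheme; they are not Snorkel DryBell's actual API.

```python
NOT_CELEBRITY = 0  # Hypothetical label constant for the negative class.

def contains_person(text):
    """Placeholder for an existing, pre-trained NER model.

    A real deployment would call the organization's NER service here and
    check whether any PERSON entities were extracted.
    """
    return "PERSON" in text  # Stand-in logic for illustration only.

def no_person_not_celebrity(x):
    """Labeling function: content mentioning no people is not celebrity-related."""
    if not contains_person(x["content"]):
        return NOT_CELEBRITY
    # A person mention alone is uninformative either way, so abstain.
    return None
```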
This programmatic interface for labeling training data is much faster and more flexible than hand-labeling individual data points, but the resulting labels are obviously of much lower quality than manually specified labels. The labels generated by these labeling functions will often overlap and disagree: the labeling functions not only have arbitrary, unknown accuracies, but may also be correlated in arbitrary ways (for example, because they share a common data source or heuristic).
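To make the overlap-and-disagreement problem concrete, consider a toy label matrix. The encoding below (rows are data points, columns are labeling functions, -1 meaning abstain) follows the convention of the open-source Snorkel library; the numbers themselves are invented for illustration.

```python
import numpy as np

# Toy label matrix: 4 data points x 3 labeling functions.
# Each entry is a vote: 1 = celebrity-related, 0 = not, -1 = abstain.
L = np.array([
    [ 1,  1, -1],  # two functions agree; the third abstains
    [ 0,  1,  0],  # the middle function disagrees with the other two
    [-1, -1,  0],  # sparse coverage: only one function fires
    [ 1, -1,  1],  # overlapping votes that happen to agree
])

# Pairwise agreement rates (computed only where both functions vote) are
# the raw signal used to estimate accuracies and correlations.
both_vote = (L[:, 0] != -1) & (L[:, 1] != -1)
print((L[both_vote, 0] == L[both_vote, 1]).mean())  # 0.5
```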
To solve the problem of noisy and correlated labels, Snorkel DryBell uses a generative modeling technique to automatically estimate the accuracies and correlations of the labeling functions in a provably consistent way, without any ground-truth training labels, and then uses these estimates to reweight and combine the labeling functions' outputs into a single probabilistic label per data point. At a high level, this relies on the observed agreements and disagreements between the labeling functions (the covariance matrix of their outputs): a new matrix completion-style approach learns the labeling-function accuracy and correlation parameters that best explain the observed outputs. The resulting probabilistic labels can then be used to train an arbitrary model.
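Snorkel DryBell's internal implementation is not public, but the open-source Snorkel library exposes the same step through its LabelModel class (Snorkel 0.9+ API). The sketch below, reusing the toy matrix from the earlier example, fits the generative label model and produces one probabilistic label per data point.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy label matrix: rows are data points, columns are labeling functions;
# entries are class votes (0 or 1) or -1 for abstain.
L_train = np.array([
    [ 1,  1, -1],
    [ 0,  1,  0],
    [-1, -1,  0],
    [ 1, -1,  1],
])

# Fit the generative label model: it estimates each labeling function's
# accuracy from the observed agreements and disagreements, with no
# ground-truth labels required.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)

# One probability distribution over classes per data point. These soft
# labels can be used to train any downstream discriminative model.
probs = label_model.predict_proba(L_train)
print(probs)
```

Because the output is simply a set of probabilistic labels, the downstream model can be anything from logistic regression to a large neural network, which is what allows it to generalize beyond the labeling functions' coverage.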