😲 Generative Data-Driven Programming

Oliver Jack Dean

For many real-world applications, creating labelled training datasets is the most time-consuming and expensive part of applying deep Machine Learning (ML) models.

With this, labelling and carving training subsets out of larger datasets for ML models is a very slow process.

Users often need to train supervised models over large hand-labelled training datasets, which are “expensive to create” and often require “labellers to be experts in the application domain”.

Furthermore, many practitioners have adopted approaches that combine “weak supervision strategies” to collect training data for deep ML models such as Neural Networks.

Often, when implementing this approach, the resulting Deep Learning (DL) models become limited in scope and range because the labels the training dataset provides are noisy.

How do you get a large enough set of training data to power modern DL models? This is a central problem for ML, and one that continues to dominate research in the field.

Some papers I have been reading recently, most notably this paper here and here, propose new generative probabilistic models to tackle this problem.

Each model aims to improve the accuracy of the labelling process used to collect training data.

Such research builds upon the trial and error of “Distant Supervision”, research conducted by Ratner et al. (such as their Snorkel model), and the Sparser project, developed by the fine folks at Stanford.

Snorkel is pretty impressive

It combines and interweaves “weak supervision strategies” for collecting training data from “heuristic models”, external “knowledge bases”, and “crowd-sourced workers”, aiming to label subsets and classify data more accurately.

In the past, relying solely upon a combination of “weak supervision strategies” to accumulate training data for deep ML models was very restrictive.

The question now is how to balance accuracy and performance, and how to find a model that can aggregate between the two. This is where a model like Snorkel comes into the picture.

Snorkel allows users to define “labelling functions”: essentially Python functions that encode heuristics, lookups against external knowledge bases, and other weak supervision sources.

Given an input (a data point from some source), each labelling function independently either outputs a label or abstains by returning null.
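
To make that concrete, here is a minimal sketch in plain Python. This is illustrative only, not Snorkel's actual API; the spam-detection task, the function names, and the use of None for abstention are my assumptions.

```python
# Illustrative sketch only -- not Snorkel's actual API. Two toy labelling
# functions for a hypothetical spam-detection task: each inspects a data
# point and either emits a label or abstains by returning None.

SPAM, NOT_SPAM, ABSTAIN = 1, 0, None

def lf_contains_link(text: str):
    # Heuristic: messages containing URLs are often spam.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text: str):
    # Heuristic: very short messages are rarely spam.
    return NOT_SPAM if len(text.split()) < 5 else ABSTAIN

LFS = [lf_contains_link, lf_short_message]

def apply_lfs(data_points, lfs=LFS):
    # Applying every labelling function to every data point yields a
    # label matrix: one row per data point, one column per function.
    return [[lf(x) for lf in lfs] for x in data_points]
```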

An accurate labelling function contributes useful signal; an inaccurate one injects noise that is actively damaging.

What is interesting is that once the labelling functions have finished executing over unlabelled data, Snorkel takes the cumulative ratio of the positive and negative labels assigned and, based on that percentage, decides whether to use either:

  • A) a majority voting model; or
  • B) a generative model of each labelling function's accuracy, used to perform weighted voting; i.e. to recover more accurate labels for the unlabelled data (a rough sketch of both options follows this list).
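
As a rough sketch of the difference between the two options, building on the label matrix from the earlier snippet (again plain Python, not Snorkel's actual implementation; the accuracies here are simply assumed to be given):

```python
from collections import Counter

def majority_vote(row):
    # Option A: take the most common non-abstain label in one row of the
    # label matrix; return None if every labelling function abstained.
    votes = [label for label in row if label is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

def weighted_vote(row, accuracies):
    # Option B (sketch): weight each function's vote by its estimated
    # accuracy. Snorkel learns these accuracies with a generative model,
    # without ground truth; here they are assumed to be given.
    scores = {}
    for label, acc in zip(row, accuracies):
        if label is not None:
            scores[label] = scores.get(label, 0.0) + acc
    return max(scores, key=scores.get) if scores else None
```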

In doing so, Snorkel can “select features” and account for “dependencies” between labelling functions, and thus become more specialised.

Using one of these two options, it calculates a generative model that, over time, eliminates noisy or conflicting data and gradually labels more data accurately and efficiently in parallel.
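
Snorkel's actual generative model is considerably more sophisticated (a factor graph trained without ground-truth labels), but a crude alternating scheme conveys the intuition: score each labelling function by its agreement with the current consensus, then recompute the consensus with weighted votes. This reuses weighted_vote from the sketch above and is purely illustrative:

```python
def estimate_accuracies(L, n_iter=10):
    # Toy scheme, loosely in the spirit of Snorkel's generative model:
    # alternate between (1) forming consensus labels by weighted voting
    # and (2) re-scoring each labelling function by how often it agrees
    # with that consensus when it does not abstain.
    n_lfs = len(L[0])
    accuracies = [0.5] * n_lfs  # uninformative starting point
    for _ in range(n_iter):
        consensus = [weighted_vote(row, accuracies) for row in L]
        for j in range(n_lfs):
            hits, total = 0, 0
            for row, y in zip(L, consensus):
                if row[j] is not None and y is not None:
                    total += 1
                    hits += int(row[j] == y)
            accuracies[j] = hits / total if total else 0.5
    return accuracies
```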

As a result, training datasets become cleaner and larger in both scope and range.

A more technical breakdown can be found here.

As you can imagine, this has sparked a great deal of interest from many different institutions and research organisations.

Furthermore, it has given rise to a new paradigm called “Data Programming” or “Label Engineering”, in which users compose small functions to more accurately accumulate and classify vast sums of data from different data sources.

This in turn produces a generative model that enables the collection of training data under constrained conditions.

In Summary

In a nutshell, Data Programming or Label Engineering is trying to eliminate specific challenges around the following well-known problems associated with accumulating training data for deep ML models:

“Hand-labelled” training data - as touched upon previously - is prohibitively expensive to obtain in sufficient quantities and requires expensive domain expert labellers [...]

“Related external knowledge bases” - are either unavailable or insufficiently specific, precluding a traditional distant supervision or co-training approach.

Snorkel is still at the experimental stage, but I imagine adoption will pick up over time.

The use cases highlighted in the research papers report that participants needed, on average, 4.5 hours of instruction on how to use and evaluate models, followed by 2.5 hours to program a series of labelling functions.

In particular, “the majority (8) of subjects matched or outperformed these hand-labelled data models” in under 8 hours.

Although there are still limits, projects such as Snorkel and Sparser are making great strides, and it is interesting to see how generative approaches, in conjunction with deep learning models, provide alternative automatic feature generation techniques for accumulating training data.

In addition, questions are beginning to appear regarding how Data Programming will affect businesses and institutions in the near future.

Aside from the more theoretical problems associated with ML, an additional motive for research projects like Snorkel is to find a way for users “without expertise in ML/DL” to be “more productive iterating” through vast sums of data by programming independent “labelling functions”.

I can imagine this would be welcomed within domains such as bioinformatics & medical software houses.