NLP Innovations with Small Datasets
Deep Learning (DL) neural networks rest on a basic foundation: large corpora of data that have been curated and labeled so that models can be trained on them.
With large data sets, appropriate labeling of the training data, and some test data added to the mix, the results can be nothing short of magical, or at least appear to be so. Starting from ImageNet, a popular large, curated, and labeled data set, machine learning (ML) models for image characterization and object recognition within images have reached what looks like extraordinary maturity. Predictions about what an image is, or what is happening within it, show great promise, as the many examples available on the Internet demonstrate, but there are also large gaps and missed predictions. This is largely because the training data lacks some large categories of data.
Even Natural Language Processing (NLP) deep learning models trained on large data, like GPT-3, introduced in May of 2020, demonstrate great promise but also have flaws and gaps. GPT-3 was trained on roughly 300 billion tokens (e.g. words or subwords), most of which come from the CommonCrawl corpus, an enormous dataset of website crawl data. My point in mentioning these models is that deep learning is premised on big data and, in many cases, on data that has been painstakingly labeled.
However, most business problems aren’t backed by big data; at least, not data large enough to train deep learning models for effective prediction results. If most real business problems are backed by “small data,” then how does one use deep learning techniques to help solve them? Or, to be a bit more specific, how might one use NLP techniques to help with automation tasks when the data sets of the business problem are “small” compared to the much-publicized “large data” models that are demonstrated to great fanfare seemingly all around us?
At DeepSee we have two views on this real problem of small data backing large customer efficiency challenges. The first is something we call, internally, our ML Pipeline. The second is our heavily fine-tuned FILBERT set of models. Because FILBERT is described in another blog post, I won’t describe it in detail here; instead, I will describe a higher-level set of use cases for it.
What is an ML Pipeline, at least as I have framed the problem? I think of it as the many rules, rule chains, and conditions encountered when ingesting dozens, hundreds, or thousands of documents while looking for terms or other, more nuanced conditions. Along the way, the pipeline uses popular NLP libraries like spaCy¹, SparkNLP², and many others. Some conditions encountered in these documents may also require proprietary TensorFlow code, or similar. The point is that many of these small-data business problem solutions come from hard and arduous effort: “blue collar”-like work. Building our ML pipelines in this manner has, in time and with much effort, resulted in a suite of capabilities that can be generalized and used in various conditions, but getting there involved, and continues to involve, a lot of work akin to “digging ditches” and “driving trucks.” In spite of the effort and time required, we are genuinely excited about what we have built, and we continue to extend this ever-growing set of rules and extensions to accommodate an ever-widening range of documents and document types.
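To make the idea of rules, rule chains, and conditions a little more concrete, here is a minimal sketch in plain Python. It is not DeepSee's actual pipeline: the `Rule`, `Finding`, and `run_pipeline` names, the two regular-expression rules, and the sample documents are all hypothetical, and a real pipeline would lean on libraries like spaCy rather than bare regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    """One match produced by one rule (hypothetical structure)."""
    rule: str
    text: str


@dataclass
class Rule:
    """A named pattern plus an optional document-level condition."""
    name: str
    pattern: re.Pattern
    # If set, the rule only fires when this returns True for the document.
    condition: Optional[Callable[[str], bool]] = None

    def apply(self, document: str) -> List[Finding]:
        if self.condition is not None and not self.condition(document):
            return []
        return [Finding(self.name, m.group(0))
                for m in self.pattern.finditer(document)]


def run_pipeline(rules: List[Rule],
                 documents: Dict[str, str]) -> Dict[str, List[Finding]]:
    """Apply every rule, in order, to every document."""
    results: Dict[str, List[Finding]] = {}
    for doc_id, text in documents.items():
        findings: List[Finding] = []
        for rule in rules:
            findings.extend(rule.apply(text))
        results[doc_id] = findings
    return results


# Illustrative rules only; real rule sets grow to hundreds of entries.
RULES = [
    Rule("currency_amount", re.compile(r"\$\d[\d,]*(?:\.\d{2})?")),
    Rule(
        "effective_date",
        re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
        # Only look for dates in documents that talk about being "effective".
        condition=lambda doc: "effective" in doc.lower(),
    ),
]

# Made-up sample documents for the sketch.
DOCUMENTS = {
    "invoice_001": "Total due: $1,250.00 by 2021-03-15.",
    "policy_017": "This policy is effective 2021-01-01 with a premium of $300.",
}

results = run_pipeline(RULES, DOCUMENTS)
for doc_id, findings in results.items():
    for finding in findings:
        print(doc_id, finding.rule, finding.text)
```

In this sketch the invoice yields only a currency amount, because the date rule's condition is not met, while the policy yields both an amount and an effective date. Each new document type tends to add more rules and conditions of this kind, which is where the “digging ditches” effort goes.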
Some small-data business challenges can still be solved by big-data NLP models if the models are tuned and trained well and the use case is well defined. For instance, at DeepSee our FILBERT³ set of models is helping us find “interesting” words, clauses, comparatives, categorizations, and so on, but it struggles, acting alone, with more hard-core reconciliation or validation conditions. Time will tell whether large-scale transfer learning and fine-tuning of BERT models, as with FILBERT, can solve narrow business automation challenges, with their accompanying small data, as well as the ML Pipeline seems able to.
We think that even when an automation opportunity comes with a paucity of data, we have solutions that bring the benefits of NLP deep learning to bear and deliver efficiencies.
1 https://spacy.io/ and https://github.com/explosion/spaCy
2 https://nlp.johnsnowlabs.com/ and https://github.com/JohnSnowLabs/spark-nlp
3 FILBERT – Transformer-based models (BERT) that have been fine-tuned on corpora of more than 3 billion tokens from documents focused on the language of Finance, Insurance, and Legal (FIL), hence the name. More information about FILBERT is in another blog post on this site.