August 17, 2020. Wacey Richards, Bryan Sparks

D3O: Insights Storage from Unstructured Text

As part of the DeepSee platform and, more particularly, its ingestion of unstructured data, an array of data preparation functions and machine learning models is employed to give various insights into, and different views of, the original source text. The DeepSee platform accommodates the addition of new functions or machine learning models to these data processing steps. Further, in addition to insights from an individual document, there may be correlations between documents, or attributes that span a number of documents, whether or not these correlations were expected.

One way to think of this is as a manufacturing line. A source document is “dropped” at the start of the line and put in a “box”. The first station down the conveyor belt “picks up” the box and the source document and “interrogates” it in some way. The result of this interrogation is added to the box that contains the source document, and the box is placed back on the line. This may continue for several stations. Some stations further down the line may not pick up the source document but may instead pick up one of the results of a prior station and add to it. The number of steps in this processing line will grow as new insight methods that provide specific views are employed.
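The manufacturing line can be sketched in a few lines of Python. This is only an illustration of the idea, not DeepSee code: `Box`, `run_line`, and the two stations are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Box:
    """Hypothetical stand-in for the "box": a source document plus accumulated results."""
    source: str
    results: Dict[str, Any] = field(default_factory=dict)


# A station inspects the box and returns a (name, result) pair to add to it.
Station = Callable[[Box], Tuple[str, Any]]


def raw_text_station(box: Box) -> Tuple[str, Any]:
    # Works directly on the source document.
    return "raw_text", box.source.strip()


def term_station(box: Box) -> Tuple[str, Any]:
    # Does not touch the source; builds on the result of a prior station.
    words = box.results["raw_text"].split()
    return "capitalized_terms", [w.strip(".,") for w in words if w[0].isupper()]


def run_line(box: Box, stations: List[Station]) -> Box:
    for station in stations:
        name, result = station(box)
        box.results[name] = result  # add the result to the box and pass it on
    return box


box = run_line(Box("  The Lender pays the Borrower.  "), [raw_text_station, term_station])
print(box.results["capitalized_terms"])
```

Adding a new insight method amounts to appending another station to the list; earlier stations do not need to change.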

If these insights are stored in ways that allow them to inform “down the line” ML models, or functions that are run later, then an impressive number of insights will have been added to the original source, and these can lead to conclusions or actions that would be very difficult to divine from more traditional processes. The “box”, in our example, is an extensible data store holding the source document(s), the gathered insights, and methods for accessing these insights for further processing. This storage mechanism is the heart of the motivation for the DeepSee Data Definition Object, or D3O.

The D3O can be thought of as an extensible “draping” of attributes over the original imported unstructured data. This collection of attributes may include simple observations, like:

  • raw text of the original data,
  • what text was bolded or otherwise highlighted,
  • text formatting or layout within the original source documents,
  • unique or, perhaps, defined terms extracted from these documents,
  • text summarization,
  • X, Y coordinates within the original document, say a PDF, of “interesting terms”,
  • embedded tables or images,
  • and many others.

Beyond these simple attributes, additional attributes may also include:

  • document categorizations (“what kind of document is this?”),
  • references to external sources (e.g. URL links or references to other documents),
  • sentiments,
  • common or uncommon text structures,
  • exogenous data correlated to specifics found in the documents, such as news, regulatory filings, etc.,
  • and others.

From a more technical perspective, the D3O is accessed and extended through a suite of APIs, or Python library functions that wrap those APIs, which lead to various information points and ultimately to the data lake of stored attributes or insights the DeepSee platform has found. The D3O and its data lake may comprise:

  • additional files,
  • structured data extracted from the unstructured imported documents and stored in databases, like SQL databases,
  • correlations with other documents stored in graph databases,
  • or many other storage methods containing some insights.

For example, a D3O for a particular ingested document might describe the location of some structured data that was found and stored, along with methods to access it. Once this structured data is extracted, it may be stored in a traditional SQL database under some term or searchable key. The D3O could then be interrogated to give the location and the search details necessary to access this SQL database, and the desired data, directly.
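As a rough sketch of that interrogation, assume the D3O keeps a small record of where each insight lives. Everything named here is hypothetical: the record layout, the `locate` helper, and the `extracted_rates` table are made up for illustration, and an in-memory SQLite database stands in for whatever SQL store is actually used.

```python
import sqlite3

# Stand-in for a real SQL store; in-memory SQLite keeps the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extracted_rates (doc_id TEXT, term TEXT, value REAL)")
conn.execute("INSERT INTO extracted_rates VALUES ('doc-001', 'interest_rate', 4.25)")

# Hypothetical D3O record: it holds where the structured data lives and how
# to look it up, not the data itself.
d3o = {
    "doc_id": "doc-001",
    "insights": {
        "structured_rates": {
            "store": "sql",
            "table": "extracted_rates",
            "key_column": "doc_id",
        }
    },
}


def locate(d3o_record, insight_name):
    """Return the location and lookup details recorded for one insight."""
    return d3o_record["insights"][insight_name]


loc = locate(d3o, "structured_rates")
# Table and column names come from the trusted D3O record; the lookup value
# is bound as a query parameter.
query = f"SELECT term, value FROM {loc['table']} WHERE {loc['key_column']} = ?"
rows = conn.execute(query, (d3o["doc_id"],)).fetchall()
print(rows)
```

The point of the design is that the caller never hard-codes the table or key: it asks the D3O where the data is and then queries the store directly.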

The D3O is an innovation that DeepSee believes places the imported document (unstructured data) at the “center of the universe” of the DeepSee platform. Around this document are attributes and insights gained by looking at and parsing this document and other documents like it. Further, if this document is found to be related or similar to other documents in the overall collection, then these relationships are exposed. On top of the D3O, a wide array of capabilities, visualizations, discrete solutions, and extensible services can be built, all while adding to the value of the original source documents.

Two Use Cases

Data Scientist Use Case

Let’s assume you are a data scientist tasked with preparing a corpus of unstructured data on which you will then do some ML-focused work. When a document is imported into the DeepSee platform, the D3O will initially house the source document and a collection of data preparation derivatives of this document, as well as views of the data that could be used by your ML models.

It may also be that you are a Subject Matter Expert (SME), or have access to an SME, on the more nuanced details within the corpus. With unstructured data, it is often a challenge to inform ML models of the terms, weights, and biases attached to details of the corpus that may not be readily apparent from a cursory read. Outside the scope of this post, but highlighted in another post, is a DeepSee tool, Atlas, that helps capture the “tribal knowledge” of an SME on parsing, highlighting, calculating, and defining features from this corpus. This captured tribal knowledge, and the insights particular to the guidance given by the SME, are also stored with the D3O.

For a data scientist working with unstructured data, these data preparation and feature engineering steps are generally difficult and time-consuming. The DeepSee platform, its tools (e.g. Atlas), and the storage of these data preparation and feature engineering outputs make your tasks simpler. The D3O can point you, and thus your models, to already prepared data from the original sources. Further, the insights previously produced, perhaps by other models, are also available to you and your models for use as inputs. If the models you introduce give some additional insights, then storing these insights using the D3O allows future functions to make use of them in canonical ways.
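A minimal sketch of that workflow follows. The real D3O API is not described in this post, so the `D3O` class below, its `put`/`get`/`names` methods, and the toy `risk_model` are all assumptions made purely for illustration.

```python
class D3O:
    """Hypothetical in-memory stand-in; not the actual DeepSee interface."""

    def __init__(self, source):
        self.source = source
        self._insights = {}

    def put(self, name, value, produced_by):
        # Record who produced the insight so later functions can trace it.
        self._insights[name] = {"value": value, "produced_by": produced_by}

    def get(self, name):
        return self._insights[name]["value"]

    def names(self):
        return sorted(self._insights)


d3o = D3O("Payment is overdue and the client is unhappy.")
# Insights assumed to have been stored by earlier preparation steps or models.
d3o.put("tokens", d3o.source.lower().strip(".").split(), produced_by="data_prep")
d3o.put("sentiment", -0.6, produced_by="sentiment_model")


def risk_model(tokens, sentiment):
    # Toy rule standing in for a real model: overdue language plus negative sentiment.
    return "high" if "overdue" in tokens and sentiment < 0 else "low"


# The new model consumes prepared data and a prior model's output,
# then stores its own result for future functions to use.
d3o.put("risk", risk_model(d3o.get("tokens"), d3o.get("sentiment")), produced_by="risk_model")
print(d3o.names(), d3o.get("risk"))
```

Here the new model never re-tokenizes the source or re-runs sentiment analysis; it reads what earlier steps left behind and adds one more named insight alongside them.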

Visualization or Solutions Engineer Use Case

If you are an engineer tasked with providing some visualization, like a dashboard, or with building a discrete application to solve some particular problem, then the D3O can assist in several ways. Let’s take, as an example, a corpus of unstructured data that contains hidden components which could be treated as structured data if they were found and extracted. As part of a processing step in the D3O enhancement pipeline, these structured elements could be stored in more traditional structured databases, such as a common SQL DB. Then, as part of your visualization or solution build, querying the D3O for the locations of these structured elements allows you to make direct queries against those structured data stores.

In short, the D3O can assist in converting, and then finding and accessing, a subset of structured data from a larger unstructured corpus for easier access and manipulation.