History of a Breakthrough
In the early 1990s, Steve Gallant co-created a new technology in computer analysis of unstructured text: fully distributed vector representations. Unlike rules-based Natural Language Processing (NLP), which traces its roots to the 1950s, vector representations offered an ideal way to represent text for machine learning algorithms, enabling powerful predictive models.
But distributed vector representations, including modern versions like GloVe vectors, have a weakness: they are based on a "bag of words." Much of the information in the text (negation, parse structure, and so on) is therefore lost. This limits the amount of information captured in the vector, and so limits the accuracy of any model built on it. This limitation occupied Professor Gallant for almost 20 years.
Then, in 2013, he had a breakthrough, finding a good way to include arbitrary structure in the same vector as the words. This is the NoNLP™ technology patented by Textician. Machine learning on NoNLP vectors accesses a much deeper set of information in the vector, leading to powerful models.
Multi-label Classification: Given a text, what subset of a list of labels best applies to that text? For example, given a financial news story, is it about China, commodities, and/or mining? This is a very difficult problem technically, especially as the list of labels gets large. We solve it with ease: live demo.
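The shape of the task can be sketched in a few lines. This is a toy illustration only: it uses a word-count vector as a stand-in for the text representation (not the patented NoNLP vector), and the vocabulary, labels, and weights are invented for the example. Each label gets its own scorer, and a text receives every label whose score clears a threshold, so the output is a subset of the label list rather than a single class.

```python
# Toy multi-label classification sketch. The bag-of-words vector here is a
# stand-in for a learned text representation; vocab, labels, and weights
# are hypothetical values chosen for illustration.

def vectorize(text, vocab):
    """Map a text to a fixed-length word-count vector over `vocab`."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def score(vec, weights):
    """Linear score for one label: dot product of vector and weights."""
    return sum(v * w for v, w in zip(vec, weights))

def classify(text, vocab, label_weights, threshold=1.0):
    """Return the subset of labels whose score passes the threshold."""
    vec = vectorize(text, vocab)
    return {label for label, w in label_weights.items()
            if score(vec, w) >= threshold}

vocab = ["china", "copper", "mine", "exports", "futures"]
# Hand-set weights for the sketch; in practice these are learned from
# labeled training examples.
label_weights = {
    "China":       [2.0, 0.0, 0.0, 0.5, 0.0],
    "commodities": [0.0, 1.5, 0.5, 0.5, 1.0],
    "mining":      [0.0, 1.0, 2.0, 0.0, 0.0],
}

story = "China copper mine exports rise"
print(classify(story, vocab, label_weights))
```

Because each label is scored independently, a story can carry all three labels at once, one of them, or none at all, which is exactly what distinguishes multi-label classification from ordinary single-class prediction.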
Decisions from text: Given a text, what is the probability that X will be true? For example, in healthcare X could be predicting an acute condition such as being diagnosed with severe sepsis in the future, or readmission to the Emergency Department within a month. We’ve just begun investigating such Risk Stratification with actual hospital data, and with actionable results. See preliminary work with Baystate Health (pdf).
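A probability-from-text prediction of this kind can be sketched with a logistic model over a text vector. Everything below is a hypothetical stand-in: the vector, the weights, and the 30-day-readmission framing are illustrative, not Textician's actual model or Baystate data.

```python
# Risk-prediction sketch: map a text vector to a probability with a
# logistic model. The note vector and weights are hypothetical; a real
# system learns them from labeled historical records.
import math

def predict_probability(text_vec, weights, bias):
    """Logistic regression: P(X is true | text) = sigmoid(w . v + b)."""
    z = sum(w * v for w, v in zip(weights, text_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

note_vec = [0.8, 0.3, 0.6]      # hypothetical vector for a clinical note
weights = [1.2, -0.4, 2.0]      # hypothetical learned weights
p = predict_probability(note_vec, weights, bias=-1.0)
print(f"P(readmission within 30 days) = {p:.2f}")
```

The output is a calibrated score between 0 and 1, which is what makes it usable for risk stratification: patients can be ranked or bucketed by predicted risk rather than given a bare yes/no.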
Conceptual Search: Given a text, what previously-seen text is the most conceptually similar to it? Word search is simple, but what if you have an article about "hearts" and you'd like it to match articles about "cardiac"? Or text about "physicians" that should automatically match text about "doctors" or "surgeons"? One of the properties of the NoNLP representation is that similar concepts map to similar vectors (while remaining aware of negation), which lets machine-learned models generalize without the need for human programming.
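When similar concepts map to nearby vectors, conceptual search reduces to a nearest-neighbor lookup, commonly by cosine similarity. The tiny three-dimensional vectors below are hand-made for illustration; a real system would use learned, much higher-dimensional vectors such as the NoNLP representation.

```python
# Conceptual-search sketch: the "most conceptually similar" document is
# the stored vector nearest the query by cosine similarity. The 3-d
# vectors are hypothetical, hand-placed so that "heart" and "cardiac"
# content land close together despite sharing no words.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(query_vec, corpus):
    """Return the id of the stored vector closest to the query."""
    return max(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]))

corpus = {
    "cardiac-surgery-article": [0.90, 0.10, 0.00],
    "crop-futures-article":    [0.00, 0.20, 0.95],
}
heart_query = [0.85, 0.15, 0.05]   # vector for an article about "hearts"
print(most_similar(heart_query, corpus))
```

The match succeeds even though "heart" and "cardiac" share no characters, because the similarity lives in the vectors, not in the surface words.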
Integrating Text with Data Modeling: Data scientists struggle to integrate unstructured data like text into their structured data models. The challenge is so great that often the text is simply neglected. However, the NoNLP representation is a fixed length vector, so integrating with structured data is as easy as appending that data to the vector. Machine learning can then take place over both types of data at the same time.
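Because the text representation has fixed length, the integration step really is just concatenation, as the paragraph above describes. A minimal sketch, with invented field values and a hypothetical 4-dimensional text vector:

```python
# Sketch of combining a fixed-length text vector with structured fields:
# concatenate them into one flat feature row, and a single model can then
# learn over both kinds of data at once. All values are illustrative.

def combine(text_vec, structured):
    """Append structured numeric features to a fixed-length text vector."""
    return text_vec + structured

text_vec = [0.2, 0.7, 0.1, 0.4]    # e.g. a 4-d text embedding
structured = [63.0, 1.0, 98.6]     # e.g. age, sex flag, temperature
features = combine(text_vec, structured)
print(len(features))               # one flat row of 7 features
```

Any downstream learner sees a single fixed-width row per record, so no special text-handling machinery is needed in the modeling pipeline.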
Textician encapsulates NoNLP technology and relevant machine learning and modeling into a server-based application. This can be delivered in one of two ways:
OEM: The server application runs in a virtual server alongside the OEM application, and is accessed via an API. (Contact us for the documentation.)
SaaS: We host the server application, including a high-security interface to the clients. We provide a DLL for integration into client software.
The high-security interface is more than just encryption and HTTPS. Rather, we take advantage of inherent attributes of NoNLP technology to operate on the text at the server without sending the raw text off the client device. For HIPAA-compliant applications and other use cases that require the highest level of data privacy, the risk of data leakage from the server is virtually eliminated.
“AI” and “machine learning” encompass a number of different technologies. Here’s a summary:
Compared with Rules-based NLP: The most commonly deployed technology for automatic inference from unstructured text is rules-based natural language processing. In such a system, humans write explicit rules to teach the computer to parse and interpret the input text. Rules-based NLP systems are very brittle: although they perform well in domains where the rules are followed, they tend to perform poorly for anything outside the pre-specified rules. Consider a system whose rules all refer to a "doctor" that is then presented with text about a "surgeon" or a "physician." Clearly a rule could be written to overcome this particular synonym problem, but that points to another weakness: rules-based NLP technology does not scale well as problem complexity grows. Handling negations makes human-coded rules even more difficult. As a result, typical medical coding NLP systems contain over 500,000 rules that must be maintained by an expensive staff, yielding both high cost and slow response times to queries.
Compared with Statistical NLP: Statistical methods, like vector methods before the NoNLP representation, have limited applicability when the text is varied and nuanced. For instance, no two doctors write notes the same way or use the same abbreviations and shorthand, which severely limits the use of these methods on healthcare records.
Compared with Deep Learning and Neural Networks: Deep Learning models perform exceptionally well in some cases. However, the training algorithms are not guaranteed to converge on a solution, and designing the model (e.g., how many layers, how many neurons per layer, and how they are connected) requires a great deal of expertise. Further, Deep Learning requires a very large set of training data and the compute power to process it. This limits the ability of Deep Learning to handle certain scaled-out problems, such as multi-label classification.