In our architecture, there is a strong interface between Refinement and Recognition. It’s a Representation of the input data that we assume to be:
- Noise-reduced through dynamic levelling and averaging
- Enhanced for features that assist/speed Recognition
- Compressed for efficient transmission
- Designed to keep as much signal as possible
Thus, the interface from Refinement to Recognition is a NECK. Yes, that’s an intentional pun: we know from neural network research that narrow layers ("necks") are crucial to preventing overfitting to the data.[1] So, the Representation is an important element of the architecture: it enables Recognition to learn.
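To make the idea concrete, here is a minimal sketch of a neck in code: a toy autoencoder whose narrow middle layer is forced to carry a compressed Representation of a wider input. The layer sizes and the use of PyTorch are illustrative assumptions, not a description of our system.

```python
import torch
import torch.nn as nn

class NeckAutoencoder(nn.Module):
    """Toy autoencoder: a wide input is squeezed through a narrow 'neck'."""
    def __init__(self, n_inputs=256, n_neck=16):
        super().__init__()
        # Encoder: a Refinement-like stage that compresses the input.
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),
            nn.Linear(64, n_neck),          # the neck: a narrow Representation
        )
        # Decoder: training it to reconstruct the input forces the neck
        # to keep as much signal as possible.
        self.decoder = nn.Sequential(
            nn.Linear(n_neck, 64), nn.ReLU(),
            nn.Linear(64, n_inputs),
        )

    def forward(self, x):
        neck = self.encoder(x)
        return self.decoder(neck), neck

model = NeckAutoencoder()
x = torch.randn(8, 256)                     # a batch of (noisy) raw inputs
reconstruction, representation = model(x)
print(representation.shape)                 # torch.Size([8, 16]) -- the Representation
```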
It turns out that biology has evolved this architectural structure many times. Just in your own body:
- The optic nerve carries a representation of the visual world that has been pre-processed by the structures in the eye – particularly the retina – to deal with different light levels, enhance edges, etc. One of the best known representations is the Opponent Process for transmitting color information from 4 types of input receptors (three cone types plus rods) to the brain on 3 channels (see the sketch after this list).
- The cochlear nerve carries a representation of the auditory world that has been pre-processed by the structures in the ear – particularly the cochlea – to deal with different sound intensities, to break out the sound frequency spectrum, etc.
- The olfactory nerve carries a representation of smell that has been pre-processed by the structures in the nose – particularly the olfactory bulb – to reduce the noise from multi-scent receptors, to enhance strong signals, to level out consistent odors, etc.
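The color pathway makes a good concrete example of such a compressed Representation. In a textbook simplification, the receptor responses are recoded onto one achromatic channel and two opponent channels; the weights below are illustrative only, not exact physiology:

```python
import numpy as np

def opponent_channels(L, M, S, rod=0.0):
    """Recode receptor responses onto three opponent channels.
    The weights are a textbook simplification, not exact physiology."""
    luminance   = L + M + rod          # achromatic (light/dark) channel
    red_green   = L - M                # red vs. green opponency
    blue_yellow = S - (L + M)          # blue vs. yellow opponency
    return np.array([luminance, red_green, blue_yellow])

# Four receptor inputs (three cone types plus rods) compressed onto three channels.
print(opponent_channels(L=0.8, M=0.6, S=0.2, rod=0.1))
```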
Deep Learning
It’s clearly good system design. That’s why it’s so surprising that early “neural inspired” systems seem to have missed it! Early Deep Learning systems attempted to start directly from raw inputs, with the hope that the system would develop an “internal representation” through brute force (i.e., lots of data and lots of GPUs).[2] That does not happen reliably, and when it does, there is no way to analyze why it works.
This is one reason for the proliferation of Deep Learning structures, including Convolutional Neural Networks (CNNs), which build in a front-end processing stage.[3]
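As a rough illustration of what such a front end does, the sketch below (layer choices and sizes are arbitrary assumptions) shows a convolution-plus-pooling stage shrinking a raw image into a much smaller feature map before any Recognition layers see it:

```python
import torch
import torch.nn as nn

# A CNN "front end" as preprocessing: convolution plus pooling shrink a raw
# image into a far smaller feature map before any Recognition layers see it.
front_end = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(4),                    # aggressive down-sampling
)

image = torch.randn(1, 3, 64, 64)       # one raw 64x64 RGB input
features = front_end(image)
print(features.shape)                   # torch.Size([1, 8, 16, 16]) -- reduced representation
```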
Further, researchers[4] showed that it’s most efficient to pre-train or train a deep system layer by layer. A later analysis of why this is so[5] applied “denoising autoencoders” to bias the system toward more optimal solutions. In our terms, these approaches create noise-reduced Representations to pass forward, but as we’ll see, strong Representations of text are better created algorithmically.
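A minimal sketch of the denoising-autoencoder idea, greedy and layer-wise: corrupt the input, then train one layer (with a throwaway decoder) to reconstruct the clean signal. Sizes, noise level, and optimizer settings are assumptions for illustration, not the setup used in [5]:

```python
import torch
import torch.nn as nn

# Greedy, layer-wise pre-training of a single layer with a denoising autoencoder:
# corrupt the input, then train the layer to reconstruct the CLEAN version.
layer   = nn.Linear(100, 30)            # the layer being pre-trained
decoder = nn.Linear(30, 100)            # throwaway decoder used only during pre-training
opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=1e-3)

clean = torch.randn(256, 100)           # stand-in for real training data

for step in range(200):
    noisy = clean + 0.3 * torch.randn_like(clean)   # corrupt the input
    recon = decoder(torch.relu(layer(noisy)))       # encode, then decode
    loss = ((recon - clean) ** 2).mean()            # reconstruct the clean signal
    opt.zero_grad()
    loss.backward()
    opt.step()

# After pre-training, `layer` produces a noise-reduced Representation that is
# passed forward to the next layer (and eventually to Recognition).
```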
Distributed Vector Representations
Natural Language Processing (NLP) systems also pre-process their inputs, often through a process of feature extraction and parsing, but this is a process of creating metadata more than it is a process of noise reduction or compression.
NLP proved insufficient for certain information retrieval and search applications in the early 1990s. Neural Network methods were also insufficient. Instead, a new technology branch was invented (partially by Steve Gallant): distributed representation of text on fixed-length vectors, originally referred to as “Context Vectors”.[6] word2vec and GloVe vectors are the descendants of this work.
Distributed representations on vectors are produced algorithmically through simple operations. While not human-readable, they have proven very powerful as input to machine-learned Recognition. However, as “bag of words” algorithms, word2vec and GloVe both dispense with all of the structure in the text, from phrases and sentences to paragraphs and sections. Thus, they do not keep as much of the signal as possible.
This is most apparent in negation. In the medical records we work with, a doctor may write, “… other inflammation but rules out pneumonia.” With a simple bag-of-words representation, this is identical to “… pneumonia but rules out other inflammation”, which clearly does not mean the same thing! NoNLP™ representation (a vector representation invented by Steve Gallant) incorporates structural information into the Representation, making it possible to keep the negation structure intact. We have shown that this enables better modeling.
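A toy demonstration of the bag-of-words problem described above, using random stand-in word vectors rather than real word2vec or GloVe embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for learned word vectors; word2vec or GloVe would supply real ones.
vocab = {w: rng.normal(size=50) for w in
         ["other", "inflammation", "but", "rules", "out", "pneumonia"]}

def bag_of_words_vector(text):
    """Fixed-length vector: average the word vectors, discarding all word order."""
    return np.mean([vocab[w] for w in text.lower().split() if w in vocab], axis=0)

a = bag_of_words_vector("other inflammation but rules out pneumonia")
b = bag_of_words_vector("pneumonia but rules out other inflammation")
print(np.allclose(a, b))   # True: the representation cannot tell the two apart
```

Because the two sentences contain exactly the same words, any order-insensitive representation must map them to the same vector; keeping the negation intact requires the kind of structural information NoNLP™ adds.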
A true interface
In a good architecture, interfaces enable optimization of both sides independently. Vector-based representations have a strong advantage here over language-specific NLP and hidden Deep Learning layers. Specifically, any structured machine learning algorithm can be applied to the vectors to create Recognition models, allowing a choice between speed and accuracy with provable convergence.
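As an illustration of that flexibility, here is a sketch that treats the fixed-length vectors as an opaque interface and fits an off-the-shelf linear model on top of them. The vectors and labels are synthetic stand-ins for Refinement output, and the use of scikit-learn is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic stand-ins for fixed-length document Representations and labels;
# in practice the vectors would come from the Refinement stage.
X = rng.normal(size=(1000, 300))              # 300-dimensional document vectors
y = (X[:, :10].sum(axis=1) > 0).astype(int)   # made-up binary labels

# Because the interface is just a vector, any off-the-shelf learner applies.
model = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("held-out accuracy:", model.score(X[800:], y[800:]))
```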
Conclusion
Representation – the interface between Refinement and Recognition – is key to efficient machine learning in the latter. However, it is often overlooked (e.g., in deep learning) or over-specific (e.g., in NLP). A strong representation brings down the noise in the input data and enables rapid creation of Recognition models.
[1] While a “narrow” layer of neurons is well-known to assist in generalization and to reduce over-fitting, the design paradigm has reversed over time. Once upon a time, models were designed with narrow layers that were added to until results were acceptable. Now, the predominant method is to drop out neurons that appear unnecessary. Either way, the result is a neck in the model!
[2] “Deep learning is also known for its ability to self-generate intermediate representations, such as internal units that may respond to things like horizontal lines, or more complex elements of pictorial structure.”
G. Marcus, Deep Learning: A Critical Appraisal, 2018. arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf (PDF)
[3] “We view the convolutional ‘front ends’ in CNNs as largely playing the role of preprocessing stages, conducted for dimension reduction, which are easily adaptable to other approaches, including polynomial models.”
X. Cheng, B. Khomtchouk, N. Matloff, and P. Mohanty, Polynomial Regression as an Alternative to Neural Nets, July 2, 2018. arxiv.org/pdf/1806.06850.pdf (PDF)
[4] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 1137–1144. MIT Press, 2007.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Bernhard Schölkopf, John Platt, and Thomas Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 153–160. MIT Press, 2007.
[5] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning?, Journal of Machine Learning Research 11, 2010. (PDF)
[6] Caid, W. R., Dumais, S. T., & Gallant, S. I. (1995). Learned vector-space models for document retrieval. Information Processing and Management, 31, 419–429.