Many in the world—and I count myself among them—are well into the era of verbally conversing with machines. I speak, of course, about voice-enabled digital assistants. Whether you address yours as “Siri,” “Google,” “Alexa,” or something else, controlling and querying computers in the way Gene Roddenberry imagined in the original 1966 Star Trek series is now a typical piece of 21st-century life. My grandchildren will never know a time without conversant boxes sitting on the counter.
Several significant technological advances made voice assistants possible. One advancement many may not realize is the development of far-field sound capture. We have all experienced the sensation that someone speaking into a laptop microphone sounds like they are sitting in a barrel. It was often ridiculously difficult to understand what the speaker was saying, and turning up the volume only made it worse.
Engineers solved the clarity problem by installing more than one microphone on a device and digitally processing what those microphones hear into a clean representation of the person speaking, eliminating background noise and echo. With those advances, talking to the device from across the room works just as well as standing over it.
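To make the idea concrete, here is a minimal delay-and-sum beamformer sketch in Python. It is a toy illustration of multi-microphone processing, not what any commercial device actually runs; real assistants layer adaptive filtering and echo cancellation on top of this basic trick.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, source_direction, fs, c=343.0):
    """Crude delay-and-sum beamformer: time-align each microphone's signal
    toward the presumed talker and average, reinforcing the speech while
    partially canceling off-axis noise and echo.

    mic_signals: (n_mics, n_samples) array of time-domain audio
    mic_positions: (n_mics, 3) microphone coordinates in meters
    source_direction: unit vector pointing toward the talker
    fs: sample rate in Hz; c: speed of sound in m/s
    """
    n_mics, n_samples = mic_signals.shape
    # Relative arrival time of the talker's sound at each microphone.
    delays = mic_positions @ source_direction / c
    # Convert to whole-sample shifts (a real system interpolates sub-sample).
    shifts = np.round((delays - delays.min()) * fs).astype(int)
    aligned = np.zeros((n_mics, n_samples))
    for i, s in enumerate(shifts):
        aligned[i, : n_samples - s] = mic_signals[i, s:]
    # Averaging the aligned channels boosts the on-axis voice.
    return aligned.mean(axis=0)
```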
Transforming the spoken question or command into text a computer can process is another critical advance. Clear audio is key, but efforts to train recognition models in a speaker-independent way have done wonders in producing reliable results regardless of gender, accent, and other vocal differences. They are not so good at decoding three-year-old speech yet, but then again, neither am I.
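As a small illustration of that step, the open-source SpeechRecognition package wraps several recognition engines behind one call. The file name here is hypothetical, and the commercial assistants run their own proprietary pipelines, but the shape of the operation is the same:

```python
import speech_recognition as sr  # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:  # hypothetical recording
    audio = recognizer.record(source)

try:
    # Sends the audio to Google's free web recognizer; other engines
    # (Sphinx, Whisper, cloud APIs) plug in through similar methods.
    text = recognizer.recognize_google(audio)
    print("Heard:", text)
except sr.UnknownValueError:
    print("Could not make out any speech.")
```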
Those advances lead up to the main one: natural language processing. NLP is how systems make sense of a string of words, figuring out, at least in most cases, what the words mean and, by extension, what is supposed to happen. As I understand it, beyond simple grammatical deconstruction, most voice assistant language understanding derives from trained machine learning models.
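A hand-rolled toy shows the shape of the problem, even though, as noted above, production assistants learn this mapping from data rather than from rules. Everything here, patterns and intent names alike, is invented for illustration:

```python
import re

# Toy intent table: pattern -> intent name. Real assistants learn these
# mappings with trained models instead of hand-written rules.
INTENTS = [
    (re.compile(r"\b(turn|switch) on (the )?(?P<device>[\w ]+)", re.I), "device_on"),
    (re.compile(r"\bwhat('s| is) the weather\b", re.I), "weather_query"),
    (re.compile(r"\bset (a |an )?timer for (?P<minutes>\d+) minutes?\b", re.I), "set_timer"),
]

def parse_intent(utterance: str):
    """Return (intent, slots) for the first matching pattern, else None."""
    for pattern, intent in INTENTS:
        match = pattern.search(utterance)
        if match:
            return intent, {k: v for k, v in match.groupdict().items() if v}
    return None

print(parse_intent("Please turn on the kitchen lights"))
# ('device_on', {'device': 'kitchen lights'})
```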
Regardless of the exact implementation or use case, NLP got a lot better over the last couple of years as Amazon, Google, Apple, and others employed armies of researchers and engineers to perfect these conversant systems.
Natural language is the very definition of a text document. Therefore, advances in NLP mean advances in our ability to process the content of a document intelligently. Granted, some documents, like a filled-in form, hold little free-form text and will not benefit from NLP. Names, part numbers, prices, and other fixed bits of standalone text are not language. NLP works where there is grammar: sentences and paragraphs.
In 2019, Daniel Otter et al. described the current state of NLP text-processing research. As the paper notes, the science of NLP defines four core areas of processing: language modeling, morphology, parsing, and semantics. Language modeling and morphology have limited application in document automation. However, parsing, which examines how different words and phrases relate to each other within a sentence, and semantics, which works out the meaning of words, phrases, sentences, or whole documents at some level, bring us some beneficial capabilities.
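A small spaCy sketch makes parsing concrete. Assuming the en_core_web_sm model is installed, the parser reports each word's grammatical role and the word it attaches to, and exposes noun phrases along the way:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The borrower shall repay the loan within thirty years.")

# Parsing: each token's dependency label and the word it attaches to.
for token in doc:
    print(f"{token.text:10} {token.dep_:10} -> {token.head.text}")

# A taste of semantics: the noun phrases the parse exposes.
print([chunk.text for chunk in doc.noun_chunks])
```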
Semantics is the core of classification, the act of figuring out the purpose of a given document. Training starts with a tagged, representative set of documents for each class; semantic processing teaches the neural network meanings derived from those samples. Once the model is active, the neural net compares the semantics extracted from an incoming document against what it learned and delivers a probability that the document belongs to each of the trained classes.
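As a compact stand-in for the neural classifiers described above, a classical TF-IDF model built with scikit-learn shows the same pattern: train on tagged samples, then return a probability per class. The training documents here are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: (document text, tagged class).
train_docs = [
    ("This agreement secures a loan against the property described", "mortgage"),
    ("The undersigned lessee agrees to rent the premises described", "lease"),
    ("Please remit payment for the invoice itemized below", "invoice"),
]
texts, labels = zip(*train_docs)

# TF-IDF features plus logistic regression: a simple classical stand-in
# for the trained neural classifiers the article describes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

incoming = "Monthly rent is due on the first day of each month."
probs = dict(zip(model.classes_, model.predict_proba([incoming])[0]))
print(probs)  # probability the document belongs to each trained class
```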
Depending on the use case, it may be practical to include real-time training. Documents can exit the neural net with low confidence for several reasons; two of the most common are document types the model was never trained on and content that varies too much from the trained samples. Real-time training routes such documents to humans competent and capable of properly tagging them, then feeds the tagged documents back into the machine learning mechanism. Done properly, the model gets better over time.
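Here is a minimal sketch of that feedback loop, assuming a classifier with a predict_proba method like the one above; the threshold, queue, and retraining helpers are hypothetical simplifications of what a production workflow system would provide:

```python
CONFIDENCE_THRESHOLD = 0.80  # tune per use case

def route_document(model, text, review_queue):
    """Auto-accept confident classifications; send the rest to humans."""
    probs = model.predict_proba([text])[0]
    best = probs.argmax()
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return model.classes_[best]
    # Low confidence: queue the document for a human to tag properly.
    review_queue.append(text)
    return None

def retrain(model, texts, labels, corrected_pairs):
    """Fold human-tagged documents back in and refit, so the model
    gradually covers the cases it used to miss."""
    new_texts, new_labels = zip(*corrected_pairs)
    all_texts = list(texts) + list(new_texts)
    all_labels = list(labels) + list(new_labels)
    model.fit(all_texts, all_labels)
    return model, all_texts, all_labels
```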
Once a system understands the context of a document—its type or class—the next opportunity is to extract useful data. That data may be metadata, data about the document often used for cataloging and retrieval, or process data, data valuable to the humans and systems running a business process.
NLP systems extract entities from text. Entities are data elements identified in some fashion by the text surrounding them. For example, a mortgage contract could include the mortgagee, the mortgagor, the property address, the term of the loan, the amount of the loan, several different dates, and so on. A trained and configured NLP model uses several parsing techniques to find those entities and return them, again with a confidence value, to the interested person or software application.
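As one illustration, the Hugging Face transformers NER pipeline returns entities with exactly that kind of confidence score. The model named here is a general-purpose public one; a document automation system would train on its own entity types (mortgagor, loan amount, and so on):

```python
from transformers import pipeline

# A public, general-purpose NER model; a real document-automation system
# would fine-tune on its own entities rather than generic person/org tags.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = ("This mortgage, dated March 1, 2021, is made between "
        "Jane Smith (mortgagor) and First Example Bank (mortgagee).")

for entity in ner(text):
    print(f"{entity['word']:20} {entity['entity_group']:6} "
          f"confidence={entity['score']:.2f}")
```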
Getting functional classification and extraction results from NLP still requires supervised learning (tagging) and use-case-specific model development. Research is underway to make better use of unsupervised learning and to apply deep learning to the NLP challenge. I suspect there will be a day when a generic model delivers useful classification and extraction results. We are just not there yet.
NLP has come a long way since Joseph Weizenbaum created ELIZA. That program, like a voice assistant, was conversational. Garnering helpful information from a body of text differs from holding a conversation, but text-based applications benefit from the incredible amount of work being done across the NLP discipline.