Resources
Structured Language Processing
Introduction
Firehorse is the first company in the emerging Structured Language Processing industry. Currently, the market for processing documents focuses on two extremes: Forms and Natural Language. The technologies of the day are optimized for these two extremes and do a good job with them, but they are not optimized for the documents and language that fall in between, which we call Structured Language.
In fact, nearly every PDF exchanged between businesses, or between a consumer and a business, is a Structured Language Document. A PDF document is meant to be read and understood, so there is intelligence built into its formatting. The most important items are at the top of the page. Like items are grouped together. Metadata is carried in the headers and footers. These conventions hold regardless of which RDBMS or reporting system created the PDF. That means PDF documents from different entities are more often alike than not, which makes them a good way to exchange data. In many cases, if the OCR is good enough, it is easier to extract data from PDFs than it is to build APIs.
Category 1 Document: Rigid Forms
These include tax forms and applications. All of these forms have the same number of pages, and every element is in the same place on the page.
Category 2 Document: Variable Size Field
These include the universal Residential Loan applications. Here, certain sections might grow or shrink depending on the number of items.
Category 3 Document: Structured Documents
Category 4 Document: Formal But Unstructured Documents
Our snowflake ℠ algorithm can categorize each document after seeing two specimens. Our proprietary flow ℠ technology platform can abstract data from Category 1 – 3 documents in most cases on the first try. Category 4 documents require the customer to provide a list of the data items they want. Once we have that list, flow ℠ can abstract the data in most cases on the first try.
To understand how flow ℠ works, it helps to compare it against other technologies. At least a dozen companies have solved the problem of extracting data from Category 1 Documents, or Rigid Forms. They use spatial relationships on the PDF page itself, on the expectation that the data will be in the same place on every page. This model works for this very narrow application, with accuracy levels approaching 100% for at least half a dozen providers that we have tested. The challenge for most of these service providers is that building a model is time-consuming, as evidenced by the one-time model fees they charge, which can range from $180 to $3,000, or annual minimums, which can be $6,000 per Category 1 Document.
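As a rough illustration of the spatial approach described above, the following minimal Python sketch reads each field from a fixed rectangle on the page. It is not any provider's actual implementation: the Word and Zone types, the extract_by_zone function, the TEMPLATE coordinates, and the sample page are all hypothetical, and the words are assumed to come from an OCR step that reports a position for each one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word with its position (hypothetical OCR output)."""
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # top edge, in points from the top of the page


@dataclass(frozen=True)
class Zone:
    """A fixed rectangle on the page where one field is expected."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return (self.left <= word.x < self.right
                and self.top <= word.y < self.bottom)


def extract_by_zone(words, zones):
    """Return {field name: text found inside that field's fixed zone}."""
    result = {}
    for name, zone in zones.items():
        inside = [w for w in words if zone.contains(w)]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order
        result[name] = " ".join(w.text for w in inside)
    return result


# Hypothetical template for one rigid form: built once, reused on every copy.
TEMPLATE = {
    "name": Zone(left=100, top=50, right=400, bottom=70),
    "tax_year": Zone(left=450, top=50, right=550, bottom=70),
}

# Hypothetical OCR output for one filled-in copy of that form.
page = [
    Word("Name", 40, 55),   # printed label, outside every zone
    Word("Jane", 110, 55),
    Word("Doe", 150, 55),
    Word("2023", 460, 55),
]

fields = extract_by_zone(page, TEMPLATE)
print(fields)
```

The sketch also shows why this model is limited to rigid forms: if a section above these fields grows and pushes them down the page, the same zones come back empty, and a new template has to be built for each form.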
Providers have tried to expand this technology into other Categories with mixed success. For example, a common method is to look for a floating field name and bring in the value below or to the right of that field. This works well enough to make operations more efficient but still requires human intervention. Publicly available reviews and case studies suggest that accuracy levels are 75 – 90% for Category 2 documents.
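The floating field method can be sketched in a few lines of Python. This is an illustrative assumption about how such a lookup might work, not a description of any specific product: the Word type, the value_near_label function, its tolerance parameters, and the sample pages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """One OCR'd word with its position (hypothetical OCR output)."""
    text: str
    x: float  # left edge, in points
    y: float  # top edge, in points


def value_near_label(words, label, line_tol=3.0, col_tol=20.0) -> Optional[str]:
    """Find the label wherever it floats, then take the nearest word
    to its right on the same line, or else the nearest word below it."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    right = [w for w in words
             if abs(w.y - anchor.y) <= line_tol and w.x > anchor.x]
    if right:
        return min(right, key=lambda w: w.x).text
    below = [w for w in words
             if w.y > anchor.y and abs(w.x - anchor.x) <= col_tol]
    if below:
        return min(below, key=lambda w: w.y).text
    return None


# The same field in two different layouts (hypothetical pages).
page_right = [Word("Total:", 300, 700), Word("$1,250.00", 380, 700)]
page_below = [Word("Total:", 50, 400), Word("$980.00", 52, 415)]

# A layout where two labels share a line and the values sit underneath.
page_tricky = [
    Word("Total:", 50, 400), Word("Due:", 200, 400),
    Word("$980.00", 52, 415), Word("03/01", 202, 415),
]

print(value_near_label(page_right, "Total:"))
print(value_near_label(page_below, "Total:"))
print(value_near_label(page_tricky, "Total:"))
```

The first two pages resolve to the expected amounts even though the label sits in a different place on each. On the third page the rule returns the neighboring label instead of the amount, which is the kind of miss that still calls for human review.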
At the other end of the spectrum are Natural Language Processors. Because of the complexity of Natural Language, all of these companies use AI or Machine Learning to abstract information from a document. These technologies traffic in probabilities: there are many possible answers, and the model picks the most probable one. Natural Language Processors do a very good job with Category 3 Structured Documents. Before Firehorse, AI companies were the only solution for Category 4 documents. They do a serviceable job but at a high price.
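To make "picks the most probable one" concrete, here is a minimal Python sketch that turns raw scores for several candidate answers into probabilities with a softmax and returns the top one. The candidate values and scores are invented for illustration, and the most_probable helper is hypothetical; real systems produce their scores from trained models.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_probable(candidates):
    """candidates maps each possible answer to a raw model score.
    Returns the highest-probability answer and its probability."""
    answers = list(candidates)
    probs = softmax([candidates[a] for a in answers])
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs[best]


# Hypothetical scores for three readings of the same invoice total.
scores = {"$1,250.00": 2.0, "$125.00": 0.5, "$12.50": -1.0}

answer, confidence = most_probable(scores)
print(answer, round(confidence, 2))
```

The returned probability is below 1 even for the winning answer, which is why probabilistic extraction always carries some chance of picking the wrong candidate.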
Pure Natural Language Processors are not very accurate with the other Categories of Structured Language documents. Most companies actually combine floating field techniques with Natural Language Processing to reach those 75 – 90% accuracy levels. Natural Language Processing, AI and Machine Learning companies for the most part focus on harder documents, either handwriting or poor-quality PDF scans. When every customer's automation processes are already built around a significant amount of human intervention, the lowest-hanging fruit for automation providers is to improve OCR on bad documents rather than to make OCR on good documents better.