Introduction
Firehorse is the first company in the emerging Structured Language Processing industry. Today the market for document processing focuses on two extremes: forms and natural language. Current technologies are optimized for these extremes and handle them well, but they are not optimized for the documents and language that fall in between, which we call Structured Language. In fact, nearly every PDF exchanged between businesses, or between a consumer and a business, is a Structured Language document. A PDF is meant to be read and understood, so there is intelligence built into its formatting: the most important items are at the top of the page, like items are grouped together, and metadata is encased in the headers and footers. These conventions hold regardless of which RDBMS and reporting system created the PDF. As a result, PDFs from different entities are more often alike than not, which makes them a good way to exchange data. In many cases, if the OCR is good enough, it is easier to extract data from PDFs than it is to build APIs.
Evaluating an unsupervised NLP model
Examining model predictions (error analysis)
Create a baseline: Creating a baseline before diving into experimentation is always a good idea. You don’t want your BERT model to perform only marginally better than a TF-IDF + logistic regression classifier; you want it to blow the baseline out of the water. Always compare your model with the baseline, and ask where the baseline performs better than the complex model. Since baselines are generally interpretable, you might get insights into your black-box model too.
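As a rough sketch of the "where does my baseline perform better?" question, the snippet below compares two sets of predictions and lists the examples the baseline gets right but the complex model gets wrong. The function names and the toy labels are made up for illustration; in practice the predictions would come from your own baseline and model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def baseline_wins(y_true, baseline_pred, model_pred):
    """Indices where the baseline is correct but the complex model is not."""
    return [
        i
        for i, (t, b, m) in enumerate(zip(y_true, baseline_pred, model_pred))
        if b == t and m != t
    ]


# Toy, made-up labels purely for illustration.
y_true = ["neg", "neu", "pos", "neg", "pos"]
baseline_pred = ["neg", "neu", "neg", "neg", "neu"]
model_pred = ["neg", "pos", "pos", "neg", "pos"]

print("baseline accuracy:", accuracy(y_true, baseline_pred))
print("model accuracy:", accuracy(y_true, model_pred))
print("baseline right, model wrong:", baseline_wins(y_true, baseline_pred, model_pred))
```

Reading the examples at the returned indices is usually more informative than the accuracy gap alone, since they show what the simpler model picks up that the complex one misses.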
Metrics analysis: What are the precision and recall for each class? Where are my misclassifications ‘leaking’ to? If the majority of misclassifications for negative sentiment are predicted as neutral, your model is having trouble differentiating these two classes. An easy way to analyze this is to build a confusion matrix.
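A minimal, dependency-free sketch of the confusion matrix and per-class precision and recall follows. The label names and predictions are invented for illustration; libraries such as scikit-learn offer equivalent utilities if you already use them.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, labels):
    """Per-class (precision, recall) computed from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy, made-up data: negative examples mostly "leak" into neutral.
labels = ["neg", "neu", "pos"]
y_true = ["neg", "neg", "neg", "neu", "neu", "pos", "pos"]
y_pred = ["neg", "neu", "neu", "neu", "neu", "pos", "neu"]

matrix = confusion_matrix(y_true, y_pred, labels)
scores = precision_recall(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row, scores[label])
```

In this toy data the "neg" row has most of its mass in the "neu" column, which is the kind of leak the paragraph above describes.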
Low confidence predictions analysis: What do examples look like where the model is correct but the confidence of the classification is low? With three classes (for example negative, neutral and positive sentiment), the probability of the predicted class can be as low as about 0.33 (⅓).
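One way to surface these cases is sketched below: keep the examples whose predicted class is correct but whose top probability falls under a threshold. The 0.5 threshold, the function name and the probabilities are assumptions for illustration, not values from any particular model.

```python
def low_confidence_correct(y_true, probs, labels, threshold=0.5):
    """Return (index, confidence) for correct predictions below the threshold,
    sorted from least to most confident."""
    flagged = []
    for i, (truth, p) in enumerate(zip(y_true, probs)):
        best = max(range(len(labels)), key=lambda k: p[k])
        if labels[best] == truth and p[best] < threshold:
            flagged.append((i, p[best]))
    return sorted(flagged, key=lambda item: item[1])


# Toy, made-up class probabilities in the order of `labels`.
labels = ["neg", "neu", "pos"]
y_true = ["neg", "neu", "pos", "pos"]
probs = [
    [0.90, 0.07, 0.03],  # correct and confident
    [0.34, 0.36, 0.30],  # correct, barely ahead of the other classes
    [0.20, 0.35, 0.45],  # correct, low confidence
    [0.50, 0.30, 0.20],  # wrong, so not flagged
]

flagged = low_confidence_correct(y_true, probs, labels)
print(flagged)
```

The flagged indices point at examples worth reading by hand, since they sit close to the decision boundary even though the label came out right.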
Look at length vs. metric score: If the sentences in your training data vary widely in length, check whether there is a correlation between the misclassification rate and length.
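A simple way to check this, sketched under the assumption that whitespace-separated tokens are a good enough proxy for length, is to bucket examples by token count and compute the misclassification rate per bucket. The bucket size, function name and example sentences are hypothetical.

```python
def error_rate_by_length(texts, y_true, y_pred, bucket_size=10):
    """Misclassification rate per length bucket (keyed by bucket lower bound)."""
    totals, errors = {}, {}
    for text, t, p in zip(texts, y_true, y_pred):
        n_tokens = len(text.split())  # crude whitespace tokenization
        bucket = (n_tokens // bucket_size) * bucket_size
        totals[bucket] = totals.get(bucket, 0) + 1
        errors[bucket] = errors.get(bucket, 0) + (t != p)
    return {b: errors[b] / totals[b] for b in sorted(totals)}


# Toy, made-up examples purely for illustration.
texts = [
    "good",
    "not bad at all",
    "the plot was slow but the acting was great",
    "i expected more from this director to be honest",
]
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]

rates = error_rate_by_length(texts, y_true, y_pred, bucket_size=5)
print(rates)
```

If the rate climbs steadily with the bucket, length is likely a factor; with real data you would want enough examples per bucket before reading much into the numbers.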
Summary
I hope you found new ideas for acing your next NLP project.
To summarize, we started with why it is important to seriously consider a good project management tool and what a Data Science project consists of: data versioning, experiment tracking, error analysis, and managing metrics. Lastly, we concluded with ideas around successful model deployment.
If you enjoyed this post, a great next step would be to start building your own NLP project structure with all the relevant tools. Check out tools like: