Annotation Do's & Don'ts

Your model is only as good as the data it is given during training. It's important to have the right quantity of data and to pick training data that represents real-world data well. It's equally important to label and annotate the data properly to get accurate results.

📌 All labels should be present in all samples for training.

Every label the model is to be trained on should be annotated across the entire training set. Let's say you need to extract 10 labels from W2 documents and you have 50 training files. All 50 documents should be annotated for all 10 labels. Inconsistent labels across the training data will bring down the model accuracy.

The highlighted area shows the label counts
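Before training, the rule above can be checked with a small script. The following is a minimal sketch, assuming the annotations can be exported as a mapping from document name to the list of label names annotated in that document. The export format, the function names `find_missing_labels` and `label_counts`, and the sample data are all hypothetical and not part of any specific tool.

```python
from collections import Counter


def find_missing_labels(annotations, expected_labels):
    """Return {document_name: sorted missing labels} for incomplete documents.

    `annotations` is assumed to map each document name to the label names
    annotated in it (hypothetical export format).
    """
    expected = set(expected_labels)
    missing = {}
    for doc_name, labels in annotations.items():
        gap = expected - set(labels)
        if gap:
            missing[doc_name] = sorted(gap)
    return missing


def label_counts(annotations):
    """Count in how many documents each label appears."""
    counts = Counter()
    for labels in annotations.values():
        counts.update(set(labels))  # count each label once per document
    return counts


# Hypothetical sample: three documents, three expected labels.
expected = ["SSN", "EmployerName", "Wages"]
sample = {
    "w2_001.pdf": ["SSN", "EmployerName", "Wages"],
    "w2_002.pdf": ["SSN", "Wages"],
    "w2_003.pdf": ["SSN", "EmployerName", "Wages"],
}

print(find_missing_labels(sample, expected))
print(label_counts(sample))
```

If every label count equals the number of training documents, the set satisfies the rule. Any document reported by `find_missing_labels` should be annotated further before training.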

📌 The name of a label should be the same for similar fields across the training data

Let's take an example. If you need to label a Social Security Number, you are free to pick any name for the label, but the same name should be used across all the documents.

SSN is labelled as 'EmployeeSSN' in one case and 'SSN' in another case
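Naming drift like this is easy to miss by eye in a large training set. The sketch below is one possible heuristic, not a definitive check: it normalizes label names (lowercase, letters and digits only) and flags pairs where one normalized name equals or contains the other. The function names are hypothetical, and the flagged pairs are only candidates for manual review, since distinct fields such as 'Name' and 'EmployerName' would also be flagged.

```python
from itertools import combinations


def normalize(name):
    """Lowercase a label name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suspicious_label_pairs(label_names):
    """Return pairs of label names that may refer to the same field.

    Heuristic (an assumption, not a rule of any tool): two names are
    suspicious if, after normalization, they are equal or one contains
    the other.
    """
    unique = sorted(set(label_names))
    pairs = []
    for a, b in combinations(unique, 2):
        na, nb = normalize(a), normalize(b)
        if not na or not nb:
            continue  # skip names with no letters or digits
        if na == nb or na in nb or nb in na:
            pairs.append((a, b))
    return pairs


# Label names collected from all training documents (hypothetical sample).
names = ["EmployeeSSN", "SSN", "Wages", "wages", "EmployerName"]
print(suspicious_label_pairs(names))
```

Each flagged pair should be resolved by picking one name and renaming the other everywhere in the training set.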

📌 Annotate only the text portion of the area of interest

Don't include too much white space around the text when you annotate it. This noise will reduce the overall model accuracy.

Too much white space in the first case
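If annotations are stored as rectangular boxes over a page image, a loose box can be tightened automatically. The sketch below assumes a grayscale page represented as a list of pixel rows (0 is black, 255 is white) and a box given as `(left, top, right, bottom)` with exclusive right and bottom edges. The representation, the function name `tighten_box`, and the `ink_threshold` parameter are illustrative assumptions, not the format of any particular annotation tool.

```python
def tighten_box(image, box, ink_threshold=128):
    """Shrink a box to the smallest rectangle containing the dark pixels in it.

    `image` is a list of rows of grayscale values; pixels darker than
    `ink_threshold` count as text. Returns the box unchanged if it
    contains no dark pixels.
    """
    left, top, right, bottom = box
    # Rows and columns inside the box that contain at least one dark pixel.
    rows = [y for y in range(top, bottom)
            if any(image[y][x] < ink_threshold for x in range(left, right))]
    cols = [x for x in range(left, right)
            if any(image[y][x] < ink_threshold for y in range(top, bottom))]
    if not rows or not cols:
        return box
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


# Hypothetical 8x6 white page with a small block of "text" pixels.
page = [[255] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        page[y][x] = 0

loose_box = (0, 0, 8, 6)             # covers the whole page
print(tighten_box(page, loose_box))  # (3, 2, 6, 4)
```

A real page would need noise handling, for example ignoring isolated specks, so this is best treated as a starting point for reviewing boxes rather than a replacement for careful annotation.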
