A quick summary of 7 important DO’s and DON’Ts when training an NLP model for a chatbot. They are best applied before starting a project, but can also help to build a mindset for quality training data in all chatbot project phases.
DO’s and DON’Ts
✅DO: think in problem space, not in solution space
Users typically think in problem space, not in solution space, and so should you. As a quick example, consider the case of a user who ordered a shirt in an online shop and wants to know when it is expected to arrive. Consider this question:
- when will my shirt arrive
This is a question from problem space: it describes the problem the user wants solved. These, in contrast, are from solution space:
- what is the estimated shipping time
- show me the order status
They describe how your business will react to the problem.
Benefit: the chatbot and the users speak the same language
❌DON’T: overload your intents with too many problems
As a rule of thumb, each intent should handle at most 3–6 user problems as described above. For each problem you should provide at least 3 user examples. Focus on the essence of the intent: the solution your chatbot can provide for your users.
Benefit 1: content stays maintainable and focused
Benefit 2: separation of concerns makes dialog building straightforward
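The rule of thumb above can be checked automatically. The following is a minimal sketch, assuming a hypothetical training data layout (intent name, then user problem, then example utterances); the function name, the thresholds as constants, and the sample data are illustrative and not tied to any particular NLP platform.

```python
from typing import Dict, List

# Hypothetical layout: intent name -> user problem -> example utterances.
TrainingData = Dict[str, Dict[str, List[str]]]

MAX_PROBLEMS_PER_INTENT = 6   # upper end of the 3-6 rule of thumb
MIN_EXAMPLES_PER_PROBLEM = 3  # at least 3 user examples per problem


def check_intents(data: TrainingData) -> List[str]:
    """Return readable warnings for intents that break the rule of thumb."""
    warnings = []
    for intent, problems in data.items():
        if len(problems) > MAX_PROBLEMS_PER_INTENT:
            warnings.append(
                f"{intent}: {len(problems)} problems (max {MAX_PROBLEMS_PER_INTENT})"
            )
        for problem, examples in problems.items():
            if len(examples) < MIN_EXAMPLES_PER_PROBLEM:
                warnings.append(
                    f"{intent}/{problem}: {len(examples)} examples "
                    f"(min {MIN_EXAMPLES_PER_PROBLEM})"
                )
    return warnings


# Illustrative sample data, not taken from a real project.
data = {
    "delivery_status": {
        "when_will_it_arrive": [
            "when will my shirt arrive",
            "my order is not here yet",
            "how long until I get my package",
        ],
        "where_is_it": ["where is my parcel"],
    }
}

for warning in check_intents(data):
    print(warning)
```

Running a check like this before each training run makes overloaded intents and under-specified problems visible early.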
✅DO: clearly separate intents from entities
To our surprise, it is still very common to mix up the concepts of intents and entities, and we strongly suggest you stop doing it. Consider a real-life example of a fashion store that trained an NLP model with 3 separate intents, one per product.
In this case there is room for exactly 1 intent (order) and 3 entities (shirt, pants, socks). Data scientists training the NLP model may not notice a real difference, but your developers will be grateful when coding dialog flow and fulfillment based on the NLP model output.
Benefit: maintainable and clearly defined NLP model output
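To illustrate why developers benefit, here is a minimal sketch of fulfillment code working against a model output with one `order` intent and a product entity. The `NluResult` class, the `product` entity key, and the reply texts are assumptions made for this example; real NLP platforms each have their own output format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NluResult:
    """Hypothetical, simplified shape of an NLP model output."""
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def fulfil(result: NluResult) -> str:
    """One code path for the 'order' intent; the product is just data."""
    if result.intent != "order":
        return "Sorry, I did not understand that."
    product = result.entities.get("product")
    if product is None:
        # Intent recognized, entity missing: ask a follow-up question.
        return "What would you like to order?"
    return f"Placing an order for: {product}"


print(fulfil(NluResult(intent="order", entities={"product": "socks"})))
print(fulfil(NluResult(intent="order")))
```

Adding a fourth product then means adding an entity value, not a new intent and a new branch in the dialog code.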
❌DON’T: repeat sentence patterns in training data
When asking how much training data is sufficient, you have to resist the general answer “the more the better”. Training examples that all follow the same pattern, like
- order me a shirt
- order me some shirts
- order me shirts
at best do not help your NLP model with classification, and at worst even have a negative effect by overfitting it (although, to be honest, a state-of-the-art pre-trained NLP model usually prevents this out of the box).
Benefit: keeps your training data small and focused
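Repeated patterns can be spotted with a simple script. The sketch below groups utterances by a crude "pattern signature"; the filler word list and the naive plural stripping are assumptions chosen for this example, and a real project would likely use a proper tokenizer and lemmatizer instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: words that do not change the sentence pattern.
FILLER_WORDS = {"a", "an", "the", "some"}


def pattern_signature(utterance: str) -> Tuple[str, ...]:
    """Reduce an utterance to a rough pattern: lowercase, no fillers, no plural 's'."""
    tokens = []
    for token in utterance.lower().split():
        if token in FILLER_WORDS:
            continue
        # Crude plural stripping; it also clips words like "dress".
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        tokens.append(token)
    return tuple(tokens)


def find_repeated_patterns(utterances: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Return only the signatures shared by more than one utterance."""
    groups = defaultdict(list)
    for utterance in utterances:
        groups[pattern_signature(utterance)].append(utterance)
    return {sig: items for sig, items in groups.items() if len(items) > 1}


examples = [
    "order me a shirt",
    "order me some shirts",
    "order me shirts",
    "need a new shirt",
]

for signature, items in find_repeated_patterns(examples).items():
    print(" ".join(signature), "<-", items)
```

Each reported group is a candidate for trimming down to a single representative example.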
✅DO: vary sentence structure and key terms
Instead of repeating the same patterns, vary the sentence structure to teach the NLP model the different ways a user may express the problem. Here are some good training examples:
- order me a shirt
- need a new shirt
- dress me up with a fancy new shirt
Depending on the domain, it may even make sense to use a thesaurus, but (important!) only at the entity and key term level: a state-of-the-art NLP model will learn everything else itself. Country-specific variations deserve special consideration here.
Benefit: makes classification robust for variations
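Thesaurus entries at entity level can be kept as a plain synonym table that maps every variant to one canonical value. The following sketch is illustrative only: the synonym lists and function names are assumptions, with "pants" (US) versus "trousers" (UK) standing in for a country-specific variation.

```python
from typing import Dict, List, Optional

# Illustrative synonym table: canonical entity value -> accepted variants.
PRODUCT_SYNONYMS: Dict[str, List[str]] = {
    "shirt": ["shirt", "shirts", "tee", "t-shirt"],
    "pants": ["pants", "trousers", "slacks"],  # "trousers" is the UK term
    "socks": ["socks", "sock"],
}


def build_lookup(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the table so every variant points to its canonical value."""
    return {
        variant: canonical
        for canonical, variants in synonyms.items()
        for variant in variants
    }


LOOKUP = build_lookup(PRODUCT_SYNONYMS)


def resolve_product(utterance: str) -> Optional[str]:
    """Return the canonical product mentioned in the utterance, if any."""
    for token in utterance.lower().split():
        token = token.strip(".,!?")
        if token in LOOKUP:
            return LOOKUP[token]
    return None


print(resolve_product("I need new trousers"))
print(resolve_product("order me a shirt"))
```

Keeping variants in the entity table rather than in the intent examples keeps the intent training data small.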
❌DON’T: train the model with misspelled data (but prepare for it)
This one is obvious. Especially for entity resolution, some kind of spellchecking is a must, not only in training but also at live inference. For intent classification too, a model trained on misspelled data will at worst learn rubbish.
Benefit: makes classification robust for real user input
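One way to prepare for misspellings at inference time, without putting them into the training data, is fuzzy matching against the known entity values. This is a minimal sketch using `difflib.get_close_matches` from the Python standard library; the product list and the cutoff of 0.75 are assumptions for this example, and a production system would more likely rely on a dedicated spellchecker.

```python
import difflib
from typing import Optional

# Assumption: the closed set of entity values the chatbot knows.
KNOWN_PRODUCTS = ["shirt", "pants", "socks"]


def resolve_entity(token: str, cutoff: float = 0.75) -> Optional[str]:
    """Map a possibly misspelled token to the closest known product, if close enough."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_PRODUCTS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(resolve_entity("shrit"))  # transposed letters still resolve
print(resolve_entity("xyzzy"))  # nothing similar enough: None
```

The cutoff trades recall for precision: a lower value tolerates more typos but risks mapping unrelated words to a product.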
✅DO: edit and use real user input as training data
While you shouldn’t blindly copy and paste real user input into your training data, it is without any doubt the most valuable source of training data and of future improvements to your chatbot’s understanding. As long as unsupervised learning for NLP tasks is still in its infancy, some kind of manual review and editing process is a must for establishing continuous learning.
Benefit: improves the quality of your NLP model with each interaction
You can find information about how Botium can help in our Wiki and in our Blog.