Tutorial: Analyze and Improve Performance of IBM Watson Assistant Skill

Audience: This article is for you if you have an existing Watson Assistant skill and want to analyze its training data for consistency and performance. It also shows ways to improve the performance of your skill. We will use Botium Box to:

  • download the training data from your Watson Assistant skill to Botium Box
  • run static and dynamic analytics on it
  • present the results in the Botium Coach dashboard
  • augment the training data to improve the performance
  • upload the training data from Botium Box to your Watson Assistant skill
  • and finally validate the improvements

Attention: In machine learning, it typically makes no sense to use training data for testing, but with the approach outlined here you will detect serious flaws in the training data itself. It cannot tell you how your skill will perform in production later; for that you have to invest some more effort in preparing good test data (or use the data sets included in Botium Box).

Step 1: Open a Channel to your Watson Assistant Skill

Open the Chatbots menu in Botium Box, click the Register New Chatbot button and select IBM Watson Assistant API in the technology selection field. You can find everything you need in your IBM Cloud Console. In most cases you will authenticate with an IAM API Key; see the IBM docs for how to get yours. It is important to select Assistant V1 as SDK Version.

Register Chatbot in Botium Box

When everything is in place, click the Say Hello button to verify your credentials.

Step 2: Use the Test Case Wizard to Download Training Data to Botium Box

Open the Test Case Wizard, expand the Conversation Model Downloader section, enter a name for the new Test Set and click on Download from IBM Watson Assistant. Make sure the Create new Test project with this Test Set option is enabled to save yourself a few extra clicks later.

At this point, Botium Box will download the intents and user examples from your skill and build a Test Set from them. It also runs a static analysis of the training data: if it identifies obvious problems (such as duplicate utterances, empty user example lists, …), it will warn you immediately. You can now see the Test Set statistics.
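The kind of static check described above can be sketched in a few lines of Python. This is only an illustration, not Botium Box's implementation: the function name `analyze_training_data`, the plain dictionary layout (intent name mapped to a list of user examples) and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def analyze_training_data(intents):
    """Return static warnings for a mapping of intent name -> user examples."""
    warnings = []
    seen = defaultdict(set)  # normalized utterance -> intents that contain it

    for intent, examples in intents.items():
        if not examples:
            warnings.append(f"intent '{intent}' has no user examples")
        # Compare utterances case-insensitively and without surrounding spaces.
        normalized = [e.strip().lower() for e in examples]
        for text in sorted(set(normalized)):
            if normalized.count(text) > 1:
                warnings.append(f"'{text}' is duplicated within intent '{intent}'")
            seen[text].add(intent)

    # An utterance shared by several intents cannot be classified reliably.
    for text, owners in sorted(seen.items()):
        if len(owners) > 1:
            warnings.append(
                f"'{text}' appears in several intents: {', '.join(sorted(owners))}"
            )
    return warnings


# Hypothetical sample data, not taken from a real skill.
sample = {
    "greeting": ["hello", "good day", "Good day"],
    "goodbye": ["bye", "good day"],
    "teach": [],
}
for warning in analyze_training_data(sample):
    print(warning)
```

Normalizing case and whitespace before comparing is a design choice for this sketch; a stricter or looser notion of "duplicate" may suit your data better.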

By switching to the Test Cases tab, you should now see something familiar – the intent list as you named it in your Watson Assistant skill, as well as the user examples for them!

Step 3: Run a First Test Session

In the Test Set Dashboard, click the Start Test Session button and watch the test session progress. Botium Box will send all of your training data, one user example after the other, to the Watson Assistant skill and note any irregularities, for example if a user example resolves to a different intent than expected.
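Conceptually, such a test session is a loop over all user examples that compares the predicted intent with the expected one. The following Python sketch shows the idea under stated assumptions: `run_test_session` is a hypothetical helper, and `fake_classify` is a stand-in for the real call to Watson Assistant, so no API calls are made.

```python
def run_test_session(intents, classify):
    """Send every user example to `classify` and collect unexpected results."""
    mismatches = []
    total = 0
    for expected_intent, examples in intents.items():
        for text in examples:
            total += 1
            predicted_intent = classify(text)
            if predicted_intent != expected_intent:
                mismatches.append((text, expected_intent, predicted_intent))
    return total, mismatches


# Hypothetical stand-in for the real NLU call; it only knows two phrases
# and answers "greeting" for everything else.
def fake_classify(text):
    return {"hello": "greeting", "bye": "goodbye"}.get(text, "greeting")


training_data = {"greeting": ["hello", "good day"], "goodbye": ["bye", "see you"]}
total, mismatches = run_test_session(training_data, fake_classify)
print(f"{len(mismatches)} of {total} user examples resolved to an unexpected intent")
for text, expected, predicted in mismatches:
    print(f"  '{text}': expected {expected}, got {predicted}")
```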

Attention: Depending on the Watson Assistant plan in place you will have to pay for each API call!

Step 4: Open the Results in the Botium Coach Dashboard

In the Botium Coach Dashboard you will now receive some hints about what might be wrong with your training data. You won’t get much out of the confidence score evaluations, as we are testing with training data, so anything but a very high average confidence score would be really surprising (or alarming).

You can have a look at the confusion matrix, but for the same reason as above, you won’t really get valuable hints there.

You definitely should have a look at the Mismatch Probability Risks section. It lists user examples with a high risk of matching an incorrect intent, which would usually trigger a disambiguation in Watson Assistant (if this feature is enabled); this shouldn’t happen for training data. Botium Coach evaluates the alternate intents lists to show you any intents and user examples that are very likely to cause a mismatch due to similar confidence scores. In the example chart, the user examples “good day”, “can I teach you” and some more have a mismatch probability score of 1, which means that Watson Assistant is not able to distinguish between two intents.

Clicking inside the chart will show the alternate intents list Watson Assistant returned for this user example. A score like this means there is definitely something odd with your training data; most likely, the user example “good day” is included in the user examples of both intents.
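To make the idea of a mismatch risk derived from similar confidence scores concrete, here is a minimal Python sketch. The formula is an assumption chosen for illustration (runner-up confidence divided by top confidence); the article does not state how Botium Coach actually computes its score, and the confidence values are invented.

```python
def mismatch_risk(intent_confidences):
    """Ratio of runner-up to top confidence: 1.0 means the two intents tie."""
    ranked = sorted(intent_confidences.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] == 0:
        return 0.0  # nothing to confuse the top intent with
    return ranked[1] / ranked[0]


# Hypothetical alternate intents lists, keyed by user example.
results = {
    "good day": {"greeting": 0.92, "goodbye": 0.92, "teach": 0.10},
    "hello": {"greeting": 0.98, "goodbye": 0.07, "teach": 0.02},
}
for text, confidences in results.items():
    print(f"{text!r}: mismatch risk {mismatch_risk(confidences):.2f}")
```

With this definition, a user example that is present in two intents produces two equal top confidences and therefore a score of 1, matching the situation described above.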

You can get other valuable insights from the Intent Mismatch Probability Risks, Alternative Intents list. It highlights any intents that often appear in a rather prominent position in the alternate intents lists of other intents.

Step 5: Augment Training Data

You can now choose to:

  • augment your training data in your Watson Assistant workspace and then download the augmented training data to Botium Box again
  • or use the included Botium Box tools to augment the training data in Botium Box and upload it to your Watson Assistant with the Test Case Wizard

When doing it in the Watson Assistant workspace, use the Test Case Wizard again to either overwrite the Test Set data or create a new one.

When doing it in Botium Box, you can use the Botium Box tools to augment the training data:

  • Use the Test Case Designer to manually add and remove user examples to the training data
  • Use the Paraphraser to generate additional user examples and add them to the training data
  • Merge training data from the included Botium Box data sets, available for various domains in various languages
  • Merge training data from other sources

For example, you can now identify the intents for which there is not enough training data available (fewer than 10 user examples), and use the Paraphraser to generate more.
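Finding those intents is a simple count. The Python sketch below assumes the training data is available as a dictionary of intent name to user examples; the helper name `underrepresented_intents` and the sample data are hypothetical.

```python
MIN_EXAMPLES = 10  # threshold taken from the text above


def underrepresented_intents(intents, minimum=MIN_EXAMPLES):
    """Map each intent with too few user examples to the number still missing."""
    return {
        intent: minimum - len(examples)
        for intent, examples in intents.items()
        if len(examples) < minimum
    }


# Hypothetical sample data.
training_data = {
    "greeting": [f"greeting example {i}" for i in range(12)],
    "goodbye": ["bye", "see you", "good night"],
}
for intent, missing in underrepresented_intents(training_data).items():
    print(f"{intent}: add at least {missing} more user examples")
```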

To upload the additional training data to your Watson Assistant workspace, open the Test Set Dashboard, click on Upload Conversation Model to Chatbot Provider and select your IBM Watson chatbot as registered in the first step of this tutorial. You can now choose to:

  • Create a new blank workspace
  • Copy and extend the existing workspace
  • Merge user examples into the existing workspace

All three are valid choices, but I would not recommend the third option for workspaces that are currently live, because it changes the training data of a skill that is already serving users.
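For intuition, merging user examples into existing training data could look like the following Python sketch. It is an assumption about what such a merge involves, not a description of what Botium Box does during upload; `merge_user_examples` and the sample data are hypothetical. The function returns a new mapping and leaves the original untouched, which mirrors the safer "copy and extend" choice.

```python
def merge_user_examples(existing, additions):
    """Return a new intent mapping with additions merged in, skipping duplicates."""
    # Copy the lists so the caller's data is never modified in place.
    merged = {intent: list(examples) for intent, examples in existing.items()}
    for intent, examples in additions.items():
        target = merged.setdefault(intent, [])
        known = {e.strip().lower() for e in target}
        for text in examples:
            key = text.strip().lower()
            if key not in known:
                target.append(text)
                known.add(key)
    return merged


# Hypothetical sample data.
existing = {"greeting": ["hello"]}
additions = {"greeting": ["Hello", "hi there"], "goodbye": ["bye"]}
print(merge_user_examples(existing, additions))
```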

Step 6: Validate Improvements

Again, run a test session with your augmented training data and open it in Botium Coach when ready. In Botium Coach, you can select a secondary test session for performance comparison. It will show you the intents and user examples that are now working better or worse than before.

As usual in Botium Coach, you can drill down to the single user example level to trace any improvements or deteriorations.
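The comparison of two test sessions can be pictured as a per-example difference of confidence scores. The Python sketch below is an illustration only: the function `compare_sessions`, the tolerance value and the confidence numbers are assumptions, not Botium Coach internals.

```python
def compare_sessions(before, after, tolerance=0.01):
    """Classify each user example as improved, worse or unchanged by confidence."""
    report = {"improved": [], "worse": [], "unchanged": []}
    # Only user examples present in both sessions can be compared.
    for text in sorted(before.keys() & after.keys()):
        delta = after[text] - before[text]
        if delta > tolerance:
            report["improved"].append(text)
        elif delta < -tolerance:
            report["worse"].append(text)
        else:
            report["unchanged"].append(text)
    return report


# Hypothetical confidence of the expected intent per user example.
first_session = {"good day": 0.50, "hello": 0.98, "bye": 0.90}
second_session = {"good day": 0.91, "hello": 0.98, "bye": 0.75}
for outcome, texts in compare_sessions(first_session, second_session).items():
    print(f"{outcome}: {texts}")
```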


While it usually makes no sense to test a machine learning model with data it has been trained on, doing so can help to visualize issues with the training data itself, such as mismatch probabilities and duplicate user examples. Flaws in the training data will certainly have a negative impact on the overall NLU performance of your chatbot.

Take your Botium Coach Test Drive today!