Genesys Dialog Engine Bot Returning Wrong Intent After Upgrading to NLU V2 Model

push_force · January 15, 2026, 12:35am

Upgraded our Dialog Engine bot from the V1 NLU model to V2 as recommended in the deprecation notice. After the upgrade, the bot is consistently misclassifying intents that worked perfectly on V1.

Specifically, our “transfer_to_billing” intent is now being matched when customers say phrases related to checking their account balance. Before the upgrade, those phrases correctly matched “check_balance.”

Training data for both intents has not changed. “transfer_to_billing” has 25 training utterances. “check_balance” has 30 training utterances. There is no overlap between the two - I audited every single utterance manually.
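For anyone repeating that audit, a manual check for exact duplicates can miss utterances that differ by only a word or two. Below is a minimal sketch of a word-overlap check between two intents. The function names and the 0.5 threshold are illustrative assumptions, and plain word overlap is only a rough proxy. It says nothing about how the V2 model itself measures similarity.

```python
import re


def tokens(utterance):
    """Lowercase the utterance and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", utterance.lower()))


def jaccard(a, b):
    """Jaccard similarity of the word sets of two utterances (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def near_duplicates(intent_a, intent_b, threshold=0.5):
    """Return (utterance_a, utterance_b, score) for cross-intent pairs
    whose word overlap is at or above the threshold (an arbitrary cutoff)."""
    pairs = []
    for a in intent_a:
        for b in intent_b:
            score = jaccard(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs
```

Feeding it the two utterance lists exported from the bot gives a short list of cross-intent pairs worth a second look.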

Test results:

“I need to check my balance” → V1: check_balance (correct) → V2: transfer_to_billing (wrong)
“what is my current balance” → V1: check_balance (correct) → V2: transfer_to_billing (wrong)
“show me my account balance” → V1: check_balance (correct) → V2: check_balance (correct)

The confidence scores on V2 are also lower across the board - averaging 0.62 compared to 0.85 on V1. We are on us-east-1.
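To keep the comparison reproducible, the three test results above can be written down as data and summarized. This is just a sketch of how I tabulate them. The helper names are my own and nothing here calls Genesys.

```python
# Rows copied from the test results above:
# (utterance, expected intent, V1 result, V2 result)
RESULTS = [
    ("I need to check my balance", "check_balance", "check_balance", "transfer_to_billing"),
    ("what is my current balance", "check_balance", "check_balance", "transfer_to_billing"),
    ("show me my account balance", "check_balance", "check_balance", "check_balance"),
]

V1_COLUMN = 2
V2_COLUMN = 3


def accuracy(results, column):
    """Fraction of rows where the model's result matches the expected intent."""
    correct = sum(1 for row in results if row[column] == row[1])
    return correct / len(results)


def regressions(results):
    """Utterances that V1 classified correctly and V2 classified incorrectly."""
    return [
        row[0]
        for row in results
        if row[V1_COLUMN] == row[1] and row[V2_COLUMN] != row[1]
    ]
```

On these three rows V1 gets all of them right and V2 gets one of three, with the first two utterances listed as regressions.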

cx_maria · January 15, 2026, 1:36am

We experienced the same issue during our NLU V2 migration. The V2 model uses a different tokenization and embedding architecture from V1. V1 used keyword-based matching with TF-IDF weighting. V2 uses transformer-based contextual embeddings.

The practical impact is that V2 understands semantic similarity at a deeper level, which means intents with semantically similar names or training data will overlap more. “billing” and “balance” are semantically close in the V2 embedding space.

The fix requires adjusting your training data strategy for V2:

Add negative examples. In each intent, add 5-10 utterances from the competing intent as explicit negative examples using the “Mark as irrelevant” feature in the NLU training panel.
Increase training utterance count. V2 needs a minimum of 40-50 utterances per intent to build a reliable decision boundary. Your 25-30 utterances were sufficient for V1 keyword matching but are too few for the V2 contextual model.
After updating training data, you must click “Train” explicitly. V2 does not auto-retrain on utterance changes - this is different behavior from V1.
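The utterance-count step is easy to check before retraining. Here is a minimal sketch that flags intents below the lower end of the 40-50 range mentioned above. That number is what worked for us, not a documented limit, and the function name is made up for illustration.

```python
# Lower bound of the 40-50 range suggested in this post; an assumption,
# not an official Genesys threshold.
MIN_UTTERANCES = 40


def under_trained(intent_counts, minimum=MIN_UTTERANCES):
    """Return {intent: number of utterances still missing} for intents below the minimum."""
    return {
        name: minimum - count
        for name, count in intent_counts.items()
        if count < minimum
    }


# Counts taken from the original post.
counts = {"transfer_to_billing": 25, "check_balance": 30}
shortfall = under_trained(counts)
```

With the counts from the original post, this reports 15 missing utterances for transfer_to_billing and 10 for check_balance.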

We manage this through Terraform using genesyscloud_flow_bot resources, and we version-control all training utterances in YAML files that feed into the NLU training API.

SofiaN_74 · January 18, 2026, 1:36am

The training data approach is correct but I want to flag something that wasted two days of my time on a similar issue.

Check if you have any Data Actions called within the bot flow that use the same variable names as your intents. We had a data action output variable called billingStatus in the same bot flow. On V2, the NLU engine apparently considers in-flow variable names as contextual signals for intent classification. When billingStatus was populated with a value from a previous turn, it biased the classifier toward the billing intent.

The workaround was to rename the data action output variables to use generic names like apiResult1 instead of semantically loaded names. Completely counterintuitive, but the confidence scores jumped from 0.65 back to 0.88 after the rename.

I have no idea if this is a bug or a feature. Genesys documentation says nothing about in-flow variables affecting NLU classification.

tim_e · January 20, 2026, 1:36am

Oh that is a fascinating finding about variable names! We have not seen that specific behavior, but it is consistent with how transformer models work - they will pick up on any contextual signal in the input window.

One more thing that helped us enormously during our V2 migration: use the NLU Testing panel to run a full regression test before deploying. Go to the bot flow editor > NLU tab > Testing. You can paste a CSV of test utterances and expected intents, and it will run them all against the current model and show you precision/recall metrics per intent.

We built a CI/CD gate around this. Our pipeline exports the bot flow, runs 200 test utterances against the NLU API, and blocks deployment if any intent drops below 0.80 F1 score. This caught three intent regression issues during V2 migration that we would have shipped to production otherwise.
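The gate logic itself is small. Below is a simplified sketch of the scoring step, assuming you already have (expected intent, predicted intent) pairs back from whatever runs your test utterances. The function names are illustrative and this is not our actual pipeline code.

```python
from collections import Counter

F1_THRESHOLD = 0.80  # the per-intent cutoff described above


def per_intent_f1(pairs):
    """Compute the F1 score per intent from (expected, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fn[expected] += 1   # the expected intent was missed
            fp[predicted] += 1  # the predicted intent fired wrongly
    scores = {}
    for intent in set(tp) | set(fp) | set(fn):
        # F1 = 2TP / (2TP + FP + FN)
        denominator = 2 * tp[intent] + fp[intent] + fn[intent]
        scores[intent] = 2 * tp[intent] / denominator if denominator else 0.0
    return scores


def failing_intents(pairs, threshold=F1_THRESHOLD):
    """Sorted list of intents whose F1 score is below the threshold."""
    scores = per_intent_f1(pairs)
    return sorted(intent for intent, score in scores.items() if score < threshold)
```

In the pipeline, a non-empty result from failing_intents is what fails the build. Run against the three V2 results from the original post, both check_balance and transfer_to_billing would fail the gate.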

The testing API endpoint is POST /api/v2/flows/datatables/{flowId}/nlu/test - not well documented but it exists and it is stable.