#GivenWhenThenWithStyle

How to specify something random?

Specification by Example works well for deterministic processes, where you know the expected result upfront. The next challenge deals with a situation where there is no clear expected result – generating random numbers.

We have a chat-bot that needs to simulate human typing. Instead of sending the entire message at once, we want it to insert short pauses between each letter, say between 0.2 and 0.5 seconds. And these pauses should vary randomly, so the human on the other end would not suspect they’re chatting with a robot. How can we describe this with Gherkin?

Here is the challenge

Option 1: Specify overall process properties

One option would be to directly translate business expectations into a single statement, and hide the details of the test in the automation layer.

Given the bot receives a message "Hi there"
When the bot replies with "Hello"
Then the bot should insert random pauses from 0.2 to 0.5 seconds between each letter

Option 2: Specify randomness in individual actions explicitly

Alternatively, we could expose more information in the scenario, and leave the individual actions for the automation layer to prove.

Given the bot receives a message "Hi there"
When the bot replies with "Hello"
Then the pause between typing H and e should be random between 0.2 and 0.5 seconds
And the pause between typing e and l should be random between 0.2 and 0.5 seconds
And the pause between typing l and l should be random between 0.2 and 0.5 seconds
And the pause between typing l and o should be random between 0.2 and 0.5 seconds

Option 3: Test properties of a larger set

Although there’s no clear deterministic value for each individual action, non-deterministic processes may have some deterministic properties when observed over a large group. We could try specifying those instead of individual instances:

Given the bot receives 10000 messages
When the bot replies to all messages
Then the minimum delay between typing should be greater than 0.2 seconds
And the maximum delay between typing should be less than 0.5 seconds
And there should be at least 1000 unique delay values

Option 4: Fake the randomness for automation

Alternatively, we could make the process deterministic by providing a “fake” random number generator for testing purposes. In production, the system would use an actual random number generator. This would not prove end-to-end integration, but it would prove that we’re transforming the random number output (usually between 0 and 1) to the interval required by the scenario (0.2s to 0.5s).

Given the random number generator produces the following sequence: 0.9, 0.4, 0.1, 0, 0.6
When the bot replies with "Hello"
Then the following pauses should be inserted after:
| H | 0.47s |
| e | 0.32s |
| l | 0.23s |
| l | 0.20s |
| o | 0.38s |
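For illustration, the transformation this option assumes could be as simple as the following Python sketch (the function name and the rounding are mine, not part of the challenge):

def scale_to_interval(raw, minimum=0.2, maximum=0.5):
    # map a raw random value (between 0 and 1) onto the configured pause interval
    return minimum + raw * (maximum - minimum)

# with the sequence from the scenario (0.9, 0.4, 0.1, 0, 0.6) this yields
# 0.47, 0.32, 0.23, 0.20 and 0.38 seconds, matching the table above
pauses = [round(scale_to_interval(raw), 2) for raw in [0.9, 0.4, 0.1, 0, 0.6]]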

Solving: How to specify that something should be random?

Last time, you voted for options to describe a process that involved randomness. The most difficult part of solving these kinds of problems is balancing specificity, reliability and relevance. We can make the test very specific, but then it may not be particularly relevant to the overall purpose. Or we can keep it easy to understand and relevant, but then the tests would not be reliable.

Among the options offered in the original challenge, “Specify overall process properties” and “Test properties of a larger set” were the only options to get votes, and they got an equal number. Some of the reasons why people voted for them were that they were not too prescriptive, and that they leave developers with enough room to come up with the right solutions.

This time I’m going to disagree with community votes again. Both of these options aim for a relatively general description. The first isn’t particularly specific about individual actions, and the second isn’t particularly reliable as a test. In a way, they both miss the point. And the first misses it completely. But more on this later.

As an alternative, Andreas Worm proposed seeding the random number generator to make the scenarios both reliable and specific enough:

So when I create a robot with a seed “robot 17”, then the next three values are 0,2 0,1 0,2. And suddenly your test becomes deterministic again.
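In code, the seeding approach might look something like this minimal Python sketch, using the standard library generator (the seed is just an example, and the exact values produced depend on the generator implementation):

import random

# the same seed always produces the same 'random' sequence,
# so a test can rely on the values being repeatable
first = random.Random("robot 17")
second = random.Random("robot 17")

assert [first.random() for _ in range(3)] == [second.random() for _ in range(3)]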

This is a common technical workaround for non-deterministic system elements. It’s a variant on the fourth proposed option (“Fake the randomness for automation”). However, both this proposal and the original option 4 miss out on a big opportunity to improve the model. Any set of scenarios with too many boundaries, or too many key examples, is a symptom of a modelling problem. We can try to solve this in testing, but it will hurt a lot more than solving it in the model. Spotting modelling problems in examples can pay big dividends later, so it’s important to pay attention to such signals.

Ken Pugh wrote a lovely blog post exploring this further, suggesting that the problem be split into several layers. On the top layer, he proposed an outcome “Then the reply has characteristics that mimic a human being”, which could then be broken down further into introducing delays, deleting parts of a message and so on. At a lower level, Ken proposes something similar to Option 4 from the original proposal. I like Ken’s approach, but I’d phrase it slightly differently. More on that a bit later.

First, we need to investigate the key reason why all the proposed options were difficult to compare: In essence, they test different things.

3S: Set Strict Scope

When faced with a relatively complex problem that needs to be specified with examples, the first tool I like to use is 3S: Set Strict Scope. What exactly is this feature responsible for? What other features are related to this, but out of scope for now? Answering these questions will help us decide what to specify, where to test it and how much.

To set the scope for a feature file, and ensure that everyone is aligned on it, I usually draw a small informal process diagram. This helps us model what the system should do, instead of focusing too much on how it does things. For example, something like the picture below could describe the feature for sending messages with delays.

There are several moving parts here, and potential boundaries for testing. The options offered for voting in the challenge describe different parts of this diagram, from different perspectives. We can’t really judge if a scenario is better or worse than the alternatives without deciding how much of this diagram the feature should cover.

Let’s quickly analyse the options.

Option 1: Specify overall process properties

The first option tried to capture the entire process with a single statement:

Given the bot receives a message "Hi there"
When the bot replies with "Hello"
Then the bot should insert random pauses from 0.2 to 0.5 seconds between each letter

The relationship between inputs and outputs is really unclear here. Would a different message cause different delays? Would a different reply cause something else?

The initial message to the bot (“Hi there”) and the reply of the bot (“Hello”) are totally irrelevant for this test. They have no impact on the outcome, and just create noise. Inputs that have no relevant effect on the outputs are usually the result of misaligned scope.

The numbers 0.2 and 0.5 are a choice, based on some research around human typing, but they in no way depend on “Hi there” or “Hello”. If we want to focus the feature on how to configure the message sender, then a rephrased option 1 would be the best choice – for example, if typing longer messages should create longer delays than typing shorter ones. If we want to focus on something else, this would be the wrong choice.

Option 2: Specify randomness in individual actions explicitly

The second option looked at the results of the message sending process, but it also involved the original client message, and the bot response.

Given the bot receives a message "Hi there"
When the bot replies with "Hello"
Then the pause between typing H and e should be random between 0.2 and 0.5 seconds
And the pause between typing e and l should be random between 0.2 and 0.5 seconds
And the pause between typing l and l should be random between 0.2 and 0.5 seconds
And the pause between typing l and o should be random between 0.2 and 0.5 seconds

This scenario, similar to option 1, suffers from irrelevant inputs. The message received by the bot (“Hi there”) is not important for the outcome in this scenario, suggesting that the scope is misaligned. The message in the action (“Hello”) has some impact on the outcome, but it’s again unclear what that impact is. Would a different reply cause some other result?

The scenario tries to prove that the message sender inserts something between each letter in the response. Unfortunately, it’s too wordy and error-prone. The implementation of the message sender could get just one random number at the start and always use it for delays. The messages would come out deterministically, defeating the purpose of the feature, but the test would still pass.
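To illustrate, here is a hypothetical (and deliberately lazy) sender, sketched in Python, that would satisfy every assertion in that scenario while defeating its intent:

import random
import time

class LazyMessageSender:
    def __init__(self):
        # a single random number, drawn once, technically "between 0.2 and 0.5 seconds"
        self.pause = 0.2 + random.random() * 0.3

    def send(self, message, emit):
        for character in message:
            emit(character)
            # every pause is identical, so the typing rhythm is perfectly predictable
            time.sleep(self.pause)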

Option 3: Test properties of a larger set

Option 3 attempts to focus just on the message sending process, ignoring individual messages:

Given the bot receives 10000 messages
When the bot replies to all messages
Then the minimum delay between typing should be greater than 0.2 seconds
And the maximum delay between typing should be less than 0.5 seconds
And there should be at least 1000 unique delay values

This scenario is actually testing the random number generator, through an intermediary (the message sender). It tries to prove that we are generating numbers with the right distribution and confidence. The action “when the bot replies to all messages” has no impact on the outcome – we could easily remove that line and the scenario would still make sense (or at least as much sense as it did before).

This scenario is going to be unreliable in test automation. Random numbers are, well, random. The probability that 10000 random numbers contain 1000 different values is high, but it’s not guaranteed. Tests like this one might fail occasionally, even though the feature works well. Also, there are lots of ways for this test to pass while a serious bug breaks the functionality. For example, a sequence where the first 9000 numbers are all the same and the last 1000 numbers are different would pass this test. So would a continuously incrementing sequence. Either one would make the system predictable and defeat the purpose of the feature.
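To make that concrete, here is a small sketch with purely illustrative numbers, showing a thoroughly predictable delay sequence that still satisfies all three assertions:

# 9000 identical delays followed by 1000 distinct, steadily increasing ones
delays = [0.35] * 9000 + [round(0.21 + i * 0.00025, 5) for i in range(1000)]

assert min(delays) > 0.2            # minimum delay greater than 0.2 seconds
assert max(delays) < 0.5            # maximum delay less than 0.5 seconds
assert len(set(delays)) >= 1000     # at least 1000 unique delay values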

So is this in the scope of our feature? It depends. Do we want to prove the randomness of the numbers, or focus on how the message sender uses the random number generator? If the feature is supposed to just make the system a bit less deterministic, perhaps we don’t need to specifically test the distribution of numbers. On the other hand, if we want to run a government-backed lottery, proving the actual distribution of randomness becomes quite important.

Option 4: Fake the randomness for automation

The final option in the challenge focused on the message sender, and assumed that we can control the random number generator for testing purposes:

Given the random number generator produces the following sequence: 0.9, 0.4, 0.1, 0, 0.6
When the bot replies with "Hello"
Then the following pauses should be inserted after:
| H | 0.47s |
| e | 0.32s |
| l | 0.23s |
| l | 0.20s |
| o | 0.38s |

Similar to the previous options, this one suffers from the scoping problem. The relationship between the inputs and the outputs isn’t particularly clear. How exactly does the first sequence of numbers lead to the second? On the other hand, unlike all the previous options, this one does not lead to brittle tests. Instead of testing the random number generator, it presumes that we can inject the numbers into the system under test. Going back to the question of scope, this option only covers the message sender, and explicitly leaves the actual random number generator out of scope.

Isolate sources of variability

The fourth option is an example of a common trick to deal with testing complex systems: Isolate sources of variability. Split the scope into several features, and prove them separately. Then write an integration test (usually a technical one, outside the scope of Given-When-Then tools) that proves the link between the various aspects. The isolation trick is particularly useful if one part of the system is not deterministic. By isolating that from the rest, we can make the remaining specifications and tests deterministic.
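As a sketch of what that isolation might look like in code (Python, with illustrative names), both the random source and the clock could be passed into the message sender rather than hard-wired:

import time

class MessageSender:
    def __init__(self, random_source, min_pause=0.2, max_pause=0.5, sleep=time.sleep):
        self.random_source = random_source
        self.min_pause = min_pause
        self.max_pause = max_pause
        self.sleep = sleep

    def send(self, message, emit):
        # emit each character, then wait for a pause derived from the random source
        pauses = []
        for character in message:
            emit(character)
            pause = self.min_pause + self.random_source() * (self.max_pause - self.min_pause)
            self.sleep(pause)
            pauses.append(pause)
        return pauses

In production the sender would receive a real random number generator; in the scenarios below it can receive a fully predictable one.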

Technically, the way to achieve such isolation in testing is often to use mocks, or some special configuration preset (“Bot 17”). I strongly suggest modelling this properly instead. Similar to the solution for How to deal with pauses and timeouts?, where introducing the concept of a business clock simplified both the system model and the tests, modelling sources of variability as first-order domain concepts can lead to a more flexible design, and better software. Once we capture this collaborator and its interactions in our model, we could provide an implementation that uses a certified hardware random number generator for systems where a high degree of confidence is required, one that uses the standard CPU pseudo-random generator where we just want to make the system a bit less deterministic, or a fully predictable option for testing purposes.
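For example, the random source could become an explicit collaborator with several interchangeable implementations. A sketch, with illustrative names (the "certified" variant here just uses the OS entropy pool as a stand-in for dedicated hardware):

import random
import secrets

def certified_random():
    # for systems that need a high degree of confidence in the distribution
    return secrets.randbelow(1_000_000) / 1_000_000

def pseudo_random():
    # good enough when we just want the behaviour to be less predictable
    return random.random()

class FixedSequenceRandom:
    # a fully predictable option for testing purposes
    def __init__(self, values):
        self.values = list(values)

    def __call__(self):
        return self.values.pop(0)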

By setting a strict scope, and isolating sources of variability, we can create scenarios that are focused, clear, and that we can automate reliably.

What’s our scope?

So, to answer the original challenge, let’s set a strict scope. Do we want to actually prove the distribution of random numbers? Is the delay different for different types of messages? Or do we just want to prove that, given some message, the appropriate delays get inserted between letters? Let’s focus on the message sender itself for this feature. The “bot” part is actually out of scope, as is the original message from a user. The random number generator is also out of scope.

We need to prove that the sender injects an appropriate pause between each character when sending, to make the system mimic human behaviour. The phrase “between 0.2 and 0.5 seconds” is not an output, as in most of the offered options, but an input for this scope. It’s how the message sender knows how to transform the numbers coming in from the random number generator.

I’ll use the version Ken Pugh proposed in his post, slightly reworded to match the diagram above and take the new configuration parameters into consideration:

Scenario: Message sender inserts delays between characters

  To make users think they are talking to a human operator, 
  the bot should mimic a human being when replying

Given the message sender is configured to add pauses between 0.2 and 0.5 seconds
And the message sender needs to send the message "Hello"
And the random number generator produces the following values:
| value |
| 1.0   |
| 0.9   |
| 0.4   |
| 0.6   |
| 0.0   |
When the message sender dispatches the message
Then the pauses between characters should be:
| character | pause |
| H         | .5 s  |
| e         | .47 s |
| l         | .32 s |
| l         | .38 s |
| o         | .2 s  |

If multiple scenarios use the same message sender configuration, we could move the first line (“Given the message sender is configured…”) to a background section, and make the scenario even simpler.
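For completeness, the glue code for this scenario might look roughly like the sketch below, assuming the automation layer is written in Python with behave, and reusing the MessageSender and FixedSequenceRandom sketches from earlier (all names are illustrative):

from behave import given, when, then

@given('the message sender is configured to add pauses between {low} and {high} seconds')
def configure_sender(context, low, high):
    context.low, context.high = float(low), float(high)

@given('the message sender needs to send the message "{message}"')
def remember_message(context, message):
    context.message = message

@given('the random number generator produces the following values:')
def stub_random_source(context):
    # a fully predictable random source, fed from the scenario table
    context.random_source = FixedSequenceRandom(float(row['value']) for row in context.table)

@when('the message sender dispatches the message')
def dispatch_message(context):
    sender = MessageSender(context.random_source, context.low, context.high,
                           sleep=lambda seconds: None)  # no real waiting in tests
    context.pauses = sender.send(context.message, emit=lambda character: None)

@then('the pauses between characters should be:')
def check_pauses(context):
    expected = [float(row['pause'].replace('s', '').strip()) for row in context.table]
    # round both sides to avoid floating point noise in the comparison
    assert [round(p, 2) for p in context.pauses] == [round(e, 2) for e in expected]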

Of course, any scoping decision leaves some things out. If there is an actual need to prove the randomness of the RNG system, this could be done with separate tests (serious randomness testing is usually beyond the scope of Given-When-Then tools).