#GivenWhenThenWithStyle

Handle pauses & timeouts effectively

This challenge is dealing with wait statements and pauses in specifications.

My testers keep adding pauses into Given-When-Then. They claim it’s necessary because of how the tests work. I don’t like it because the numbers look random and do not reflect our business rules. Also, the pauses sometimes aren’t long enough, so the tests fail even though there are no bugs in the system version we’re testing. How do we avoid that?

Here is the challenge

This problem is symptomatic of working with an asynchronous process, often an external system or an executable outside of your immediate control. For example, think of an e-commerce platform that passes orders through a risk evaluation module before confirming. The risk check might take a few seconds, and the user may not be able to pay until the order is confirmed. In a visual user interface, this waiting period is often shown with a spinner or a progress bar. A bad, but common way of capturing this in Given-When-Then is to add a pause between actions:

Given a user checks out with a shopping cart worth $200
When the user submits an order 
And the system waits 3 seconds
And the user authorises the payment
Then ...

In some cases, it’s not the system under test requiring pauses, but the way a test is automated. A common example is executing tests through a web browser automation tool such as Selenium or Puppeteer. The framework can load a web page using a browser, but that page might need to fetch external JavaScript files or additional content to fully initialise the user interface. Triggering an action immediately after a web page loads might cause the test to fail, because the related elements are not yet available. Adding pauses to Given-When-Then scenarios is also a common workaround for these kinds of issues:

Given a user registers successfully
When the account page reloads
And the user waits 2 seconds
Then the account page displays "Account approved"

In both these situations the testing process requires a bit of time between an action and observing its result. But the actual period will vary from test to test, even between two executions of the same test. It will depend on network latency, CPU load and the amount of data to process. That is why it’s not possible to correctly select the duration of a pause upfront. Choosing a short pause causes tests to occasionally fail, because the asynchronous process sometimes does not complete before the test framework moves on to the next step, so people waste time chasing ghost problems. Setting a pause that’s too long delays feedback and slows down testing excessively.

How would you rephrase and restructure this instead?

Solving: How to deal with pauses and timeouts?

The challenge was to improve specifications that deal with pauses, in particular those that wait for a period of time. For a detailed explanation of the problem, check out the original post. This article contains an analysis of the community responses, two ways of cleaning up the problematic specs, and tips on more general approaches to solving similar problems.

Some people noted that Given-When-Then tools aren’t the right solution for testing systems requiring synchronization. However, issues with waiting and synchronization are general problems of test design, not something specific to Given-When-Then tools. The design ideas outlined in this article are applicable to other classes of tools, and other types of tests as well. However, there is one specific aspect of Given-When-Then that is important to consider first: where to define the pauses.

Move waiting into step implementations

Lada Flac, commenting on the problem of asynchronous web page elements, suggested pushing the pauses into step implementations:

“I don’t see a need to add the waits to gherkin steps. I would add them only to the implementation steps. And those would not be implicit waits, but those where a script would try to find an element until the timeout.”

Lada is spot on. The most common cause for waiting in tests is the way a test is executed. For example, testing through a browser, or over a network, implies asynchronous network operations, so it may require synchronisation. In such cases, it’s best to place the waiting in the step implementations, not the scenario definition. Describing a flexible periodic polling process with exponential back-off is a huge challenge in Given-When-Then, but it’s very easy to do with C#, Java or any other programming language.
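To illustrate, here is a minimal sketch of such a polling helper in Java. It is only a sketch: the class and method names are illustrative, and a real automation layer would likely build on the polling support of its test framework instead.

import java.time.Duration;
import java.util.function.Supplier;

// Periodic polling with exponential back-off: the kind of logic that
// belongs in the automation layer, not in a Given-When-Then scenario.
public final class Poller {

    // Repeatedly evaluates a condition until it holds or the overall
    // timeout expires, doubling the pause between attempts.
    public static boolean pollWithBackoff(Supplier<Boolean> condition,
                                          Duration initialDelay,
                                          Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        long delayMillis = initialDelay.toMillis();
        while (System.nanoTime() < deadline) {
            if (condition.get()) {
                return true;
            }
            Thread.sleep(delayMillis);
            delayMillis = delayMillis * 2; // exponential back-off
        }
        return condition.get(); // one final check at the deadline
    }
}

A step implementation can then call something like Poller.pollWithBackoff(() -> page.displays("Account approved"), Duration.ofMillis(100), Duration.ofSeconds(10)), where page.displays is a hypothetical helper for the scenario in question.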

Kuba Nowak suggested a similar way to rephrase the waiting, using steps such as “And waits until payment page opens”, and then dealing with the meaning of “opens” within the step implementation.

Moving the waiting into step implementations makes the scenario definitions shorter and more focused on the problem domain. A team can discuss such scenarios more easily with business representatives. By moving the pauses into step implementations, as Lada and Kuba suggest, we can also postpone the decision on how to perform them. This gives us the option to avoid waiting for time. Jonathan Timm nicely explained it:

“Regardless of the tool, automating user interaction needs to be a two-way conversation between the application under test and the automation code. Human users take cues from user interfaces unconsciously, and the same cues can be listened for with code using existing native functionality in tools like Selenium.”

Jonathan noted that WebDriver, a popular user interface automation tool, supports waiting until an element becomes visible or clickable. Instead of pausing for a specified period, the implementation of a step can pause until some interface elements appear or become active.
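As a sketch, a step implementation using this approach with Selenium might look like the following. The element locator and the ten-second timeout are assumptions for the example, not part of the original scenario.

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class AccountPageSteps {

    private final WebDriver driver;

    public AccountPageSteps(WebDriver driver) {
        this.driver = driver;
    }

    // Supports a step such as 'Then the account page displays ...':
    // waits for the status element to become visible instead of
    // pausing for a fixed period.
    public String readAccountStatus() {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
        WebElement status = wait.until(
                ExpectedConditions.visibilityOfElementLocated(By.id("account-status")));
        return status.getText();
    }
}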

Whenever possible, use this trick: instead of waiting for time, try waiting for events. Pausing for a pre-defined period should be the technique of last resort, applied only when there are no other ways of handling the synchronization.

Wait for events, not for time

David Wardlaw wrote a nice blog post with his thinking about the problem, documenting several ideas with increasing complexity. One of the very important things he noticed is that waiting for time is generally not a good idea since the background process can depend on many different factors. David wrote: “The timing can be affected by things like CPU usage and network traffic and you may get inconsistent test results”. This is especially true for test environments, which are usually underpowered, and sometimes shared across teams.

Stan Desyatnikov proposed a way to rephrase the waiting scenarios. “Don’t reflect delays (of the application under testing) in steps of scenarios… Wait for a particular condition to be met…”. Most of the responses to this challenge correctly identified some other condition for waiting, often related to the user interface. David Wardlaw suggested looking at the user experience, for example specifying the condition “When the account page loading progress bar is at 100%”.

Faith Peterson wrote a very detailed blog response with excellent ideas, proposing to “raise the level of abstraction and ignore the pause”. Faith explains it:

“This has worked for me when my primary interest is verifying the human-observable result of an integrated system, and I’m less interested in verifying internal operations or handoffs.”

I fully agree with Faith, but it’s important to note that this trick can work for a much wider set of contexts than just human-observable results. As Jonathan noted, people subconsciously take cues from a user interface, so it’s easy to think about human-observable results. But we can, and in most cases should, raise the level of abstraction even higher. Mathieu Roseboom looked at this problem from the perspective of relevance:

“The fact that a step needs to wait for a certain amount of time should not be part of the scenario in my opinion. It is mostly (there are always exceptions) not relevant for the business.”

A progress bar reaching a certain percentage or a page loading fully is definitely better than specifying a pause, but it’s still an implementation detail, not a core business requirement. Putting Faith’s idea of raising abstractions and Mathieu’s idea of relevance to work, Dave Nicolette suggested the following improvement to the first scenario in last week’s challenge:

Given a user has $200 worth of goods in their cart 
When the user completes the purchase 
Then ...

Dalibor Karlović suggested replacing the waiting statements in the second scenario with a meaningful business event, using a step such as:

And the account verification completes

Both these suggestions frame the waiting period in terms of an event from the business domain, not the user interface.

So how do we decide whether the condition for waiting should be described in terms of user interface elements, some more generic user-observable behaviour, or a business process? The answer lies in one of the most useful techniques for clarifying scenarios: focus on what, not on how.

Focus on what, not how

There are usually two dimensions of relevance in Given-When-Then scenarios and related tests. The first is the purpose of a test, the second is the process of testing. For example, in a specification of the registration process, observing the progress bar is not relevant for the purpose of a test. It may be relevant for the process of testing. On the other hand, in a specification for user interface interactions, the progress bar activity is relevant for the purpose of the test as well.

When scenarios focus on how something should be tested rather than what a feature should do, then the purpose of a feature is obscured, and the scenario depends too much on a specific implementation or system constraints. When the implementation changes, or when the code executes on a different system, tests based on those specifications often start to fail although there are no bugs in the underlying code. As a general guideline, try to keep the scenario definitions focused on things relevant for the purpose of the test, not on the mechanics of test execution. Move the mechanics into step implementations, into the automation layer.

Sometimes, the distinction can be very subtle. For example, Stan Desyatnikov proposed restructuring the second scenario from the challenge in the following way:

When a user registers successfully
Then the account page displays "Account approved"

In cases such as this one, it’s interesting to consider whether the account page is relevant for the purpose of the test, or whether it is just there to explain the process of testing something. We could rephrase the post-condition as just one line:

Then the user account status is "approved"

By specifying the wait condition in the terminology of the business domain, instead of test execution, we can postpone the discussion on test automation. The scenarios will be clearer and easier to discuss with business representatives. We can potentially optimise the execution later so it can go below the user interface, or even avoid asynchronous issues altogether.
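As a minimal sketch, assuming a hypothetical domain-level access point, a Cucumber-JVM step implementation for this post-condition could check the status directly, without going through the user interface:

import io.cucumber.java.en.Then;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class AccountStatusSteps {

    // Hypothetical domain-level access point, defined here only so the
    // sketch compiles; in a real project this would call an API or
    // query the application directly.
    interface AccountsApi {
        String statusForCurrentUser();
    }

    private final AccountsApi accounts;

    public AccountStatusSteps(AccountsApi accounts) {
        this.accounts = accounts;
    }

    @Then("the user account status is {string}")
    public void theUserAccountStatusIs(String expectedStatus) {
        // Assert against the domain, not against page elements
        assertEquals(expectedStatus, accounts.statusForCurrentUser());
    }
}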

Bring the interactions into the model

Time-based pauses in scenarios are not necessarily symptoms of the test process leaking into specifications. They could also be caused by a wrong model, which in turn causes bad user experience and makes the system error prone. Any web site that warns against pressing the back button during a background process is a good example of this. The worst offenders are airline ticket sites with warnings that you mustn’t close the browser window during payment. Why not? As if the user watching some web progress bar is magically going to improve payment approval rates. Connection problems are a fact of life on the Internet, especially with consumer applications. Wi-Fi signals drop, phone batteries die, and people close windows by mistake.

Showing a warning against something users can’t control doesn’t make the problem disappear. When an asynchronous process is fundamental to the purpose of the feature, not just to the process of testing, then we don’t want to hide it in step definition code. This is not accidental technical complexity, it’s a fundamental property of the problem domain. We want to expose such domain properties, so we can openly discuss them and define what should happen when the situation develops in a predictable but unwanted way.

René Busch suggested rephrasing the registration scenario in terms of events, such as the one below:

Given the process of registration is running
When the registration process completes successfully
Then the user gets a notification that the registration completed successfully
And the user can confirm that registration

Given the process of registration is running
When the registration process completes in error
Then the user gets a notification that the registration completed with failure
And the user can do/see ....

The original challenge had two example scenarios. The second was stuck in waiting purely because of badly described user interface constraints. But the first one, dealing with order approvals, is a lot more tricky. In the challenge post, I explained that “platform passes orders through a risk evaluation module before confirming”. This is a hint that there might be a fundamentally asynchronous problem in this domain. For example, although an automated risk evaluation module might take a few seconds most of the time, fraud prevention can also require escalating to a human investigator, which might take hours or days.

In such cases, the events and notifications are what we need to test, not how we’re testing something. They need to be explicitly defined and included in the scenarios. This allows us to specify examples when the risk evaluation is not yet complete, without worrying how long the process actually takes. Treating the process as fundamentally asynchronous allows us to design better user notifications, improve user experience, system operations and support.

If a risk review can take a while, there is a good chance that a user will want to view the status of an order during that process, or make new orders. Identifying such important domain events might lead to further refinement of the underlying software model, and asking more interesting questions. For example, is the risk assessment the only thing that happens before an order is confirmed? Perhaps there are other things that need to happen, such as checking the inventory. Alternatively, we can start asking questions around what happens when the order confirmation is pending, and when the order confirmation resulted in a negative response. This might lead us to discover some further events, such as the order being rejected. We might need some additional examples around that. Discovering domain events is key to modelling asynchronous systems correctly. For some further background on this, check out Alberto Brandolini’s work on Event Storming.

Similarly to how identifying user interface events (such as a button becoming visible) helps to automate tests better with UI drivers, identifying domain events is important because it allows us to introduce additional control points into the system. René’s example shows this nicely. Instead of actually running an external registration process, the test framework can just submit the appropriate results synchronously. Such tests will run faster and more reliably. We can also avoid the complexity of setting up data for an external system, and keeping it up to date as the external system changes. In the best case scenario, this can also turn a test case that was previously asynchronous into something that can run synchronously. Of course, it’s good to complement this with a proper integration test which shows that the system under test and the registration process communicate correctly, but this is a technical concern that should be handled outside Given-When-Then feature files.
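For illustration, here is a hedged sketch of such a control point for René’s scenarios, assuming the application exposes a seam through which a test can publish the outcome of the registration process as a domain event; all names are hypothetical:

import io.cucumber.java.en.Given;
import io.cucumber.java.en.When;

public class RegistrationProcessSteps {

    // Hypothetical seam the application exposes for tests, allowing an
    // outcome to be injected instead of running the real external process.
    interface RegistrationOutcomes {
        void publishSuccess(String userId);
        void publishFailure(String userId, String reason);
    }

    private final RegistrationOutcomes outcomes;
    private String userId;

    public RegistrationProcessSteps(RegistrationOutcomes outcomes) {
        this.outcomes = outcomes;
    }

    @Given("the process of registration is running")
    public void registrationIsRunning() {
        userId = "user-123"; // illustrative test data
    }

    @When("the registration process completes successfully")
    public void completesSuccessfully() {
        // No waiting: the test submits the result synchronously
        outcomes.publishSuccess(userId);
    }

    @When("the registration process completes in error")
    public void completesInError() {
        outcomes.publishFailure(userId, "external system unavailable");
    }
}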

Modelling time

Separating what we’re specifying from how we’re testing it is a great technique for deciding whether something belongs in scenario definitions or in step implementations. If what we’re specifying is the user interface, then user interface interactions should stay in the scenario definition. If what we’re specifying is a business process, and the user interface is just how we’re testing it, then user interface interactions should go into the step implementations. This also applies to time.

Most responses to this challenge correctly suggested moving away from specific time periods, but there are cases where the passing of time is actually what we’re testing. David Wardlaw had one such example in his post, specifying that user registration must complete within three minutes, otherwise the process should fail and users need to be notified about an external system not being available. Performance requirements, service level agreements and operational constraints often involve specific time limits, and in such cases the period itself needs to be visible in the scenario. This will allow us to discuss other examples, probe for hidden assumptions and change rules in the future more easily.
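One possible phrasing of David’s example, keeping the business time limit visible in the scenario rather than buried in automation code, might be:

Given a user submits a registration
When the registration does not complete within 3 minutes
Then the user is notified that the external verification system is not available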

However, even in such cases, it is usually wrong to implement the test mechanics by waiting for a period of time. With longer periods, such as checking that a regulatory report is generated at the end of every quarter or that transactions are reconciled at the end of the business day, people tend to avoid automating the tests at all because they don’t know how to handle the waits. With short periods, such as seconds or minutes, sometimes people are tempted to just block the test and wait it out. Please don’t. That makes testing unnecessarily slow.

If time is critical to an aspect of your system, model it in the domain and represent it as business time. Technically, this often involves creating a wrapper around the system clock, with methods to schedule and wait for events. Anything else that depends on time should not use the system clock, but connect to the business clock instead. That allows the test system to easily move forwards and backwards in time and prove complex business rules, without delaying the test execution. This is the right approach for dealing with time-based events, such as expiring unpaid accounts if more than two hours passed since registration, generating end of day or end of quarter reports, or generating monthly statements. We can easily move forward a whole month, or stop just one second short of a month start, to prove that the right things happened or did not happen.
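As a minimal sketch, such a business clock in Java could look like the one below; the names are illustrative, and real implementations usually also offer hooks to schedule and wait for events:

import java.time.Duration;
import java.time.Instant;

// A wrapper around time that the whole system reads instead of the
// system clock, so tests can move time without actually waiting.
public class BusinessClock {

    private Instant now;

    public BusinessClock(Instant start) {
        this.now = start;
    }

    // Everything that depends on time asks this clock
    public Instant now() {
        return now;
    }

    // Tests move time forward instantly, with no real delay
    public void advanceBy(Duration duration) {
        now = now.plus(duration);
    }
}

A test can then start the clock one second short of a month boundary, advance it by two seconds, and assert that the monthly statement was generated, all without waiting.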

When a business clock runs the show, there is only one definition of time, so everything can be synchronous to that clock. This often means that test automation turns into synchronous execution, which is much more reliable and resilient than working with asynchronous events.