Given-When-Then beyond features #GivenWhenThenWithStyle

Instead of posting a solution to the current topic, this week I’ll describe how to overcome some limitations of Given-When-Then tools, and when it’s OK to break the rules. (During the summer period, the challenges run every two weeks, to give you more time to respond.)

Early executable specification tools did not impose a lot of structure. Teams organized content in various ways, learning from mistakes. When Cucumber came along in 2008, the community already knew some common ways to succeed (or fail), so Aslak Hellesøy made the right choice to bake these ideas into the tool. Cucumber introduced just enough structure to prevent many usual mistakes, and this became the template for other Given-When-Then tools such as SpecFlow. Among other things, Cucumber introduced the idea of “feature files”, putting the focus on organizing executable specifications around features. This was great because it got people to think twice before structuring examples according to user stories or work task sequences. But it left three big open questions:

  1. How to capture specifications that are not about features?
  2. What to do with specifications that are not intended for automation?
  3. How to capture those that do not fit nicely into words?

In this post, I’ll give you some tips on how to handle those three types of scenarios.

Describe non-functional requirements using representative scenarios

In traditional business analysis jargon, overall system aspects such as performance, scalability and usability are called “non-functional requirements”. Informally, they are called “ilities”, after the “ility” ending that many of those names share. For those who prefer formal categorisation, there’s even a fancy ISO standard, ISO/IEC 25010:2011, which categorises them into eight main characteristics and a bunch of subcategories.

I always hated the name “non-functional”, because it tries to describe something by saying what it isn’t, kind of like saying “non-alcoholic” when referring to rat poison. The name is too vague and in many ways incorrect, since non-functional requirements usually translate into a lot of functionality (scalability and supportability are trivial examples of this, but even usability requires system functions). It’s perhaps better to think of such requirements as cross-functional, as they depend on a combination of functions rather than a single feature.

The risk for cross-functional requirements is spread across many parts, but we can only evaluate it on the system as a whole. A single feature passing performance tests doesn’t say much about the overall system speed, but that same feature working very slowly can kill perceived performance of the whole system. Because of that, the overall requirements don’t really belong to any single feature, but they should be considered when implementing almost all features. Given/When/Then tools usually push teams to organise documents by features, so it’s difficult to map cross-functional specifications. Unfortunately, a consequence of this is that the cross-functional requirements don’t get discussed as often as feature requirements, and they don’t get documented as consistently.

Many clarification techniques I’ve presented in this article series apply equally well to feature and cross-functional requirements. In fact, some of the higher-level system aspects can only be described through a series of representative scenarios. There are usually no binary right or wrong answers to confirm that they work. At the very least, the next time you need to deal with “ilities”, try the following three techniques:

  • start with always/never
  • move from simple examples through counter-examples to key examples (simple-counter-key)
  • ask an extreme question

To illustrate this more concretely, imagine a business representative and a developer discussing system performance:

  • B: “The system needs to be significantly faster” (cross-functional performance requirement)
  • D: “How much faster?” (don’t do this…)
  • B: “As much as possible” (… it will end up nowhere)
  • D: “Pretend it’s magic, and it’s already fast enough. Can you think of something that should never happen?” (start with always/never)
  • B: “A customer should never need to wait for a transaction confirmation!”
  • D: “Not even one microsecond?” (ask an extreme question)
  • B: “OK, joker, I see where you’re going with this. A customer should never need to wait for a transaction confirmation more than two seconds”. (simple example)
  • D: “Never? How about a slow wi-fi that we don’t control? Or if the payment provider is taking a while?” (counter examples)
  • B: “OK, not in those cases. But the rest of the time.”
  • D: “How about if we suddenly had 10x higher peak load than our busiest day last year? Should we completely re-engineer the system – this will take months – or do you want it to just work under current peak loads?” (extreme question)
  • B: “We don’t have months – this is for the next two weeks. Current peak loads are fine.” (key example)
  • D: “With current peak load assumptions, we could ensure that 95% of transactions get confirmed in less than 2 seconds in two weeks. The payment provider doesn’t guarantee more than 99%, so to exceed that mark we’d probably have to confirm before actually executing the payment. Not sure if that’s OK. If you want 99.9999% we’d have to give clients cable links to our offices. Where on that line would it be acceptable for you?” (another extreme question)
  • B: “Can we do 97% if I give you a month?” (key example)
  • D: “How about refunds? Same for those?” (counter-example)

Using these clarifying techniques, teams can identify scenarios that could fit into Given/When/Then format nicely for cross-functional requirements. More importantly, the resulting examples can help build shared understanding about cross-functional system aspects. I suggest documenting these examples in a structured format, and making them easily accessible.

One option, good if you plan to automate tests based on these examples, is to actually use the Given/When/Then scenario format. In this case, you’ll have to break the rules a bit, and misuse a feature file to describe something that is not a feature.

Feature: Payment performance

   Timely showing payment confirmations to clients is critical for 
   building trust and avoiding support problems.

   Scenario Outline: Purchase confirmations

     Purchase confirmations are the most critical payment type. We 
     want to confirm payments as quickly as possible, but we must 
     wait for third-party card processing. We also expect that some
     clients will experience longer delays due to network issues outside
     of our control. 

     The performance is expected to degrade if the transaction volume
     exceeds current peak levels (10,000 transactions per hour). 
     (Confirmed by James, 4th August 2020).

     Given the current transaction volume is <volume per hour>
     When a new purchase transaction is executed
     Then it should be confirmed within <period>, <percent> of the time.

     Examples:
     | volume per hour | period | percent |
     |          10,000 |     2s |     97% |
     |          10,000 |     5s |     99% |
     |          10,000 |    10s |   99.9% |
     |          20,000 |     5s |     97% |
     |          20,000 |    10s |     99% |
     |          20,000 |    15s |   99.9% |

   Scenario Outline: Refunds
    ...

You can keep these scenarios relatively simple, or grow them to cover various other aspects of the system. If the file grows, use a framing scenario: start simple and then show exceptions and complex cases. Once you have a proposed document, you can review it together with the people responsible for service level objectives and client-side service level agreements, to ensure that they are consistent. By the way, SLA and SLO documents can be a great way to start this discussion.

A structured specification such as the one above lets you automate tests in several ways. One potential option is to run a full performance test on a production-like test environment, simulate loads, and check if the numbers match. Another is to just monitor the actual production system over time, extract statistics, and compare the results every day. If the expected targets are no longer being met, send an automated warning so people can investigate and schedule work in the next iteration. The right automation approach depends on the level of risk you want to cover, and the consequences of bad performance.
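As a minimal sketch of the monitoring option, assuming confirmation times are already extracted from production logs (the function names, log format and target numbers below are illustrative, not from any real tool), a daily check could compute percentiles and compare them against the agreed targets:

```python
# Sketch of a daily performance check against agreed confirmation targets.
# Targets mirror the feature file: e.g. 97% of transactions within 2 seconds.

def percentile(sorted_times, pct):
    """Return the value below which roughly pct percent of samples fall."""
    index = int(round((pct / 100.0) * (len(sorted_times) - 1)))
    return sorted_times[index]

def check_targets(confirmation_times_s, targets):
    """targets: list of (percent, max_seconds) pairs, e.g. (97, 2).
    Returns the targets that were missed, with the actual value observed;
    an empty list means all targets were met."""
    times = sorted(confirmation_times_s)
    failures = []
    for pct, max_seconds in targets:
        actual = percentile(times, pct)
        if actual > max_seconds:
            failures.append((pct, max_seconds, actual))
    return failures
```

A non-empty result from `check_targets` would be the trigger for the automated warning mentioned above, so people can investigate and schedule work in the next iteration.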

Regardless of how you automate these tests, make sure to keep the cross-functional feature files in the same version control system as the remaining specifications, to ensure they are up to date and easy to access. I usually keep them close to the related features. For example, if you have a directory containing all the various payment processing feature files, add the payment performance specification there. Another common option is to create a separate directory for overall system specifications, and then put all the ‘ilities’ specifications there. If you feel very ironic, you can even call the directory ‘non-functional’.

Agree on scenarios even if you don’t intend to automate

If you do not intend to automate tests based on key scenarios, then there’s no need to be very formal and waste time capturing examples in Given/When/Then sentences. However, don’t throw out the conversation along with the automation.

Even if you know that you will never automate a test for some aspect of the system, make sure to discuss it. Capturing representative scenarios, challenging them and documenting the results of the conversation can be incredibly helpful over the long term, as it aligns everyone’s expectations.

As an example, consider usability. Most teams I’ve worked with as a consultant do not have strong alignment on what it means to improve usability. Their designers might make suggestions to change graphics, add user prompts or change workflows, but it’s often difficult to judge if the change actually had the desired effect. For a collaboration tool I work on, before a big redesign, we defined usability through representative scenarios. This was amazingly helpful as we could confirm or discard our hypotheses (and many turned out to be wrong).

The same analysis and clarification techniques you would use on functional scenarios apply to usability as well. Start with always/never to come up with initial scenarios, then challenge them. Ask “How would you know the system isn’t usable enough?” and think of some extreme questions. For our tool, we decided that the product just isn’t intuitive enough if a new user, who has never seen it before, cannot create and share a simple document in 10 minutes. We then defined what a “simple document” is, so the scenario is measurable in terms of the number of elements.

Next, provide counter-examples to evolve a structure, and extract key examples. We looked at how long it should take an experienced user to perform similar actions, and how increasing document complexity should affect the outcome. In the end, we created a list of bullet points similar to the one below:

  • A new user should be able to create and share a simple document in less than 10 minutes.
  • An experienced user should be able to create and share a simple document in less than 5 minutes.
  • An experienced user should be able to create and share a medium-complexity document in less than 20 minutes.

Although these scenarios do not use the words ‘Given/When/Then’, they follow the same sequence of information. It would be trivially easy to rewrite them into Gherkin, but for us that was unnecessary. Having these bullet points in an easily accessible wiki document is fantastic, as we can always review them when working on stories that might affect key workflows. The scenarios also define what is good enough, so we don’t end up gold-plating parts that are already solid.
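For teams that do prefer the Gherkin form, the rewrite is almost mechanical. A sketch of one possible translation of the bullet points above (the wording is mine, not from our actual wiki):

```gherkin
Feature: Document creation usability

   Scenario Outline: Creating and sharing documents

     Given a <experience> user
     When they create and share a <complexity> document
     Then they should complete the task in less than <time>

     Examples:
     | experience  | complexity        | time       |
     | new         | simple            | 10 minutes |
     | experienced | simple            | 5 minutes  |
     | experienced | medium-complexity | 20 minutes |
```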

When we were redesigning the app, I took this specification on the road. At conferences I met a bunch of people, so each time I asked a few whether they had seen our product before. If not, I asked them to help a bit with UX testing, instructing them what to try out, measuring the time and taking notes when they got lost in complexity. I’d usually also find a few people who knew the tool, and ask them to do the same. This was a very cheap way to learn, once we knew what to look for.

Ultimately, even a specification about usability could be automated. People might argue that it’s impossible to automate something, but in most cases with software it’s actually just prohibitively expensive, not impossible. One option would be to gather statistics from the production environment, expose only a small percentage of users to upcoming changes, and measure the difference in scenario completion between them and the remaining control group. Another option would be to engage a crowdsourcing service. For most teams, I suspect, the cost of continuous retesting would outweigh the benefits. Usability changes slowly and the risk of a single feature horribly breaking it is very low. For aspects such as that, having a good agreement on the specification with occasional manual inspection is normally enough. I would not create a very detailed specification around such scenarios, unless we wanted to automate tests based on them.
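To make the control-group option concrete, here is a minimal sketch, assuming session records with a `completed` flag already exist (the field names and data shape are assumptions for illustration):

```python
# Sketch: compare scenario completion rates between users exposed to a
# UI change and the remaining control group. Session records are assumed
# to be dicts with a boolean "completed" flag.

def completion_rate(sessions):
    """Fraction of sessions in which the user completed the key scenario."""
    completed = sum(1 for s in sessions if s["completed"])
    return completed / len(sessions)

def usability_delta(exposed_sessions, control_sessions):
    """Positive result suggests the change helped; negative that it hurt."""
    return completion_rate(exposed_sessions) - completion_rate(control_sessions)
```

In practice you would also want enough sessions in each group for the difference to be statistically meaningful, which is part of what makes this approach expensive.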

If it’s difficult to put in words, then show, don’t tell

Because Given/When/Then is text-based, that format is difficult to use for system aspects that are not easy to write down as sentences. Workflows are a good example. Simple workflows can be structured as a sequence of steps, but flows with many paths or complex branching become very difficult to read. Describing layouts, look and feel or animations with words is even more tricky, as the descriptions are usually not precise enough to ensure shared understanding.

In cases such as these, you can (and absolutely should) still clarify intent and build shared understanding through examples. Identify representative scenarios, challenge them and agree on key examples. Instead of focusing on Given/When/Then as an automation language, focus on Given/When/Then as an information pattern. Identify preconditions, actions and post-conditions, then find a good way to show the results instead of telling about them. Use diagrams to show workflows, wireframes to capture layouts, screenshots for look and feel and key frames to describe animations.

Sometimes you won’t even need to document these examples in any formal way. Collaborating on analysis, and using examples for discussion, might be enough to build shared understanding. Sometimes you may want to document the examples in a semi-formal specification, and sometimes you may actually want to go all the way and automate the tests. Remember that nothing in software is impossible to automate; it may just be very expensive. If an aspect of a system is critical for you and there is a high risk of it breaking frequently, the cost might be justified.

To show you what I mean, instead of just telling you about it, here’s an example from Video Puppet. This is a product I’m currently working on, which makes video editing as easy as editing text. Many features are difficult to specify precisely with words, because almost everything is visual. For example, transitioning from one video clip to another has to fade out the visuals and blend the audio. The previous sentence captures the intent, but is not nearly enough to describe all the complexity of resampling video formats, aligning sound sample rates, resizing and fitting graphics and gradually mixing in frames. I could keep adding a ton of text to explain each of these aspects, or I could just show it in a simple video. In cases such as these, the old adage about a picture and a thousand words is absolutely true.

Although automated tests on videos are still difficult, tests on images are easy. I ended up creating most of the feature tests by converting the resulting videos into images, which are easy to check against baseline versions. Each expected result image shows the key frames of a video, laid out into a storyboard, and the audio waveforms of both stereo channels below the storyboard. Here’s the expected result for the feature that adds background music to the whole video, automatically fading it in during the first scene and fading it out during the last scene. This example starts with three clips, so it can show how the background music effect builds up during the first one and closes down during the last one.

Any significant changes to video resampling, audio volume or sound and picture synchronisation will be reflected in this image. An automated test can compare the image reflecting the previous state with the current one, and easily show whether the feature still works, much better than hundreds of text-based specifications.
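A minimal sketch of such a baseline comparison, assuming the images have already been decoded into flat lists of 0-255 channel values (a real test would first decode the PNG storyboards, for example with Pillow; the tolerance values are illustrative, not the ones Video Puppet uses):

```python
# Sketch: compare a rendered storyboard image against a baseline, with a
# small tolerance so that harmless encoder noise does not fail the test.

def images_match(baseline, candidate, max_diff=2, max_bad_fraction=0.001):
    """baseline/candidate: flat sequences of 0-255 channel values.
    A channel may differ by up to max_diff, and up to max_bad_fraction
    of all channels may exceed that before the images count as different."""
    if len(baseline) != len(candidate):
        return False
    bad = sum(1 for a, b in zip(baseline, candidate) if abs(a - b) > max_diff)
    return bad / len(baseline) <= max_bad_fraction
```

The tolerance is the important design choice: an exact byte-for-byte comparison would make the tests flaky across codec versions, while too loose a tolerance would hide real regressions.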

Don’t be shy to build your own micro-tools for such specific purposes. If Gherkin, the language, is not good enough for something critical to you, use the Given/When/Then approach and figure out how to automate it in a different way. Focus on the purpose of the feature, and show it instead of telling about how you’d test it. For inspiration, check out my talk Painless Visual Testing from Agile Tour Vienna in 2017 where I show how we automated look & feel and layout tests in the Given/When/Then style, using a custom-built tool.

Stay up to date with all the tips and tricks and follow SpecFlow on Twitter or LinkedIn.

PS: … and don’t forget to share the challenge with your friends and team members by clicking on one of the social icons below 👇