When you’re in the very early stages of founding a startup, rapid iteration and finding product-market fit should be the first things on your mind. Unless you know exactly what you need to build, you should be writing code fast only to throw it away later. In these early stages, working on test automation doesn’t really fit in.
However, as you develop your product and find customers, you’ll need to build up your testing infrastructure or you’ll be losing out on customers and velocity. You also need to consider the context: you’ll be doing all of this while ramping up your engineering team, too! The biggest bang for the buck is to be found in end-to-end testing, but it’s also one of the hardest types of testing to get right.
This article is a journey of how we went from manual, mostly exploratory end-to-end testing to fully automated, stable, per-change testing at Alloy.ai.
Stability and time are the hardest constraints on end-to-end testing, in that order.
End-to-end testing of any system – whether a mobile application, an embedded device, or a SaaS web platform – is unstable by nature, because you’re not testing a closed system. Network connections are inherently unreliable, sensors in mobile devices don’t always work and are subject to interference, and simulated actions run at a different speed than those of a real end user. These open-system challenges persist in any realistic end-to-end testing environment.
On top of this, your software will have heisenbugs – those rare, unlikely-to-occur bugs which never seem to happen when you run the same test case manually. But when you’re running end-to-end tests at scale, these bugs will surface so frequently that they will cause your engineers to distrust the entire end-to-end testing infrastructure.
Time is the other key constraint – end-to-end testing is slow for a number of reasons. Your platform will be slow – certainly much slower than a mocked-out unit test. The network will be slow. The hardware you might be using could be slow or broken. And because of the stability issues, you might need to introduce a lot of waiting time into your testing code, slowing the tests down further.
All of this can make end-to-end testing feel brittle and not worth it. However, it is also the only way to test the actual end-user-facing behavior of your platform. If you don’t do it, you’re going to be looking at high bug counts, reduced engineering velocity and upset customers.
At Alloy, iterating to excellence is one of our core values. What that means is getting to value, fast – even if the first version isn’t perfect, it’s good enough if it produces real value. In that spirit, we set out to solve the end-to-end testing problem by breaking it apart into several sub-problems: defining what to test, getting the tests themselves to run stably, and setting up the CI pipeline and staging environment to run them in.
Below, we walk through how we got the tests running stably. The CI pipeline and the staging environment will be discussed in a separate post!
“You aren’t gonna need it” (yet) is one of those extreme programming practices and startup-culture adages that everyone has heard. There is little point in building out elaborate end-to-end testing infrastructure unless you know exactly what you’re building.
We started building out manual end-to-end test cases early on at Alloy. We began by writing down about 40 test cases and executing them by hand. The engineer in all of us might say that this is a terrible idea – we must automate this extra work away! But there are several good reasons to stick with manual tests at first.
Because you can’t build automated end-to-end tests without a well-defined set of test cases anyway, you might as well start by building out manual tests and defining a process where all engineers execute them periodically. This can feel counterintuitive – you’re a software company, after all, so surely this task should be automated – but it’s the first step in iterating to excellence. You’ll get there!
Here’s how we define our manual tests and the process around them:
We define tests using Google Sheets. There are dedicated manual-testing tools like TestLink, TestRail, XRay and so on that might be worth considering. The need for more elaborate tooling might present itself later, but it’s another thing you can scope out of an initial implementation.
What you shouldn’t scope out is writing clear preconditions, steps and expectations for your tests. When you’re writing a test case, don’t skip any steps the user would need to take – write every click and every bit of typing as its own line item. This gives you both clarity and a clear picture of how easy it is for the end user to accomplish an action.
Below is an example of a simple test, written in Google Sheets.
Each test has a name, priority, expected outcome and preconditions at a minimum. The steps need to be clear, as you’ll want the whole team to be able to participate in the testing effort – and, later on, in distributing the implementation of the automated tests.
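For illustration, a test case written this way might look something like the following (a made-up example, not one of our actual test cases):

Name: Filter a dashboard by state
Priority: High
Preconditions: User is logged in; a dashboard with a State attribute exists
Steps:
1. Open the dashboard from the dashboard list
2. Click the filter icon in the top bar
3. Click “Add filter” and select the State attribute
4. Type “CA” into the value field and press Enter
Expected outcome: The dashboard reloads and the top bar shows a “State: CA” filter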
As you’re defining manual test cases and their steps, you’ll find yourself writing a lot of repeating user actions. It’s tempting to simplify complex user actions into one-liner items, but keep in mind that the user is likely to be performing the actions over and over again, too. If you find yourself covering the same, long set of steps to perform a necessary action, there is probably room to improve the product itself! At Alloy, we’ve been able to identify areas of improvement by identifying these hidden complexities in user flows as we’ve defined test cases.
The example above doesn’t show it, but another thing we tracked early on was an additional column called Automation notes. While it might not make sense for a small company to build end-to-end testing infrastructure from the outset, it does make sense to start thinking early about how you’d do it, and to identify potentially problematic areas you’ll want to address once you move into automation.
After we had mapped out the manual tests we wanted to execute regularly – about a hundred or so – we started building out test automation with Cypress. Cypress is a fantastic tool for end-to-end testing. It records videos of test runs, and its test runner is Google Chrome, which pretty much every engineer is familiar with. It also records the actions it took on the DOM as part of the video, effectively giving you clear, visually debuggable records of failed test runs. It’s a great tool out of the box.
We use Selenium elsewhere at Alloy.ai for process automation, and it is a thoroughly battle-tested browser automation tool. We considered it as an option to limit the number of technologies in our stack, but went with Cypress because of the benefits above. When I first started using Cypress and realized that rendering test results to video also works flawlessly in CI with hardly any configuration, it was love at first sight!
Our Cypress tests are full end-to-end tests – we mock nothing out. We have a staging environment where the tests run, hosting its own replicas of the databases and other services our platform needs. The tests, then, needed to be written in a way that resembles human-readable test steps as much as possible.
Here’s an actual example of what we came up with:
it('can set two inclusive filters by pasting comma-separated text', () => {
  Cypress.App.createDashboardFromJson('e2e-chart-and-flat-table.json');
  Cypress.App.topBar.filterEditor.addInclusiveFilterByPasting('State', 'AK,CA,ID');
  Cypress.App.wait.untilReady();
  Cypress.App.topBar.getFilterSubtitle('State').should('contain.text', 'AK, CA, ID');
});
The above code snippet highlights the two key additions we developed on top of Cypress: an application abstraction layer (Cypress.App and its helpers such as topBar.filterEditor) that expresses tests as human-readable user actions, and a custom wait mechanism (Cypress.App.wait.untilReady()). More on these two below.
Often, you see test code that looks like this:
cy.get('.dropdown-content .option').click();
cy.get('.sidebar .container .selection span').should('contain.text', 'Ciao');
This test raises some questions. What does it do? What action is the user performing? There are some hints here, but it’s certainly not obvious. The user is clicking on some element called “option” inside what seems to be a dropdown, and after clicking, a span in a container element should contain the text “Ciao”. What these elements actually represent, we do not know. Now imagine an actual test case in a real app: the names are going to be confusing, there are going to be dozens of lines of CSS selectors, and you’ll end up having to run the test just to see what all of its parts do.
When you’re reading through a test, it’s most often because it failed. That means you have a problem that you need to solve, and understanding what the test means should be the least of your worries. A good test case should be self-documenting and easy to understand.
Reusability of testing code is important, too. Let’s say you have a top bar in your app that allows the user to navigate between experiences – you’d want to reuse the code that lets users perform those actions.
Finally, some actions are simple to the user, yet complex underneath. Consider a search bar – it might need to populate its content suggestions from the backend as soon as the user clicks on it. Behind this click, there is an asynchronous loading action involved, and you’d almost certainly want to encapsulate that kind of complex testing code behind a single function.
For the above reasons, something like this makes a lot of sense:
interface TopBar {
  getSearchBar(): SearchBar;
}

interface SearchBar {
  type(text: string): void;
  selectFirstAutocompletionSuggestion(): void;
  selectAutocompletionSuggestionByIndex(index: number): void;
}
This type of approach allows you to encapsulate the loading actions behind the implementations of type and/or selectAutocompletionSuggestionByIndex. When you’re implementing the test, it might now look like this:
Cypress.App.TopBar.getSearchBar().selectFirstAutocompletionSuggestion();
Cypress.App.SideBar.getSelectedDashboard().should('contain.text', 'Ciao');
This tells the reader clearly what is being clicked on, and what the expected outcome should be!
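To make this concrete, here’s a rough sketch of what a Cypress-backed implementation of these helpers could look like. The class structure and CSS selectors are invented for illustration – this is not Alloy’s actual framework code – but the idea of hiding selectors and waits behind the helpers is the same.

class SearchBar {
  // Root element of the search bar; the selector here is hypothetical.
  private root() {
    return cy.get('.top-bar .search-bar');
  }

  type(text: string): void {
    // Clicking the input kicks off an asynchronous fetch of suggestions,
    // so wait for the app to settle before typing.
    this.root().find('input').click();
    Cypress.App.wait.untilReady();
    this.root().find('input').type(text);
  }

  selectFirstAutocompletionSuggestion(): void {
    this.selectAutocompletionSuggestionByIndex(0);
  }

  selectAutocompletionSuggestionByIndex(index: number): void {
    // Suggestions load asynchronously; wait until the app is idle,
    // then click the requested suggestion.
    Cypress.App.wait.untilReady();
    this.root().find('.autocomplete-suggestion').eq(index).click();
  }
}

class TopBar {
  getSearchBar(): SearchBar {
    return new SearchBar();
  }
}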
We needed a custom solution for waiting because our software loads a lot of data on the fly, and we couldn’t guarantee when exactly these asynchronous actions would be complete. After we had finished writing the first iteration of Cypress tests, they mostly worked great, but every now and then an oddly timed asynchronous request would be delayed, causing issues.
We found that waiting for HTTP events is not entirely stable in Cypress. We used both cy.route().as() and cy.wait(), but found these interfaces flaky – the waits didn’t always happen, causing test execution to fail as it moved past the waits before the app was in the right state. We tried various workarounds but didn’t find one that actually made things stable. On top of that, cy.route() is not conceptually right for end-to-end tests, either. What you want is a wait condition based on the UI, just like a real user would use. And UI-based waits, in our experience, are rock solid in Cypress.
All of Alloy’s web app’s HTTP requests go through a middleware layer. Every single fetch request, be it GET, POST or DELETE, goes through a common code path in our codebase. This has allowed us to centralize things like authentication and error handling in the past. So all we did was add a simple counter, activeRequestCount, which is incremented when a new request starts and decremented when a request completes. We store this count in Redux, which is where our data lives, and a UI component, ActiveRequestIndicator, renders a visual representation of it – a progress bar for active requests, if you will.
ActiveRequestIndicator has one more trick up its sleeve – it debounces changes to the active request count when decrementing. That is, when the active request count reaches zero, the visual indicator is only removed a second or so later. This gives any asynchronous actions that happen after the data fetching is done enough time to complete.
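As a rough sketch – with the Redux wiring and naming simplified, and only activeRequestCount and ActiveRequestIndicator taken from our actual code – the pieces fit together roughly like this:

import {useEffect, useState} from 'react';
import {useSelector} from 'react-redux';
import {store} from './store'; // hypothetical Redux store module

// Reducer: a single counter of in-flight requests.
export function activeRequestCount(state = 0, action: {type: string}): number {
  switch (action.type) {
    case 'ACTIVE_REQUESTS/INCREMENT':
      return state + 1;
    case 'ACTIVE_REQUESTS/DECREMENT':
      return Math.max(0, state - 1);
    default:
      return state;
  }
}

// Middleware: every fetch in the app goes through this common wrapper.
export async function trackedFetch(input: RequestInfo, init?: RequestInit): Promise<Response> {
  store.dispatch({type: 'ACTIVE_REQUESTS/INCREMENT'});
  try {
    return await fetch(input, init);
  } finally {
    store.dispatch({type: 'ACTIVE_REQUESTS/DECREMENT'});
  }
}

// UI: render a marker element while requests are active, debouncing the
// removal so that follow-up asynchronous work has time to finish.
export function ActiveRequestIndicator() {
  const count = useSelector((state: {activeRequestCount: number}) => state.activeRequestCount);
  const [visible, setVisible] = useState(false);
  useEffect(() => {
    if (count > 0) {
      setVisible(true);
      return;
    }
    const timer = setTimeout(() => setVisible(false), 1000);
    return () => clearTimeout(timer);
  }, [count]);
  return visible ? <div className="active-request-indicator-item" /> : null;
}

The important part is that the indicator element’s presence in the DOM is a reliable, UI-level signal that the app is still busy – exactly the kind of condition Cypress waits on well.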
Finally, Cypress tests can then wait for active requests to finish, in addition to any other waits you might have in place. In our Cypress framework code, we express it like so:
export function waitUntilReady() {
  cy.get('.Loader:visible', {timeout: 60000}).should('not.exist');
  cy.get('.active-request-indicator-item:visible', {timeout: 60000}).should('not.exist');
}
And that’s it! Individual tests in Alloy’s Cypress suite use just this one waitUntilReady function, so test authors don’t need to think about which background requests will fire – only about when to wait, not what to wait for. Most important of all, this approach is stable.
By the time we got to writing automated tests, we already had them defined – we’d done all the work of outlining the test cases, prioritizing them, and sorting out the ones that are hard to automate. Writing tests became a matter of converting the manual test cases into Cypress ones.
We annotated which tests were automated and which weren’t, so that while the test creation effort was ongoing, we could maintain a single source of truth for all test cases.
Some of the manual test cases turned out not to be worth automating because the upfront effort was too large. For example, Alloy’s chart and map widgets can’t easily be tested automatically, as you’d have to rely on pixel-perfect positioning of the cursor and clicking on exact spots for an action to occur. There are far higher bang-for-the-buck test cases out there, so we quickly ruled these out of the first pass. Prioritizing which tests to write can make a huge difference in both stability and implementation time.
For most of the initial tests we built, we also built new pieces of the testing framework. This made the first tests slower to write, but made all subsequent tests much faster to implement, and more stable as well. For example, when we wrote the first test that lets a user edit a dashboard’s filters, we needed to add something like addInclusiveFilterByPasting() as a reusable helper method. Writing these helper methods in a reusable way from the start is a bit of work, but it’s necessary.
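As an illustration of the kind of helper we mean, a sketch of addInclusiveFilterByPasting() could look roughly like this – the selectors and interaction details are made up here, since the real helper is tied to our filter editor UI:

export function addInclusiveFilterByPasting(attribute: string, values: string): void {
  // Open the filter editor from the top bar (hypothetical selectors).
  cy.get('.top-bar .filter-editor-button').click();
  Cypress.App.wait.untilReady();

  // Choose the attribute to filter on.
  cy.get('.filter-editor .attribute-picker').click();
  cy.contains('.attribute-picker-option', attribute).click();

  // Simulate a paste by setting the input value directly and firing an input event,
  // then apply the filter.
  cy.get('.filter-editor .value-input').invoke('val', values).trigger('input');
  cy.get('.filter-editor .apply-button').click();
  Cypress.App.wait.untilReady();
}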
Aside from code reuse and maintainability, using reusable helper methods to execute user-facing actions also gives you predictable structure and consistency. We all follow existing patterns we see and hear when communicating with other people. This includes writing code, which is a form of communication, too. Creating a solid baseline of what tests should look like was really important for us. It has invisible and powerful effects, like reducing the amount of work you have to put into code reviews later, and improving the stability of the tests over time.
Setting up standards in terms of code style, architecture and patterns is all the more important as you introduce new components and new ways of doing things into your system. Cypress and end-to-end tests were a new addition to Alloy’s codebase, and setting up the test framework in a careful and measured way helped set us up for success.
While we’re really happy with the end result, a few things did not go as well as we’d hoped. The problems we ran into were less about technology and more about process.
We originally implemented Cypress e2e tests back in summer 2019. We spent about three engineering weeks on this – not a huge amount of time, but significant at our size. That effort was wasted, because we never got to a place where the staging CI was up and running and executing these tests. We also didn’t introduce a manual process to, for example, execute the tests periodically on developers’ machines. We could have gotten value out of the tests a year earlier, but failed to do so from a project management perspective. Simply put, we started too early.
The good part is that most of the key technology discussed here was part of that initial effort, which greatly reduced the hurdle of getting the tests working again – there wasn’t a lot of code rot. But we should’ve carried the work all the way through.
If we could do this again, we would introduce a manual process first and migrate step by step toward fully working automated tests.
Once we got the tests running mostly smoothly, we turned them on in CI. This worked well for the most part, but a few remaining test instabilities were not fully addressed. During one week, they caused many of the PR verification builds to fail, adding hours of waiting for a green build on new changes. We addressed this by introducing a rule: any e2e test failure that blocks the PR verification builds is automatically a critical bug, and we push a patch to disable the offending test while the critical bug is addressed. This worked really well and helped us fix the remaining issues.
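Mechanically, the “disable” patch can be as small as marking the test skipped (Cypress bundles Mocha, so it.skip works), together with a pointer to the bug:

// Skipped while the flaky behavior is investigated – tracked as a critical bug.
it.skip('some flaky end-to-end test', () => {
  // Test body left unchanged; it runs again once the bug is fixed
  // and the skip is removed.
});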
If we could do this again, we would have introduced the above rules upfront, at the same time as turning on the end-to-end tests on a per-PR basis.
Here’s a summary of our lessons learned. If you’re thinking of setting up e2e tests for your organization from scratch, this can be useful:
- Start with manual test cases and a lightweight process for executing them; automate once you know what you’re building.
- Prioritize which tests to automate, and skip the ones whose upfront effort outweighs their value.
- Carry the automation effort all the way through to CI – starting too early and stopping halfway wastes the work.
- Once tests run on every PR, treat any failure that blocks the build as a critical bug and disable the offending test until it’s fixed.
In a technology sense, invest in a self-documenting test framework with reusable, human-readable helpers, and prefer UI-based wait conditions over waiting on network events – that’s what made our tests stable.