Software testing is very valuable when done right, but can be a huge time sink when done wrong. The main error behind the latter is approaching testing as “one size fits all”: you might take a specific testing approach from a past project and apply it to a subsequent project without considering whether it still makes sense.
Under this falls any blanket policy (for example, “100% code coverage unit testing” or “test-driven development”) that doesn’t consider the specific type of test or project involved. Even within a project, some parts may warrant more thorough testing than others. A better approach is to match each type of software with the appropriate type and quantity of tests. In this article I’ll go over the three common types of tests, and how they apply to three common types of software.
Types of Tests
There are several common types of tests. These lie somewhat on a spectrum, although there is some overlap.
Unit Tests
Unit tests are the most common type of test, and some people will assume that’s what you mean if you just use the word “test” unqualified. Similarly, some people will say that unit tests should be the bulk of your tests, although I don’t think that is universally true either.
The first key feature of a unit test is that, well, it tests a single unit. This could be a class, module, or function; it should cover all the code related to one single purpose. Of course this is a bit subjective, but most languages have some concept of a class or module that provides a clear definition of a unit.
The second key feature is that it’s completely isolated. The code under test shouldn’t have any external dependencies also being tested. This often requires the use of mocks (a full discussion is beyond the scope of this article). There can be some disagreement as to which dependencies are okay to leave unmocked. Since the purpose of mocking is to make your tests more robust, reproducible, and fast, I would say it’s fine to leave unmocked anything that doesn’t significantly affect those attributes. For example, you don’t need to mock strlen(), since there’s no real chance it would break. But you would want to mock random(), since otherwise it would make your test nondeterministic.
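To make that concrete, here is a minimal sketch in Python, where roll_dice is a hypothetical unit under test and random.random stands in for random():

```python
import random
from unittest.mock import patch

def roll_dice() -> int:
    # Hypothetical unit under test: depends on randomness, which would
    # make an unmocked test nondeterministic.
    return int(random.random() * 6) + 1

def test_roll_dice_upper_bound():
    # Pin the random source to a fixed value so the test is deterministic.
    with patch("random.random", return_value=0.99):
        assert roll_dice() == 6
```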
A common misuse of unit tests is applying them to code with too many dependencies. Because this requires mocking out so many things, in practice you end up testing the mocks’ behavior more than your own code. Some might argue that your code should ideally be structured with few dependencies and loose coupling, but in practice that may not be easy to achieve (or you may be dealing with a legacy system).
Good uses of unit tests are things like standalone modules, especially those without side effects. For example, in a billing system you might have a module that computes the proper amount to charge based on some stateless input. This would be a perfect piece to unit test, and would let you ensure you handle all the possible types of billing without any dependencies on external systems. Conversely, if your code is not broken up that way, for instance if your billing system reads from an external db and then writes output back to the db, then a unit test is not a great fit. You would need to refactor the code, or use a different type of test instead.
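As a sketch of what that might look like (the billing module and compute_charge are hypothetical names), note that no mocks are needed at all, since the unit has no dependencies:

```python
import pytest

def compute_charge(plan: str, units_used: int) -> int:
    """Hypothetical stateless billing unit: returns the charge in cents,
    computed purely from its inputs."""
    rates = {"basic": 100, "pro": 80}  # price per unit, in cents
    if plan not in rates:
        raise ValueError(f"unknown plan: {plan}")
    return rates[plan] * units_used

def test_basic_plan_charge():
    assert compute_charge("basic", 3) == 300

def test_unknown_plan_is_rejected():
    with pytest.raises(ValueError):
        compute_charge("enterprise", 1)
```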
Integration Tests
Moving along the spectrum of testing we come to integration tests. These are a bit more nebulous than the other two categories since they sit in between them, and there can be disagreement about which category a given test falls into. Integration tests execute some real dependencies (they won’t mock out everything), but they still replace others to stay fast and robust.
To make the distinction less fuzzy, my rule of thumb is that integration tests should mock out external dependencies but execute internal ones, where internal vs. external refers to whether the dependency lives on the system the code is running on.
For example, if you have a To Do app and are testing the code that creates a new To Do item, you would want to mock out the external database (for example, with an in-memory database or a local db like SQLite). But you would keep your internal frameworks as-is: if you have some kind of ORM or other data access library, you want to execute it, and if you have validation logic, e.g. making sure the To Do doesn’t contain invalid data, you would want to execute that as well.
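Here is a rough sketch of that kind of integration test in Python, using the built-in sqlite3 module as the stand-in for the external database (TodoStore and its methods are hypothetical):

```python
import sqlite3
import pytest

class TodoStore:
    """Hypothetical internal data-access layer: executed for real."""
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS todos (id INTEGER PRIMARY KEY, title TEXT)"
        )

    def create(self, title: str) -> int:
        # Internal validation logic is exercised, not mocked.
        if not title.strip():
            raise ValueError("title must not be empty")
        cur = self.conn.execute("INSERT INTO todos (title) VALUES (?)", (title,))
        self.conn.commit()
        return cur.lastrowid

def test_create_todo_round_trip():
    # An in-memory SQLite db replaces the external production database.
    store = TodoStore(sqlite3.connect(":memory:"))
    todo_id = store.create("buy milk")
    row = store.conn.execute(
        "SELECT title FROM todos WHERE id = ?", (todo_id,)
    ).fetchone()
    assert row == ("buy milk",)

def test_empty_title_is_rejected():
    store = TodoStore(sqlite3.connect(":memory:"))
    with pytest.raises(ValueError):
        store.create("   ")
```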
Integration tests are arguably more useful than unit tests, and should be more prevalent, for most applications. Too many mocks can cause tests to miss real bugs, for instance those that arise when two modules have mismatched expectations of each other. And mocking out external dependencies keeps integration tests fast and robust enough to run frequently. This does depend on the type of project, though; I explore this more in the Types of Software section below.
End-to-End Tests
At the other end of the spectrum are end-to-end tests. These are essentially the opposite of unit tests: they test an entire system, well, “end to end”. This means there is no mocking; all dependencies are fully executed, and if there are external systems the test will actually communicate with them. Of course, you would want to set up a test instance of any persistent services, since presumably you don’t want your tests writing to your production database. But an e2e test will actually talk to the database server, just as your real application does. For something like a UI, an e2e test will actually render the UI in a browser or run the application, then interact with it, simulating what a user would actually do.
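For illustration, an e2e test of the To Do flow might look like this sketch, assuming Playwright for Python, a test deployment at a made-up URL, and made-up element selectors:

```python
from playwright.sync_api import sync_playwright

def test_create_todo_end_to_end():
    # Nothing is mocked: the browser drives a real test deployment,
    # which talks to its real (test-instance) database.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://staging.example.com/todos")  # hypothetical URL
        # Interact with the rendered UI the way a user would.
        page.fill("#new-todo", "buy milk")
        page.click("button#add")
        # The item should come back from the real backend and database.
        assert page.locator("li.todo-item").first.inner_text() == "buy milk"
        browser.close()
```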
As a result, e2e tests are much slower and more prone to sporadic failures than other types of tests. Because they must communicate with external services, they can fail due to transient issues such as timeouts or network problems that don’t reflect any errors in your code, and all of that communication adds significant overhead. Therefore you typically have far fewer e2e tests than other tests (if any), so they can’t be as thorough.
Yet e2e tests serve a critical role and should be considered for the most important parts of your application. Because they mimic real production use as closely as possible, they can detect issues that no other test can. For example, perhaps your database schema is misconfigured and doesn’t handle some edge case, or your UI renders a button where it’s no longer visible to the user. These types of issues are difficult to catch without e2e tests.
Types of Software
An aspect of testing that is often overlooked is how the type of software you are working on affects the types of tests you should write. In this section I’ll explore some common types of software and which types of tests (and how many) are appropriate for them.
Shared Library
A shared library is some shared piece of code that isn’t executed on its own, but is used by many other pieces of software. It might be open source, or it might be internal to your company. For example, React and boost are two popular open source libraries.
Shared libraries should primarily focus on unit testing, and lots of it. Failures in shared libraries have an outsized impact, since so much downstream software depends on them. And libraries tend to have few dependencies of their own, since they typically implement one specific kind of functionality, so it’s easy to mock those dependencies out. To ensure very thorough coverage, it would not be strange to have much more test code than implementation code, potentially even an order of magnitude more.
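Much of that volume comes from enumerating edge cases. As a sketch, a table-driven test for a small hypothetical library function can easily dwarf the implementation:

```python
import pytest

def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical library function: restrict value to [low, high]."""
    return max(low, min(value, high))

@pytest.mark.parametrize("value,low,high,expected", [
    (5, 0, 10, 5),    # in range
    (-1, 0, 10, 0),   # below range
    (11, 0, 10, 10),  # above range
    (0, 0, 10, 0),    # at the low boundary
    (10, 0, 10, 10),  # at the high boundary
    (7, 7, 7, 7),     # degenerate range
])
def test_clamp(value, low, high, expected):
    assert clamp(value, low, high) == expected
```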
Integration tests are generally less common for shared libraries, but they can be useful when there are more complex interactions: React, for example, would want thorough testing of the whole render pass, whereas boost is more a collection of separate utilities and has less need for that. End-to-end tests don’t really make sense for libraries.
Backend
Backend services are a very mixed set of applications performing all sorts of tasks, generally asynchronous, stateful, or data- and computation-heavy ones. Although not often thought of as belonging to this category, an API or web backend has similar testing needs, so you can include them here as well. With so much heterogeneity, the best way to test a service depends on exactly what it is for and how it is structured, but in general I believe integration tests are the most important.
Unit tests can also be helpful, especially for isolated parts such as helper code or specific business logic. However, focusing too heavily on unit testing for services can fail to catch issues resulting from interactions between modules. Ideally, each nontrivial API call should be tested with some key inputs (including edge and failure cases), and integration tests are a perfect fit for this type of testing.
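As a sketch of what that can look like, here is an integration test of a hypothetical endpoint using Flask’s built-in test client; the framework’s routing and the validation logic run for real, while no external services are involved:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/charge", methods=["POST"])
def charge():
    # Hypothetical endpoint: routing and validation are executed for real.
    body = request.get_json()
    if not body or body.get("amount", 0) <= 0:
        return jsonify(error="amount must be positive"), 400
    return jsonify(charged=body["amount"]), 200

def test_charge_success():
    resp = app.test_client().post("/charge", json={"amount": 500})
    assert resp.status_code == 200
    assert resp.get_json() == {"charged": 500}

def test_charge_rejects_bad_input():
    # An edge/failure case exercised through the full request path.
    resp = app.test_client().post("/charge", json={"amount": -1})
    assert resp.status_code == 400
```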
There is also room for end-to-end testing of services, for example running against a real deployment. These tests can be flaky, but they can catch things like performance issues or tricky race conditions that are often impossible to detect in normal test environments. Fuzzing can also help detect issues with edge-case handling.
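One lightweight way to get fuzzing into ordinary test runs is property-based testing, for example with the hypothesis library; this sketch fuzzes a hypothetical input parser with arbitrary strings:

```python
from hypothesis import given, strategies as st

def parse_quantity(raw: str) -> int:
    """Hypothetical parser from a request handler."""
    value = int(raw.strip())
    if value < 0:
        raise ValueError("quantity must be non-negative")
    return value

@given(st.text())
def test_parse_quantity_handles_arbitrary_input(raw):
    # For any string, the parser must either return a non-negative int
    # or raise ValueError; it must never fail in some other way.
    try:
        assert parse_quantity(raw) >= 0
    except ValueError:
        pass
```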
Frontend
Frontend software is any kind of application that users directly interact with. Nowadays this mainly means websites and mobile apps, although other surfaces still exist. In any case, the testing needs of frontend software are similar across surfaces.
This is where end-to-end testing truly shines. Because it’s very hard to mock out a human, it’s valuable to simulate actual human interaction in tests; if the user can’t click the button that finishes your payment flow, that spells very bad news for your app’s revenue. As a result, it’s critical to have e2e tests for all the key flows in your app. This typically includes sign-up, login, key interactions (writing a post, adding a To Do item, searching for products and adding them to a shopping cart), and any payment flows. Secondary flows can also be tested, but that’s less critical and depends on how much time you have to spend.
Other types of tests are valuable as well, depending on the specific makeup of the frontend codebase. For example, some frontends contain a lot of business logic, in which case unit or integration tests are key to ensuring it works correctly. Other apps do all of this server-side, and the frontend is a “dumb” wrapper, in which case these tests are not very valuable. It’s always important to consider what types of bugs the tests could actually catch.
Sometimes, when rapidly iterating, developers avoid writing tests for frontend code because such tests can be very sensitive to changes in UI, layout, etc. I think this is reasonable; something like test-driven development is not a good fit when developing software this way. Still, it’s important to revisit this when actually launching to users or customers: once the product is in use you are much less likely to make big changes (since they would be disruptive), and anything breaking is much more of a concern.
Putting It Together
You should now have a good idea of the pros and cons of the three types of tests (unit, integration, and end-to-end), as well as the differing needs of three common categories of software (libraries, backend, frontend). Other factors also have a big effect on your testing strategy, including:
What stage of its lifecycle your software is in. Early on, testing typically matters less because you are iterating rapidly, whereas mature software relies on testing more, since it’s both easier to break and more costly when it does break.
How much reliability matters. Compare working on software for a bank to a helpful but nonessential tool, or the primary use of your product to some secondary feature that’s used infrequently.
How much time and resources you have available. If you are at a startup that could run out of money at any time, focusing on testing might not be the best thing to be doing.
Whether you will be adding significant features or refactoring. Either of these justifies having better tests, since any large code change will likely break things otherwise. If you are just maintaining the status quo, extra tests are less important, since things rarely break without code changes.
Considering all these factors will help you come up with a specific testing strategy for whatever you are working on. This might include:
Which types of tests you are going to focus on, and in what amounts. This article should provide a lot of guidance here.
How thorough your testing should be. Often this is measured with a “coverage” metric, but that’s a crude measure, since it captures nothing about test quality (for example, it might be better to test critical code 10x and non-critical code 0x than each of them 1x). I prefer a more qualitative assessment, though that requires some experience to judge.
Whether development is test-driven (tests written first) or not. Some argue that code and tests should always be written together (in either order); I generally agree for critical code that can’t afford to break, but a lot of code isn’t of that type, so consider the tradeoffs.
Whatever strategy you choose, it’s best to get every engineer on the team on board with it. That way you can be sure you are writing the right quantity and types of tests to ensure the success of your project without creating an excessive amount of extra work.