Function tests in Bash and the errexit mode

#!/bin/sh vs #!/bin/bash, POSIX mode and the inconsistencies of errexit

Dávid Csákvári

5 mins

Leaking Bash options can cause serious headaches when unit testing shell scripts because sourcing a test file might completely change the behavior of the subsequent commands used in the test. A leaking set -e might be especially troublesome as it can cause the test case to terminate early. However, there's one more problem with it: in some cases set -e does not work with subshells.

Unit testing Bash scripts

Leaking Bash options and modularization challenges

How it affects testing

In unit testing it's common to invoke a function in a subshell to capture its output:

source myscript.sh   # 'import' function definitions
result=$(myfunction) # execute function
...                  # verify the output

In fact, most Bash testing frameworks, such as Bats or bash_unit, recommend this practice.

Because of inconsistencies in how the -e option is handled, a function executed this way may ignore -e entirely, even when set -e is declared in the sourced script.

Example

Here is an example where this problem surfaces.

#!/bin/bash

set -e

function getResponseCode() {
  local url=$1
  local result
  result=$(curl --fail -o /dev/null -s -w "%{http_code}\n" "${url}")
  echo "${result}"
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  getResponseCode "$1"
fi

The script queries a URL to print the response's HTTP result code. It is designed to terminate silently in case of errors, such as HTTP 404.

# A valid url returning HTTP 200
» ./query.sh "https://www.google.com"
200
» echo $? # check exit code
0         # 0 -> OK

# A URL that redirects to https://www.google.com
» ./query.sh "https://google.com"
301
» echo $? # check exit code
0         # 0 -> OK

# URL returning HTTP 404 not found
» ./query.sh "https://www.google.com/invalid-url-returning-404"
» echo $? # check exit code
22        # non-zero -> terminated with error

This behavior is achieved with two things:

At the beginning of the script, set -e enables Bash's errexit option, which terminates the script early when a command returns a non-zero exit code.
curl is called with the --fail flag, which makes it exit with code 22 when it encounters typical 4XX and 5XX HTTP responses.
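
How the two pieces combine can be sketched without network access. The snippet below is only an illustration: fake_curl is a hypothetical stand-in (not part of query.sh) that mimics curl --fail hitting a 404 by returning 22. The failing command substitution gives the assignment a non-zero status, and errexit then stops the script with that same status.

```shell
# Child script: errexit is on and a command substitution fails.
script='
set -e
fake_curl() { return 22; }   # hypothetical stand-in for "curl --fail" on a 404
result=$(fake_curl)          # the assignment takes the exit status of the substitution
echo "not reached: ${result}"
'

# Run the child script in a separate Bash process and record how it ended.
status=0
output=$(bash -c "$script") || status=$?
echo "exit status: ${status}, output: [${output}]"
```

Under these assumptions the child process should end with status 22 and print nothing, which matches the 404 run shown above.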

So far so good, works as expected. Let's write some function-level tests for getResponseCode.

Unit tests on the loose

Let's jump directly into the negative test case where things get interesting. As a start, I wrote the following using bash_unit:

test_getResponseCode_should_be_silent_on_404() {
  source query.sh
  result=$(getResponseCode "https://www.google.com/invalid-url-returning-404")
  assert_equals "" "${result}"
}

I expected this test case to fail: by sourcing query.sh, the test case inherits the script's errexit option, and I knew that getResponseCode would exit with 22. However, it failed for a different reason:

Running test_getResponseCode_should_be_silent_on_404... FAILURE
 expected [] but was [404]

😲 😱 😳

For some reason, in this test scenario getResponseCode behaved differently than when it was called during normal execution.

At first sight I suspected that maybe the test framework does something that negates the effects of errexit, but the problem persisted even when I removed it from the equation:

#!/bin/bash

source query.sh
result=$(getResponseCode "https://www.google.com/invalid-url-returning-404")
echo "${result}" # still printing 404 instead of failing early
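
The same outcome can be reproduced offline. This is a minimal sketch, assuming a hypothetical fake_curl function that imitates what curl does here on a 404: it prints the status code (the effect of -w) and returns 22 (the effect of --fail). The function body mirrors getResponseCode from query.sh.

```shell
script='
set -e
fake_curl() { echo 404; return 22; }   # hypothetical stand-in for curl --fail -w "%{http_code}\n"

getResponseCode() {
  local result
  result=$(fake_curl)   # returns 22, yet execution continues inside $(getResponseCode)
  echo "${result}"
}

result=$(getResponseCode)   # same call style as in the test case
echo "captured: [${result}]"
'

# Run with plain Bash (not POSIX mode) and record the outcome.
status=0
output=$(bash -c "$script") || status=$?
echo "exit status: ${status}, output: ${output}"
```

With a default Bash invocation the function keeps going after the failed call, so the caller receives 404 and the script finishes successfully.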

The cause

Puzzled, I posted this problem to r/bash, where experienced veterans revealed the truth.

This strange behavior is a quirk of Bash's errexit option: subshells spawned for command substitution do not inherit the -e option unless Bash is running in POSIX mode. The Bash manual states:

Subshells spawned to execute command substitutions inherit the value of the -e option from the parent shell. When not in posix mode, bash clears the -e option in such subshells.
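
One way to observe this is to look at the shell flags in $- from inside and outside a command substitution. The following is a small sketch, not a definitive test; it assumes a bash binary on the PATH and uses the --posix command-line option to switch modes.

```shell
# Child script: report whether the "e" flag is present in $- in both places.
probe='
set -e
[[ $- == *e* ]] && outer=on || outer=off
inner=$([[ $- == *e* ]] && echo on || echo off)
echo "parent=${outer} substitution=${inner}"
'

default_mode=$(bash -c "$probe")        # regular Bash
posix_mode=$(bash --posix -c "$probe")  # Bash in POSIX mode

echo "default: ${default_mode}"
echo "posix:   ${posix_mode}"
```

In default mode the substitution should report errexit as off while the parent has it on; in POSIX mode both should report on.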

Examining the Bash POSIX Mode manual revealed that POSIX mode is not the default unless the script is invoked as sh.

So what's going on here?

My original test case was executed via bash_unit. Here's how the source code of the test framework starts:

#!/usr/bin/env bash
...

Then I replaced bash_unit with a plain Bash script to load and exercise the function. The full source is at the end of the previous section, but the important detail is in the first line:

#!/bin/bash
...

Neither of these is invoked as sh... so could it be that POSIX mode was not enabled in these cases?

I replaced the first line of the plain Bash solution to invoke sh rather than bash...

#!/bin/sh
...

... and the snippet started to behave as I expected.
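
The mode switch behind this can be checked directly. The sketch below assumes that bash is on the PATH and creates a temporary link named sh pointing to it, so the same binary is started under both names. Note that this only applies where sh really is Bash; on systems where /bin/sh is a different shell, such as dash, changing the shebang runs another shell altogether.

```shell
# Ask a Bash process whether POSIX mode is active.
probe='if shopt -qo posix; then echo "posix on"; else echo "posix off"; fi'

# 1. Bash started under its own name.
as_bash=$(bash -c "$probe")

# 2. The same binary started through a temporary link named "sh".
tmpdir=$(mktemp -d)
ln -s "$(command -v bash)" "$tmpdir/sh"
as_sh=$("$tmpdir/sh" -c "$probe")
rm -rf "$tmpdir"

echo "invoked as bash: ${as_bash}"
echo "invoked as sh:   ${as_sh}"
```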

[Figure: side-by-side comparison showing the difference between #!/bin/bash and #!/bin/sh]

Such a small thing changes how Bash works big time. If you are interested in all 59 ways POSIX mode (or the lack of it) might change your script's behavior, just check the manual.

🙊 🙈 🙉

POSIX mode can be enabled explicitly with set -o posix. Alternatively, shopt -s inherit_errexit makes command substitutions inherit -e without switching to POSIX mode, but that option is only available in Bash 4.4 and later.
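
Both options can be compared against the default with a small experiment. This is a sketch under stated assumptions: the function f and the variable names are made up for illustration, and the inherit_errexit run needs Bash 4.4 or later.

```shell
# Common body: a function whose first command fails, called in a command substitution.
body='
set -e
f() { false; echo "still running"; }
out=$(f)
echo "captured: [${out}]"'

# Default Bash: errexit is cleared inside $(f).
plain_status=0
plain_out=$(bash -c "$body") || plain_status=$?

# POSIX mode enabled from within the script.
posix_status=0
posix_out=$(bash -c "set -o posix$body") || posix_status=$?

# inherit_errexit enabled from within the script (Bash 4.4+).
inherit_status=0
inherit_out=$(bash -c "shopt -s inherit_errexit$body") || inherit_status=$?

echo "default:         status=${plain_status} output=[${plain_out}]"
echo "set -o posix:    status=${posix_status} output=[${posix_out}]"
echo "inherit_errexit: status=${inherit_status} output=[${inherit_out}]"
```

Only the default run should get past the failing command; the other two should stop early with a non-zero status and no output.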

Summary

When I first came across the Use Bash Strict Mode article, I started using -e, as it seemed to bring the capabilities of Bash closer to a real programming language. This might be true for shorter scripts, but for anything just a little bit more complex its downsides start to appear.