How to Bash and jq: generate statistics for a REST API
Walkthrough for writing easy-to-read and performant scripts
Bash scripts are infamous for being hard to maintain, but with pipes and small functions the result can be terse, readable code. And it's not only about code quality: writing concurrent applications is a hard task in general, but in Bash it's easy to achieve by relying on pipelines. jq, a tool to query and manipulate JSON, fits nicely into this world.
In this tutorial, we'll create a script step by step to analyze content available on a REST endpoint.
The Goal
For the sake of this post, the backend application is a dummy blog engine running on localhost. You can find it on GitHub if you'd like to follow along. It provides the following API endpoints:
- /posts?pageNumber=<num> : List metadata for all posts in a paginated list: ID, title, and tags.
- /posts/<post_id> : Get all details of a post, including its full text content.
The goal is to calculate the percentage of Bash-related content compared to all posts in the system, based on character count. A post can be considered Bash-related if it has the "Bash" tag.
0. Getting started: a sneak peek at the endpoints
As a first step, query the posts available in the blog. To get the data, simply curl the endpoint:
» curl "http://localhost:8080/posts?pageNumber=0"
{"count":10,"currentPageNumber":0,"lastPageNumber":1,"posts":[{"id":1,"title":"The best Javascript frameworks in 2020","status":"draft","tags":["Javascript"]},{"id":2,"title":"Vim in 5 minutes","status":"published","tags":["Bash","Vim"]}, ... ]}
The response is unformatted JSON; everything is squeezed into a single line, so it's a bit hard to read.
To get a better view of the data, let's pipe the output of curl into jq to get some nice formatting:
» curl "http://localhost:8080/posts?pageNumber=0" | jq
{
"count": 10,
"currentPageNumber": 0,
"lastPageNumber": 1,
"posts": [
{
"id": 1,
"title": "The best Javascript frameworks in 2020",
"status": "draft",
"tags": [
"Javascript"
]
},
{
"id": 2,
"title": "Vim in 5 minutes",
"status": "published",
"tags": [
"Bash",
"Vim"
]
},
...
]
}
Because the API is paginated, the posts are wrapped in a container object.
Let's take another look, this time at the details of a single post:
» curl "http://localhost:8080/posts/1" | jq
{
"id": 1,
"title" :"The best Javascript frameworks in 2020",
"content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit...",
"tags": ["Javascript"]
}
Since this is just a dummy blog engine, all posts have generated Lorem Ipsum as their content. The only interesting part of this second response is the content field. Let's use jq to extract it:
» curl "http://localhost:8080/posts/1" | jq '.content'
"Lorem ipsum dolor sit amet, consectetur adipiscing elit..."
The argument passed to jq specifies a transformation, returning only the value found in the content field.
curl and jq do a fairly good job by default, but there are a few flags that might come in handy when they are used in a script.
Most flags can be set using a short form (e.g. -x) and a long form (e.g. --some-opt). For better readability, use the long versions in scripts.
Make curl calls more robust
By default, curl shows a progress bar for longer operations on its standard error output. Because of this, our script might also display it whenever curl is invoked under the hood. To make the output a bit cleaner, use the --silent (or -s) flag to enable silent mode. Note: the progress bar does not affect programs downstream in the pipeline, as it's printed to standard error, not standard output.
Using --silent has the undesired consequence of silencing error messages as well. To re-enable them, use --show-error (or -S).
To make curl follow redirects (the Location header sent with 3XX response codes), add the --location (or -L) flag.
In case of an HTTP error code, curl will just output the error document returned by the server. These error documents either have to be filtered out somewhere down the pipeline, or they might cause problems. To avoid this, pass the --fail (or -f) flag to make curl fail silently on server errors. Note that although curl will emit an error message about the HTTP error code on its standard error, the error message from the server will be suppressed, which can hinder debugging.
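Putting these together, a script-friendly curl invocation combining all four flags might look like this (the script below ends up using only --silent and --show-error against the local test server):

curl --silent --show-error --location --fail "http://localhost:8080/posts?pageNumber=0"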
Use compact and raw output for jq
jq outputs JSON in a nice, human-readable format. This is not convenient for pipelines, because the data is passed around line by line. In order to squeeze the result of a jq transformation into a single line, use the --compact-output (or -c) flag.
For example, let's try the .posts[] transformation to decompose the list of posts and print each element on a single line:
» curl http://localhost:8080/posts | jq --compact-output '.posts[]'
{"id":1,"title":"The best Javascript frameworks in 2020","status":"draft","tags":["Javascript"]}
{"id":2,"title":"Vim in 5 minutes","status":"published","tags":["Bash","Vim"]}
Another useful flag for jq is --raw-output (or -r). When you query a single string attribute, by default it will be returned in quoted form. Use this flag to omit the quotes:
» curl "http://localhost:8080/posts/1" | jq --raw-output '.content'
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla vulputate tortor ut diam rhoncus cursus...
1. Foundations
Before diving into the implementation, let's discuss some general, high-level decisions about the architecture of the script.
Use #!/bin/bash instead of #!/bin/sh
On many systems /bin/sh is just a symlink to /bin/bash, but this is not a given. /bin/sh represents the POSIX-compliant system shell, but under the hood it can be something other than Bash. If you only target Bash, from a testing point of view it's much easier to rely on it explicitly than to support anything /bin/sh might refer to:
#!/bin/bash
Use Bash Strict Mode
Bash is great because it can be used easily in an interactive terminal session. As such, it processes commands a bit differently than a typical programming environment. For example, when you type a bad command to your shell, you expect it to print the error and continue to accept further commands. This makes Bash an excellent tool for trial and error.
However, the very same ergonomic feature can cause quite some trouble when Bash is used as a programming language. Consider the following script:
#!/bin/bash
echooo hello #o-oh, here's a typo!
echo world
If you know other programming languages, you might expect that running it will print nothing but an error. But because Bash is lenient by default, it continues to execute commands even after an error occurs:
./hello-world.sh: line 2: echooo: command not found
world
To make life easier, start the script with the following:
set -euo pipefail
It adds some strictness to Bash to make it better suited for programming:
- exit immediately on failures
- exit immediately if an undefined variable is referenced
- make errors visible in pipelines
More information: Use Bash Strict Mode (Unless You Love Debugging).
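As a quick illustration of the pipefail part, compare the exit status of a pipeline whose first command fails (a minimal sketch to run in a throwaway shell):

false | true
echo $?   # prints 0: the failure of the first command went unnoticed

set -o pipefail
false | true
echo $?   # prints 1: the failure is now propagated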
Rely on Pipelines
Pipelines are one of the essential building blocks of shell scripts, connecting the standard output of the preceding command to the standard input of the following one. (Standard error can also be piped to the next command, but we don't do that in this script.)
first_command | second_command | third_command
Pipelines are beneficial:
- Readability. Combining commands can't be much simpler. No need for temporary variables; just define the transformations which will perform changes on the data flowing through the pipe.
- Concurrency. Each command in the pipeline runs as a separate process. When the pipeline handles multiple pieces of data, all stages can work at the same time.
You can read more about Unix pipelines on Wikipedia.
Use local for function-specific variables
If you can wire everything together with pipes, you don't need variables to store intermediate computation results. However, if you do have to define variables, annotate them with local. This ensures that the variable is only available inside the function and does not pollute the global scope:
local my_var="Hello"
2. Scrape all data
Now that we know what the blog's REST API looks like and have a basic idea about the structure of the script, let's create a script that collects the data from the server.
#!/bin/bash

# strict mode
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail

main()
{
    get_posts
}

get_posts()
{
    local response=$(curl --silent --show-error "http://localhost:8080/posts")
    echo "$response"
}

main
The var=$(...) form executes the given command and captures its standard output in a variable. It's preferred over the legacy backtick notation, as these expressions can be nested.
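For example, one $(...) substitution can be embedded in another without any escaping, which is awkward to do with backticks (a toy illustration):

parent=$(basename "$(dirname "/tmp/some/file")")
echo "$parent"   # prints "some"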
Note that Google's Bash style guide recommends putting the declaration and the assignment of local variables on separate lines, because local does not propagate the exit code of substituted commands. Since I'm not using the exit codes of substituted commands in this script, I've decided to ignore this advice for the sake of readability.
The main function seems superfluous for now, but in the future it will contain the high-level logic of the script. Because this is usually the most important part for the reader, I defined it as the first function in the script.
Currently, the output of the program contains only the first page of posts. Let's make it recursive to take care of the pagination:
#!/bin/bash

# strict mode
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail

main()
{
    get_posts 0
}

get_posts()
{
    local page_num=$1
    local response=$(curl --silent --show-error "http://localhost:8080/posts?pageNumber=$page_num")
    echo "$response"

    local last_page_num=$(echo "$response" | jq --compact-output '.lastPageNumber')
    if [ "$page_num" != "$last_page_num" ]; then
        local next_page=$((page_num + 1))
        get_posts "$next_page"
    fi
}

main
The new part checks whether the response is the last page; if it isn't, the function calls itself again for the next page.
Because there are 12 posts in the blog engine, the output looks like this:
{"count":10,"currentPageNumber":0,"lastPageNumber":1,"posts":[{ ...first 10 posts... }]}
{"count":2,"currentPageNumber":1,"lastPageNumber":1,"posts":[{ ...last 2 posts... }]}
3. Parse API response
We have all the data from the server, but the server returns a wrapper object with an array of posts for each HTTP request.
Also, each post in the response contains the title of the post, which is not needed for our calculation.
With jq it's easy to reshape the server's response to fit our needs:
main()
{
    get_posts 0 | parse_response
}

...

parse_response()
{
    jq --compact-output '.posts[] | {id: .id, status: .status, tags: .tags}'
}
The output of the script now only contains the important data, one post in each line:
{"id":1,"status":"draft","tags":["Javascript"]}
{"id":2,"status":"published","tags":["Bash","Vim"]}
{"id":3,"status":"published","tags":["Java"]}
{"id":4,"status":"draft","tags":["Terraform","AWS"]}
{"id":5,"status":"draft","tags":["Terraform"]}
{"id":6,"status":"published","tags":["Git"]}
{"id":7,"status":"published","tags":["Bash"]}
{"id":8,"status":"published","tags":["CORS","Security"]}
{"id":9,"status":"draft","tags":["Java","Security"]}
{"id":10,"status":"published","tags":["ZSH","Bash"]}
{"id":11,"status":"published","tags":["Java"]}
{"id":12,"status":"published","tags":["PlantUML","Javascript","Ruby"]}
4. Exclude draft posts
Amongst the posts, there are some not-yet-published draft articles. Let's filter them out:
main()
{
    get_posts 0 | parse_response | only_published
}

...

only_published()
{
    jq --compact-output '. | select( .status=="published" )
                           | del( .status )'
}
The jq command above chains two operations: the first suppresses non-published posts, while the second removes the now-irrelevant status property. Note that it would probably be better to keep this data until the very end to ease debugging; I removed the unnecessary properties to keep the output of the intermediate steps free from noise, hopefully making the post easier to follow.
Now the output looks like this:
{"id":2,"tags":["Bash","Vim"]}
{"id":3,"tags":["Java"]}
{"id":6,"tags":["Git"]}
{"id":7,"tags":["Bash"]}
{"id":8,"tags":["CORS","Security"]}
{"id":10,"tags":["ZSH","Bash"]}
{"id":11,"tags":["Java"]}
{"id":12,"tags":["PlantUML","Javascript","Ruby"]}
5. Add content length for each post
For each post, we have to get its full content from the second API endpoint. Until now, we just piped various commands together, but in this case, we have to execute a series of commands based on the id field of each JSON object.
For this reason, I use xargs, which is designed to build and execute command lines from standard input:
main()
{
    get_posts 0 | parse_response | only_published \
        | xargs -d'\n' -I '{}' bash -c "add_length '{}' ."
}
- The -I flag specifies the placeholder to be replaced by the content coming from the standard input.
- Specifying the newline delimiter explicitly prevents xargs from stripping quotes from its input (see: Why does xargs strip quotes from input?).
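To see these two flags in action outside the script, here is a toy example, unrelated to the blog API:

» printf '%s\n' 'one' 'two' | xargs -d'\n' -I '{}' echo "got: {}"
got: one
got: two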
With this, add_length will be called with each JSON object as an argument. Let's define it:
add_length()
{
    local post=$1
    local id=$(echo "$post" | jq '.id')
    local length=$(curl --silent --show-error "http://localhost:8080/posts/$id" \
        | jq --raw-output '.content' | wc --chars)
    echo "$post" | jq --compact-output '. + {length: '$length'} | del( .id )'
}
export -f add_length
In order to call a function with xargs, it has to be exported and invoked via a subshell. The subshell could be spawned with sh, but to be consistent with my earlier recommendations and to avoid incompatibilities, I explicitly call bash here.
With this change, the output now contains the length property for each post.
As an alternative to xargs, we could have used a simple loop that reads from the standard input. However, xargs can spawn multiple processes. Depending on your workload, this might come in handy to increase performance. In the following example, the number of processes is 4, specified by the -P flag:
xargs -d'\n' -I '{}' -P 4 bash -c "add_length '{}' ."
In my case, using 4 processes to query the content bodies resulted in roughly a 2x speedup in total execution time.
Now the output contains the length of each post:
{"tags":["Bash","Vim"],"length":494}
{"tags":["Java"],"length":694}
{"tags":["Git"],"length":344}
{"tags":["Bash"],"length":872}
{"tags":["CORS","Security"],"length":425}
{"tags":["ZSH","Bash"],"length":307}
{"tags":["Java"],"length":423}
{"tags":["PlantUML","Javascript","Ruby"],"length":408}
Before going to the next section, let's refactor this to be a bit more readable. The implementation details of calling xargs have leaked into the main function, so let's extract them into their own separate place:
main()
{
    get_posts 0 | parse_response | only_published | add_length
}

...

add_length()
{
    xargs -d'\n' -I '{}' bash -c "add_length_ '{}' ."
}

add_length_()
{
    local post=$1
    local id=$(echo "$post" | jq '.id')
    local length=$(curl --silent --show-error "http://localhost:8080/posts/$id" \
        | jq --raw-output '.content' | wc --chars)
    echo "$post" | jq --compact-output '. + {length: '$length'} | del( .id )'
}
export -f add_length_
I've renamed the original function to add_length_ and wrapped the xargs-related code in a separate function. Now the main function is short and concise again.
6. Aggregate the character counts
Now we have all the important data: the tags and the length of each post. By aggregating this data, we can calculate the total length of all the Bash-related and non-Bash-related posts.
Although jq's group_by function can group elements based on any arbitrary expression, I'll introduce another transformation before the grouping to precalculate the condition. This step replaces the tags array with a boolean that indicates whether the post has the "Bash" tag.
main()
{
    get_posts 0 | parse_response | only_published | add_length | add_isBash
}

...

add_isBash()
{
    jq --compact-output '. + {isBash: (. | any(.tags[] ; contains("Bash")))}
                           | del( .tags )'
}
With this, the output is as follows:
{"length":584,"isBash":true}
{"length":785,"isBash":false}
{"length":432,"isBash":false}
{"length":962,"isBash":true}
{"length":529,"isBash":false}
{"length":393,"isBash":true}
{"length":504,"isBash":false}
{"length":523,"isBash":false}
The last transformation is to actually perform the aggregation based on the isBash property.
So far, all jq operations were executed in streaming mode: they processed JSON objects immediately as they became available from the previous step or from the REST endpoint. To sum the lengths of all posts, jq has to be instructed to wait for all the data. This can be done with the --slurp (or -s) flag.
main()
{
    get_posts 0 | parse_response | only_published | add_length \
        | add_isBash | aggregate
}

...

aggregate()
{
    jq --slurp --compact-output 'group_by(.["isBash"])[]
        | map(.length) as $carry | .[0]
        | . + {lengthTotal: $carry | add}
        | del(.length)' \
        | jq --slurp --compact-output '.'
}
The transformation above performs the aggregation, producing two objects: a character sum for the Bash-related posts, and another for the rest of the content. For more information on how this function works, see this related Stack Overflow thread.
Because it emits two separate objects, I've piped the result into jq again to slurp everything into a single JSON array:
[{"isBash":false,"lengthTotal":2773},{"isBash":true,"lengthTotal":1939}]
7. Compile report
Now that we have all the data in place, it's easy to calculate the answer to the original question using bc, the command-line calculator.
main()
{
    get_posts 0 | parse_response | only_published | add_length \
        | add_isBash | aggregate | report
}

...

report()
{
    while read -r object
    do
        local non_bash_length=$(echo "$object" | jq -c '.[] | select( .isBash==false ) | .lengthTotal')
        local bash_length=$(echo "$object" | jq -c '.[] | select( .isBash==true ) | .lengthTotal')
        local total=$(echo "$non_bash_length + $bash_length" | bc)
        local ratio=$(echo "scale=4;($bash_length / $total) * 100" | bc)
        echo "Ratio: $ratio %"
    done < "${1:-/dev/stdin}"
}
I could have used xargs again to capture the standard input, but to demonstrate an alternative, I've used a while loop. It's a bit simpler, but lacks parallelism. The function captures the single line we pass to its standard input and uses jq to parse it as JSON and extract data from it. (Note: I've used the -c flag for jq instead of --compact-output to make the code snippet a bit more compact.)
Finally, it invokes bc to calculate the result:
Ratio: 41.1500 %
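To double-check the arithmetic, the bc expression can be replayed by hand with the numbers from the aggregated output above (1939 + 2773 = 4712):

» echo "scale=4;(1939 / 4712) * 100" | bc
41.1500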
Bonus: show interactive output
Although the script works nicely, it's totally silent while working. If the script took longer to finish, users would not get any feedback during the calculation.
Fork the pipeline with tee to display each post as it's being processed. Because the copy is written to /dev/tty (the controlling terminal) instead of standard output, the data flowing to the next stage of the pipeline is unaffected:
main()
{
    get_posts 0 | parse_response | only_published | add_length | tee /dev/tty \
        | add_isBash | aggregate | report
}
Summary, considerations
By relying on the pipeline concept, Bash scripts can be efficient and easy to understand. Adding jq to the mix brings powerful JSON manipulation to the table.
However, these scripts have their limitations. Although Bash is available in most environments, the tools invoked by a script might differ across platforms. These dependencies can make it hard to produce cross-platform code.
Bash is most efficient for integrating other tools, not for implementing large programs and algorithms. If the script starts to get too big, consider using a different language, for example Python. In any case, consider using a linter such as shellcheck, and if you are new to Bash, make sure to check out a good cheat sheet.