Interpretable Counting for Visual Question Answering


Learning to answer open-ended questions about images, a task known as visual question answering (VQA), has received much attention over the last several years. VQA has been put forth as a benchmark for complete scene understanding and flexible reasoning, two fundamental goals of AI. But, apart from accurate answers, what would these milestones look like in a successful VQA model? When it comes to the visual component, one line of thinking points to scene-graphs. In a scene-graph, an image is organized into the underlying objects, their attributes, and the relationships between object pairs. This kind of organization is very easy to understand because each element of the scene (‘person’, ‘dog’, ‘next to’, etc.) gets its own representation. Capturing the image this way could provide a fairly ideal scaffold for extracting information relevant to answering a given question.

Thinking about VQA this way brings another consideration into focus: interpretability. If the representation of the image is easy to interpret, can we use that property to build an AI that's easier to understand? A central idea here is that interpretability requires a reasoning process that is grounded in discrete visual representations and that preserves that discrete organization when forming an answer. Here, we zoom in on this notion of discrete representation and explore a subtask within VQA that has been met with particular difficulty: counting. Counting questions offer an ideal entry point for exploring networks built with interpretability in mind, such as the one we introduce here, which we call the Interpretable Reinforcement Learning Counter (IRLC).

Our model learns to count by enumerating relevant objects in the scene, formulated as the process of sequentially deciding which object to add to the count. We model how the choice to count one object affects the way other objects are used in subsequent steps and train the network using reinforcement learning (RL). For all our experiments, we represent the image as the set of object proposals generated from a pre-trained vision module. Figure 1 shows an example of IRLC in action. The inputs are the question and the object proposals from the vision module. IRLC uses the question to give each object an initial score, then cycles through picking the highest-scoring object and updating the scores based on the selection. When none of the remaining objects have a high enough score, the process terminates, giving a count equal to the number of selected objects.

Figure 1: IRLC in action. Each object proposal is given an initial score (shown in red); the transparency of each box is a function of the score: darker red = larger score. At each step, the object with the maximum score is selected, and we draw the chosen objects in blue. Then, the remaining object scores are updated based on the selection (notice how the unchosen, red boxes fade).

After training on counting questions, IRLC outperforms the other baselines we experimented with, including a model adapted from the current state of the art for VQA. In addition, IRLC points to the exact object proposals on which each count is based. In other words, it returns visually-grounded counts. Compared to the attentional focuses produced by a more conventional model, we argue that the visually-grounded counts of IRLC provide a simpler and more intuitive lens for interpreting model behavior.

Models and approach

Figure 2: The general architecture of the counting networks we experiment with.

Figure 2 shows the general pipeline we experiment with. There are 3 basic stages. In the first stage the image is processed (blue-shaded area), returning a set of object proposals and the estimated relationships between objects. Each object comes with a bounding box, telling us where in the image it is, and a vector encoding, which captures the features of the object. Anderson et al. (2017) trained an object detector as part of their work, which achieved state-of-the-art results in VQA; the detection outputs of their model are publicly available, so we use them here. In the second stage (green-shaded area), we use an LSTM RNN to get a vector encoding of the question. We estimate the relevance of each object proposal to the question by comparing the question and object embeddings using a scoring function, which is set up as a single feedforward layer in the network. In the third stage (red-shaded area), we use the outputs of the previous 2 stages to generate a count. We experiment with three strategies, each of which is defined by a distinct counting module.
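To make the scoring stage concrete, here is a minimal sketch of how a question encoding could be compared against each object encoding; it is not the paper's exact architecture (the real layer sizes and nonlinearity are unspecified here, and the single linear layer over the concatenated pair is an assumption):

```python
import numpy as np

def score_objects(q_enc, obj_encs, W, b):
    """Sketch of the scoring function: a single feedforward layer
    compares the question encoding with each object encoding,
    producing one scalar relevance score per object proposal."""
    scores = []
    for o in obj_encs:
        h = np.concatenate([q_enc, o])   # pair the question with this object
        scores.append(float(W @ h + b))  # single linear layer (assumed form)
    return np.array(scores)
```

Each counting module below consumes this vector of per-object scores in its own way.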

We set up our work this way for a few reasons. In particular, previous studies don’t train on or report performance on counting specifically, so we need to do those experiments ourselves. In addition, setting the experiments up this way lets us control for many of the little decisions that likely affect absolute performance. For example, by using object proposals from the pre-trained model to serve as the vision module, we ensure that all our counting models use the same visual features. We want to isolate the effects due specifically to the design of the counting module.

Figure 3: Counting architectures. The question module (left) converts the question and object embeddings to score vectors. Each counting module converts the score vectors to counts. Here, N denotes the number of object proposals. The text in red describes the size/dimensionality of the adjacent terms. Boxes with yellow outlines point to the values used during visualization. SoftCount converts each object into a scalar value and then counts by summing those values. UpDown (modeled after Anderson et al. 2017) gives each object an attention weight, α, and estimates the count from the attention-weighted object encodings and the question encodings. Each box labeled with an ‘f’ indicates a non-linear layer. IRLC computes a set of scores, κ, and iteratively selects an object and updates the scores using the interaction matrix, ρ. The interactions are calculated from the variables shown on the right.

The simplest of the three models we test generates a count by converting each object proposal into a fractional value (by applying a sigmoid function to the score value) and then taking the sum over all of them. Because each of these values is somewhere between 0 and 1, we call this approach SoftCount (Figure 3, middle-left).
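This strategy reduces to two lines of arithmetic. The sketch below, which assumes the scores are already computed, shows the idea:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_count(scores):
    """SoftCount sketch: squash each object score into (0, 1) with a
    sigmoid, then sum the fractional values to form the count."""
    return float(np.sum(sigmoid(np.asarray(scores, dtype=float))))
```

For example, two strongly positive scores and one strongly negative score yield a count very close to 2, since the sigmoid saturates near 1 and 0 respectively.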

We also implement a more typical style of VQA architecture, which we design after Anderson et al. (2017)--see also Teney et al. (2017). The key to this approach is that the score values are used to compute attention weights for each object proposal. This gives each proposal a weight between 0 and 1, such that all the weights sum to 1. Rather than take the sum of these values (as in SoftCount), we use the attention weights to compute a weighted average of the object encodings, as if combining all the object encodings into a single, representative object. We use the weighted average of the object encodings and the question encoding to estimate the count (which requires us to treat the count as a class in a classification problem). We refer to this architecture as UpDown (Figure 3, middle-right).
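The attention-pooling step can be sketched as follows; the count classifier that would consume the pooled vector and the question encoding is omitted, and the exact layer structure is an assumption:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def updown_pool(scores, obj_encodings):
    """UpDown sketch: turn scores into attention weights that sum to 1,
    then average the object encodings under those weights. A classifier
    over count classes (not shown) would take this pooled vector plus
    the question encoding."""
    alpha = softmax(scores)                                   # (N,) weights
    pooled = alpha @ np.asarray(obj_encodings, dtype=float)   # (D,) average
    return alpha, pooled
```

Note the contrast with SoftCount: here the weights are normalized to sum to 1, so they cannot themselves act as a count; the count is read out by classification instead.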

Lastly, we introduce Interpretable Reinforcement Learning Counter--or IRLC for short. With IRLC, we aim to learn how to count by learning what to count. We assume that each counting question implicitly refers to a group of objects within the scene that meet some criteria. In this sense, the goal of IRLC is to enumerate each of the objects in that group. This means that, when generating a count, IRLC must choose which specific object proposals it wants to include in the count. We model this process as a sequence of discrete choices.

Figure 3 (far-right) gives a schematic for the counting module of IRLC. Like the other two counting strategies, this one begins with the scores assigned to each object proposal and converts them into a set of selection probabilities, illustrated as κ. We select the object proposal with the largest κ value and add it to the counted objects. Importantly, we also model how each choice affects subsequent decisions. This is controlled by an interaction matrix, illustrated as ρ (which is the output of a feedforward neural network that takes the question, the object encodings, and their coordinates). This sets up a sequential decision process where each step has 2 phases. In the first phase, we pick which object to add to the count--call this object i. In the second phase, we take the ith row of the interaction matrix, ρ, and use it to update the selection probabilities--that is, we add the ith row of ρ to κ. Each object can only be selected once. We repeat this process until all remaining κ values are below a learnable threshold, at which point the decision is made to stop counting and the returned count is equal to the number of selected objects.
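The greedy (test-time) decision process can be sketched as a short loop; the fixed threshold of 0 stands in for the learnable stopping threshold, and treating κ directly as logits is a simplification:

```python
import numpy as np

def irlc_count(kappa, rho, threshold=0.0):
    """Greedy IRLC sketch. kappa: initial selection logits, shape (N,).
    rho: interaction matrix, shape (N, N); row i is added to kappa when
    object i is selected. Objects are picked greedily, each at most once,
    until every remaining logit falls below the threshold."""
    kappa = np.asarray(kappa, dtype=float).copy()
    rho = np.asarray(rho, dtype=float)
    selected = []
    available = np.ones(len(kappa), dtype=bool)
    while available.any():
        masked = np.where(available, kappa, -np.inf)  # already-counted objects excluded
        i = int(np.argmax(masked))
        if masked[i] < threshold:
            break                       # stop counting
        selected.append(i)
        available[i] = False
        kappa = kappa + rho[i]          # selecting i reshapes the remaining scores
    return len(selected), selected
```

The interaction update is what lets the model express, for instance, that counting one box should suppress overlapping duplicate proposals of the same object.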

Importantly, our training data doesn’t label the objects that are relevant to the question; it only gives us the final answer. So, all we can tell our model during training is how far the estimated count was from the ground truth. Because of this restriction and the discrete nature of the counting choices, we train using RL. Since κ can be interpreted as a set of selection probabilities, our approach is well suited for policy gradient methods. We specifically use Self-Critical Sequence Training, which involves stochastically sampling object selection decisions using the probabilities given by κ and comparing the outcome to what happens if objects are chosen greedily (as at test time and described above). The network is trained to increase (decrease) the probability of selection sequences that did better (worse) than the greedy selection.
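The heart of the self-critical update is the advantage term that scales each sampled sequence's log-probability gradient. The sketch below uses a negative absolute-error reward as a stand-in; the exact reward shape used in training (e.g., whether the count error is passed through a Huber-style penalty) is an assumption here:

```python
def self_critical_weight(sampled_count, greedy_count, target):
    """Self-critical baseline sketch: reward is the negative distance to
    the ground-truth count, and the greedy rollout's reward serves as
    the baseline. A positive return value pushes up the log-probability
    of the sampled selection sequence; a negative value pushes it down."""
    r_sample = -abs(sampled_count - target)
    r_greedy = -abs(greedy_count - target)
    return r_sample - r_greedy  # advantage scaling grad log-prob
```

Because the baseline is the model's own greedy behavior, sequences are only reinforced when sampling beats what the model would have done at test time, which keeps the gradient estimate low-variance without a learned value function.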

Counting in VQA

Typically, studies on VQA provide performance metrics for each of 3 question categories: ‘yes/no’, ‘number’, and ‘other’. For our purposes, we need to be more precise, because the ‘number’ category includes both counting and non-counting types of questions. To deal with this, we construct a new combination of the training data from the VQA v2 and Visual Genome datasets. We isolate the questions that require counting and whose ground truth answer is in the range of 0-20. For development and testing, we apply the same filter to the validation data from VQA v2. We set aside 5000 of those questions as our test split and use the rest for development (i.e. early stopping, analysis, etc.). We call this dataset HowMany-QA.


We trained each of the model variants described above on the training split of HowMany-QA. We rely on 2 statistics to evaluate the counts produced by the models. The first of these metrics is accuracy--the standard metric for VQA. This simply tells us how often the best guess of the model was correct, so higher is better. However, accuracy doesn’t capture the typical scale of the error, so we also measure how far, on average, each model’s answers fall from the ground truth. For that, we calculate the average squared error of the estimated counts and take the square root of that value (root mean squared error, or RMSE). Since RMSE measures error, lower is better. These values are shown in the following table:
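Both metrics are straightforward to compute from paired lists of predicted and ground-truth counts:

```python
import numpy as np

def evaluate_counts(predicted, ground_truth):
    """Compute the two evaluation metrics described above: accuracy
    (fraction of exactly correct counts; higher is better) and RMSE
    (root mean squared error; lower is better)."""
    p = np.asarray(predicted, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    accuracy = float(np.mean(p == g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    return accuracy, rmse
```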

Model Accuracy RMSE
SoftCount 50.2 2.37
UpDown 51.5 2.69
IRLC 56.9 2.39

A couple of things stand out. First, IRLC is substantially more accurate than the other two baselines (SoftCount and UpDown). Second, the two baselines trade off accuracy and RMSE. That is, SoftCount returns counts that are on average closer to the ground truth than UpDown (lower RMSE) but are less often exactly correct (lower accuracy). So, these two metrics capture subtle differences in the quality of the counting performance. This underscores the observation that IRLC performs strongly on both metrics.

Qualitative results

The quantitative performance of IRLC shows that a discrete decision process built on discrete visual representations is not only viable for counting in VQA but superior to alternative methods. But what about our intuition that such an approach should also simplify interpretation of model behavior? There’s no clear metric for saying whether one model is more interpretable than another, but Chandrasekaran et al. (2017) have made some progress in bringing this idea into the experimental realm. For now, we’ll form a more qualitative argument by borrowing a central idea from their work: that a key hallmark of interpretability is the ability to understand failure modes.

Figure 4: Failure cases in IRLC and UpDown. For the left example, the question asks about a subject that occurs often in the training data (“people”). In the right example, the question subject (“ties”) is rare in the training data. For IRLC, the boxes correspond to chosen objects. For UpDown, the boxes are shaded according to corresponding object’s attention weight (equal to α in Figure 3, middle-right).

Rather than provide an exhaustive set of examples for some failure mode, we illustrate a pair of failure cases that exemplify two trends observed in IRLC; these are shown in Figure 4. These two cases also led to failures in the UpDown model, so we include its outputs and attention focuses for comparison. IRLC has little trouble counting people (the most frequently encountered counting subject in the training data), but, as shown in the left example, it encounters difficulty with referring phrases (in this case, “sitting on the bench”) and fails to ignore 2 people in the foreground. In the right example, the question refers to “ties,” a subject for which the training data has only a handful of examples. When asked to count ties, IRLC includes an irrelevant object--the man's sleeve. This reflects the broader tendency for IRLC to have difficulty identifying the relevant object proposals when the question refers to a subject that was underrepresented during training. The nature of these failures is obvious by virtue of the grounded counts, which point out exactly which objects IRLC counted. In comparison, the attention focus of UpDown (representing the closest analogy to a grounded output) does not reveal any clear pattern. From the attention weights, it is unclear which scene elements form the basis of the count returned by UpDown.

We invite the reader to evaluate for themselves whether IRLC succeeds in providing more interpretable output than the alternative approaches. Below, we provide a broader set of examples and the outputs from each model (Figure 5). In addition, we illustrate how the sequential decision process evolves while IRLC counts for each of several examples (Figure 6).


Conclusion

We present an interpretable approach to counting in visual question answering, based on learning to enumerate objects in a scene. By using RL, we are able to train our model to make binary decisions about whether a detected object contributes to the final count. We experiment with two additional baselines and control for variations due to visual representations and for the mechanism of visual-linguistic comparison. Our approach surpasses both baselines in each of our evaluation metrics. In addition, our model identifies the objects that contribute to each count. These groundings provide traction for identifying the aspects of the task that the model has failed to learn, and thereby improve not only performance but also interpretability.

Citation credit

Alexander Trott, Caiming Xiong, and Richard Socher. 2017. Interpretable Counting for Visual Question Answering.

Figure 5: Example outputs produced by each model. For SoftCount, objects are shaded according to the fractional count of each (0=transparent; 1=opaque). For UpDown, we similarly shade the objects but use the attention focus to determine opacity. For IRLC, we plot only the boxes from objects that were selected as part of the count. (In the model diagrams in Figure 3, the values used to shade these boxes are outlined in yellow). The number above each image refers to the answer produced by the associated model.
Figure 6: Sequential counting of IRLC. At each time step, we illustrate the unchosen boxes in pink, shading each box according to its value of κ at that time step (corresponding to the probability that the box would be selected; see main text and Figure 3). We also show the already-selected boxes in blue. For each of the questions, the counting sequence terminates at t = 3, meaning that the returned count is 3. For each of these questions, that is the correct answer. In the example on the right, IRLC returns the correct count but includes 2 irrelevant objects and ignores 2 eggs. This is an example of a correct failure, which occurs when IRLC produces the correct answer but selects the “wrong” object proposals (as determined informally by eye). Identifying a correct failure is trivial with visually-grounded counts but would be extremely difficult otherwise. Importantly, compared to simply observing whether the answer was correct or incorrect, correct failures provide greater insight into what the model has and has not learned, which makes it easier to anticipate the settings where the model will fail.