Learning to Look Around

UW-Madison CS 639 Final Project
Declan Campbell, Robin Stauffer, Soham Dasgupta

Our Motivation


Humans interpret images by looking at a handful of key views, using that information to infer the contents of the entire image and to plan where to look next. Take the following image as an example, and note the order in which you process its contents:

Note how you read each of the captions (hopefully) in a predictable order. The largest text draws our attention first, followed by the two captions below it. This is because our knowledge of how to read (top to bottom, left to right) informs the order in which we look at the captions. Finally, we read the lowest line of text because, after reassessing the image, we realize that we missed the smallest caption at the top.
There is an abundance of psychological studies describing in detail how we choose where to look in scenes presented to us. In a seminal study, Yarbus (1967) asked participants to perform different three-minute tasks on the same image while their eye movements were tracked (see image below). Participants were first given three minutes to look at the photo with no task (1). The remaining patterns of saccades are associated with tasks in which the participants were asked to (2) estimate the material circumstances of the family; (3) give the ages of the people; (4) surmise what the family had been doing before the arrival of the "unexpected visitor;" (5) remember the clothes worn by the people; (6) remember the position of the people and objects in the room; and (7) estimate how long the "unexpected visitor" had been away from the family. (Yarbus, 1967)
From this image it is evident that how the participants moved their eyes around the scene depended very strongly on the given task goals. Furthermore, even when the participants were allowed to look freely at the image (1), their selected views still followed predictable patterns associated with the structure of the image. This sequential nature of view selection in human vision seems to underlie much of our ability to describe what we see, answer complex questions about scenes, reason about the contents of images, and more. Until recently, such tasks were thought to be possible only for humans, but over the past five years an abundance of deep neural network architectures have been proposed which perform them very well.
Network architectures and training paradigms have been proposed which can perform complex sequential image processing tasks like visual question answering (VQA), image captioning, visual reasoning, and more. Current models trained on these tasks typically leverage convolutional encoder components for image processing and recurrent components for text processing and generation. To give you a sense of how this style of network functions, we've provided an example image depicting how a VQA model processes images. (Anderson et al., 2018)
In the above image, the top row depicts the input images, the text below it shows the questions the model is tasked with answering, and the subsequent rows depict salience maps of the regions the network attended to at each step while generating its answer. It's important to emphasize that these models encode the entire contents of their input images while learning to sequentially attend to different elements of those encodings. While this allows these networks to perform exceptionally well -- often achieving state-of-the-art results in their respective areas -- there are also several drawbacks to this style of network. For one, these networks are probably not well suited as candidate process models of human sequential perception, given that they encode the entire contents of their input images; humans, by contrast, learn spatial attention over the regions at which they look. Furthermore, these networks are computationally expensive to apply to very large images, since they must learn to encode those images with a convolutional section that sufficiently reduces the scope of the sequential attention. This particular limitation makes them ill-suited for embedded robotics applications with high-resolution sensors and limited computational resources.
Such limitations have motivated some researchers to instead construct reinforcement learning models which are trained to sequentially select subregions of large input images. These models are more efficient than the previously mentioned approach because they do not need to encode the contents of the entire image. They were first described by Ramakrishnan et al. (2019), who implement such a sequential model to learn "look-around" policies which generalize well across tasks. Furthermore, these networks process images in a way much more similar to humans, making them better process models for human active perception. Shown below is a concrete example of how these networks process images:
Here, the leftmost column shows the ground truth images which the network must process, the middle column shows the views selected by the agent, and the final columns show the network's best guess as to what the original images looked like. This research inspired the project our group worked on this semester, where our goal was to learn about these models by implementing them from scratch, applying them in a novel context, and then evaluating how the networks behaved when trained to perform different tasks.

Approach


Starting the project, we initially planned to download and "tweak" the code released by Ramakrishnan et al. to match our own dataset and goals, but we soon discovered that their repository was poorly documented and that the code was not compatible with our project focus. This was mainly due to their emphasis on multi-view and panoramic datasets, whereas we were planning to use a human face dataset. We instead used their repository to help guide our implementation (specifically, making use of their policy gradient loss implementation and the Agent class structure). Using this as a rough outline, we implemented our policy network, data pipeline, and training/testing pipelines from scratch. For our dataset, we chose CelebAMask-HQ, a large dataset of high-resolution face images with categorical labels which we could later use for our experiments with the multi-task network. We decided to tackle the reconstruction task first and then move on to the multi-task policy later.

Implementation of Reconstruction Network


We task this first reconstruction network with reconstructing the input images using only information gained from the sub-views it selects from the larger images. Here's a quick description of how our network works (a code sketch follows the list):
  1. Model is presented with an image.
  2. The convolutional image encoding pathway encodes a (32x32) pixel subregion of the larger (512x512) image. On the first time step we fix that region to be the one in the upper left.
  3. Simultaneously, the proprioceptive pathway encodes a one-hot vector describing the location of the selected subregion.
  4. The image and proprioceptive encodings are concatenated and then integrated into what we call a "gestalt" representation: the agent's current internal representation of the entire image. ("Gestalt (n.): an organized whole that is perceived as more than the sum of its parts.") This gestalt representation is integrated over time using an LSTM layer.
  5. From the gestalt representation, the network then attempts to reconstruct the original image while also generating an action probability vector.
  6. An action is selected using these action probabilities in a manner which tends to pick the regions with the highest probabilities while still leaving room to explore the action space by occasionally choosing lower-probability actions.
  7. On the next step, the network is provided with input associated with this selected action and new reconstructions/action probabilities are generated.
  8. This process is then repeated for N iterations (we typically chose N = 10 for practical reasons).
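To make this concrete, here is a minimal PyTorch sketch of one step of the forward pass described above. The module name (LookAroundNet), the layer sizes, the 256-dimensional hidden state, and the assumption of a 16x16 grid of non-overlapping 32x32 views over the 512x512 image are illustrative choices for this sketch, not our exact implementation.

import torch
import torch.nn as nn

N_LOCATIONS = 16 * 16  # assumption: non-overlapping 32x32 views tiling a 512x512 image

class LookAroundNet(nn.Module):
    # One step of the recurrent view-selection agent (illustrative layer sizes).
    def __init__(self, hidden=256):
        super().__init__()
        # Convolutional pathway: encodes a single 3x32x32 sub-view.
        self.patch_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> 32x16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 64x8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, hidden),
        )
        # Proprioceptive pathway: encodes the one-hot view location.
        self.loc_enc = nn.Linear(N_LOCATIONS, hidden)
        # LSTM cell that integrates both encodings into the "gestalt" state over time.
        self.gestalt = nn.LSTMCell(2 * hidden, hidden)
        # Reconstruction head: decode the gestalt back to a full 512x512 image.
        self.decode_fc = nn.Linear(hidden, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64x64
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),    # 128x128
            nn.ConvTranspose2d(8, 4, 4, stride=2, padding=1), nn.ReLU(),     # 256x256
            nn.ConvTranspose2d(4, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 512x512
        )
        # Action head: one logit per candidate view location.
        self.action_head = nn.Linear(hidden, N_LOCATIONS)

    def forward(self, patch, loc_onehot, state=None):
        x = torch.cat([self.patch_enc(patch), self.loc_enc(loc_onehot)], dim=-1)
        h, c = self.gestalt(x, state)  # update the gestalt representation
        recon = self.decoder(self.decode_fc(h).view(-1, 128, 8, 8))
        action_logits = self.action_head(h)
        return recon, action_logits, (h, c)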
The convolutional decoder is trained at each iteration of view selection with the reconstruction error for that time step. This helps the large image decoder converge more rapidly than it would if we backpropagated only after all views were selected. Similarly, the action-probability-generating section of the network is trained using a policy gradient loss derived from rewards obtained at each iteration of view selection, which are in turn tied to that iteration's reconstruction loss. A sketch of this training scheme is given below.
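The following is a hedged sketch of this training scheme, assuming the LookAroundNet module sketched above. The extract_patches helper, the use of the per-step improvement in reconstruction error as the reward, and the batch-mean baseline are illustrative assumptions rather than our exact loss formulation.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def extract_patches(images, loc, size=32, grid=16):
    # Assumed helper: crop the 32x32 view indexed by `loc` from each 512x512 image.
    rows, cols = (loc // grid).tolist(), (loc % grid).tolist()
    return torch.stack([img[:, r * size:(r + 1) * size, c * size:(c + 1) * size]
                        for img, r, c in zip(images, rows, cols)])

def train_step(model, optimizer, images, n_views=10, grid=16):
    batch = images.size(0)
    state = None
    loc = torch.zeros(batch, dtype=torch.long, device=images.device)  # first view fixed to the upper-left region
    recon_losses, log_probs, rewards, prev_loss = [], [], [], None
    for t in range(n_views):
        patches = extract_patches(images, loc, grid=grid)
        onehot = F.one_hot(loc, num_classes=grid * grid).float()
        recon, logits, state = model(patches, onehot, state)
        # Per-image reconstruction error at this time step (trains the decoder every step).
        loss_t = F.mse_loss(recon, images, reduction='none').mean(dim=(1, 2, 3))
        recon_losses.append(loss_t.mean())
        # Reward: how much the latest view improved the reconstruction.
        if prev_loss is not None:
            rewards.append(prev_loss - loss_t.detach())
        prev_loss = loss_t.detach()
        # Sample the next view from the action probabilities (keeps some exploration).
        dist = Categorical(logits=logits)
        loc = dist.sample()
        log_probs.append(dist.log_prob(loc))
    # REINFORCE-style policy gradient with a batch-mean baseline; the final sampled
    # action has no observed reward, so zip() simply drops its log-probability.
    pg_loss = sum(-(lp * (r - r.mean())).mean() for lp, r in zip(log_probs, rewards))
    total = torch.stack(recon_losses).mean() + pg_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()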

Implementation of Task Network


Towards the end of the semester, we trained an additional network which was similar to the reconstruction network. Here, instead of reconstructing the image, we train the network to identify, at each time step, categorical features represented in the images. For instance, the network may be tasked with identifying whether the individual depicted in the image has earrings, is smiling, or is bald. The architecture was almost identical to that of the reconstruction network, but had a fully connected output layer in place of the convolutional decoder (see the sketch below). Unfortunately, due to a combination of factors, including the non-stationarity of action probabilities during training, the long time it took to train each network variant, and a group member contracting COVID-19, our efforts to train a functional task network were ultimately unsuccessful.
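As a rough illustration, the only architectural change is at the output: a small fully connected head producing one logit per attribute in place of the convolutional decoder. The attribute count of 40 (as in CelebA-style annotations) and the binary cross-entropy loss shown here are assumptions for this sketch.

import torch.nn as nn
import torch.nn.functional as F

class AttributeHead(nn.Module):
    # Replaces the convolutional decoder of the reconstruction network.
    def __init__(self, hidden=256, n_attrs=40):  # 40 binary attributes assumed (CelebA-style)
        super().__init__()
        self.fc = nn.Linear(hidden, n_attrs)

    def forward(self, gestalt_h):
        return self.fc(gestalt_h)  # one logit per attribute at each time step

def attribute_loss(logits, targets):
    # Supervised per-step loss; a per-step reward can again be defined as its improvement.
    return F.binary_cross_entropy_with_logits(logits, targets.float())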

Results


These are a few examples of the output produced by the reconstruction network. The green squares show which subregions of the images were analyzed at each time step, and you can see the image reconstruction improve at each of these time steps.
Our model was able to learn important factors of variation, most commonly succeeding at features like head angle, background color, and hair color. Some example images and reconstructions depicting how the network succeeded at representing these features are given below:
The reconstruction of the woman in the top-left image shows the correct head angle, most likely inferred from the two selected views that fall along the side of her face. Her face shape is also reconstructed correctly, since the selected subregions traced the side of her face.
This next figure shows reconstructed images of the same three women at different time steps, a broken-down version of the gifs shown above. The quality of these reconstructions for every image improves after each view is selected, suggesting that the network is meaningfully integrating information from separate views into a gestalt representation which captures the content of the overall image. In the center column of the figure, you can see hair color, skin color, and face angle become more accurate as each sub-view is selected. One clear example of a single sub-image that has an impact on the reconstruction as a whole can be seen in the third column of the figure, third row down. Here you can see that the new selected sub-image includes part of the woman's forehead and part of her bangs. The corresponding reconstructed image now shows bangs in that area!
As stated before, we do not have any results to share from our feature-labelling model, as its outputs are not yet accurate. We are experimenting with ways to combat this issue and discuss some possible solutions in the next section.

Problems Encountered


Our reconstruction model performs fairly well, but it seems to overfit to the "average" face. To counteract this, we hypothesize that switching from pixel-wise mean squared error to a perceptual loss computed on the features of a pretrained network (e.g., an Inception-based loss) may make reconstructions more semantically consistent with the target image, since it would push the model to emphasize feature similarity rather than pixel similarity. Pixel-wise mean squared error is simpler to implement, but it doesn't perform as well in the long run because it can easily fall into issues such as putting emphasis on semantically uninteresting but visually distinct aspects of the image (e.g. background variation).
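A minimal sketch of this idea is shown below. We use torchvision's pretrained VGG16 features purely as a stand-in for an Inception-based perceptual loss; the choice of backbone, the layer cutoff, and the omitted input normalization are all assumptions of the sketch.

import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen pretrained feature extractor (cut at an arbitrary mid-level conv block).
_features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _features.parameters():
    p.requires_grad_(False)

def perceptual_loss(recon, target):
    # Inputs are (B, 3, H, W) in [0, 1]; ImageNet normalization omitted for brevity.
    return F.mse_loss(_features(recon), _features(target))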
Our reconstruction model also seems to over-exploit rewarding views later in training, falling into local maxima in the reward landscape. One solution to this issue is to use explicit supervision early on in training. This might mean adding an additional loss term which pushes the action probabilities closer to the ground truth "best" action sequences under a globally optimal policy (sketched below), where that policy is determined by training an end-to-end (32x32) => (512x512) view-to-image decoder to find which views are most informative and useful for reconstructing the contents of the whole image. As stated before, our feature-labelling model is not yet able to produce accurate results. One theory we have as to why it is not working is that the action probabilities are non-stationary throughout training, making it difficult for the network to learn the spatial locations of features informative for different labels.
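A minimal sketch of the explicit-supervision term mentioned above, assuming hypothetical "teacher" view indices produced by such an offline search (we have not built this teacher; it is an assumption of the sketch):

import torch.nn.functional as F

def supervised_action_loss(action_logits, teacher_views, weight=1.0):
    # action_logits: (B, n_locations) logits from the action head.
    # teacher_views: (B,) long tensor of "best" view indices from the offline search.
    return weight * F.cross_entropy(action_logits, teacher_views)

# For example: total = recon_loss + pg_loss + anneal(step) * supervised_action_loss(logits, teacher)
# where anneal(step) decays toward zero as training progresses (another assumption).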

Future Research Using This Model


Once our models (both the reconstruction and multi-task networks) are working consistently, there are numerous applications we would like to explore. As mentioned in our initial motivation, we would like to train on more difficult tasks and combinations of tasks. It would also be interesting to run the reconstruction task and the feature-labelling task concurrently to investigate how the network manages to balance multiple goals. Will the model prioritize one over the other, or will both fall short equally? We are additionally interested in running this training policy on datasets that aren't as straightforward as faces to see whether it would still be able to reconstruct images accurately and perform tasks. Further interesting questions emerge here: will the network be able to encode and learn the spatial co-occurrence of features across image domains, or will this make the training process more unstable and slow? Ramakrishnan et al. (2019) provide some initial results suggesting that such policies generalize well across the tasks they examined, but they did not systematically evaluate how these policies may vary across image domains with different features and statistics. Clearly, there are many interesting extensions to this project that we hope to pursue in the future.

Citations


  • Ramakrishnan, S. K., Jayaraman, D., & Grauman, K. (2019). Emergence of exploratory look-around behaviors through active observation completion. Science Robotics, 4(30). [doi:10.1126/scirobotics.aaw6326]
  • Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press.
  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.   [doi:10.1109/cvpr.2018.00636]