The ability to selectively attend to relevant stimuli while filtering out distractions is essential for agents that process complex, high-dimensional sensory input. This paper introduces a model of covert and overt visual attention through the framework of active inference, utilizing dynamic optimization of sensory precisions to minimize free energy. The model determines visual sensory precisions based on both current environmental beliefs and sensory input, influencing attentional allocation in both covert and overt modalities. To test the effectiveness of the model, we analyze its behavior in the Posner cueing task and a simple target focus task using two-dimensional (2D) visual data. Reaction times are measured to investigate the interplay between exogenous and endogenous attention, as well as valid and invalid cueing. The results show that exogenous and valid cues generally lead to faster reaction times compared to endogenous and invalid cues. Furthermore, the model exhibits behavior similar to inhibition of return, where previously attended locations become suppressed after a specific cue-target onset asynchrony interval. Lastly, we investigate different aspects of overt attention and show that involuntary, reflexive saccades occur faster than intentional ones, but at the expense of adaptability.
The following videos demonstrate the model's behaviour in the Posner cueing task and the Overt attention task. The visual sensory input is on the left, and the visual prediction with target (red arrow) and covert focus (green dot) beliefs is on the right.
The model is cued endogenously for 50 steps, and after a cue-target onset asynchrony period of 100 steps the target is shown.
The model is cued exogenously for 50 steps, and after a cue-target onset asynchrony period of 100 steps the target is shown.
The model is presented a static object and moves the camera to focus it in the center of the image.
The encoder consists of a 3 × 3 convolutional layer (in channels: 3, out channels: 32), followed by four residual downsampling blocks (32→64, 64→128, 128→256, 256→512). A fully connected layer maps the 512-dimensional feature vector to 64 dimensions, followed by another producing a 2 × 8-dimensional latent space output. The decoder mirrors this structure, with a fully connected layer expanding 8 to 64, reshaped into a 512 × H/16 × W/16 feature map, followed by four residual upsampling blocks (512→256, 256→128, 128→64, 64→32) and a final 3 × 3 convolutional layer (in channels: 32, out channels: 3).
The VAE was implemented and trained in PyTorch on 240,000 32 × 32 × 3 images randomly generated in the Gazebo simulator. The latent space was disentangled with manual encodings of the sphere's image coordinates for each of the training images. Hyperparameters are available in the config file in the code.
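The spatial bookkeeping above can be traced without any framework code. The following sketch (function and variable names are illustrative, not taken from the repository) walks the feature-map shapes through the encoder: each residual downsampling block halves the resolution, so a 32 × 32 input reaches the 512 × 2 × 2 map implied by the 512 × H/16 × W/16 reshape in the decoder.

```python
def encoder_shapes(h=32, w=32):
    """Trace (channels, height, width) through the encoder described above.

    Illustrative sketch only; the real encoder lives in the repository code.
    """
    shapes = [(3, h, w)]           # RGB input image
    shapes.append((32, h, w))      # initial 3x3 conv, stride 1: 3 -> 32 channels
    for c in (64, 128, 256, 512):  # four residual downsampling blocks
        h, w = h // 2, w // 2      # each block halves spatial resolution
        shapes.append((c, h, w))
    return shapes
```

For the 32 × 32 training images this yields a final 512 × 2 × 2 map, i.e. H/16 × W/16, consistent with the decoder's reshape.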
The simulation environment used is Gazebo. All of the training data and experiments were collected and executed in the same environment, consisting of a camera model and a red sphere.
To begin a simple instance of the active inference agent run the following 4 commands in separate terminals:
ros2 launch gazebo_ros gazebo.launch.py world:=worlds/static_test.world
ros2 run camera_orientation turn_cam
ros2 topic pub /needs std_msgs/msg/Float32MultiArray "{data: [0.0, 0.0, 1.0]}"
Note: edit the first two elements for the cue position, and the third for the cue strength.
ros2 run aif_model act_inf
Usage:
- Enter advances the simulation by one step
- s sets the automatic step counter and advances the simulation by the given steps
- c runs the simulation for the set step count
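The three controls above can be summarized as a small dispatch function. This is a hypothetical sketch of the stepping logic, not the actual `act_inf` node code; the key names follow the list above.

```python
def handle_key(key, state):
    """Return how many simulation steps to advance for a given key press.

    state holds the automatic step counter set by 's'. Illustrative only;
    the real input handling is implemented in the aif_model act_inf node.
    """
    if key == "\n":              # Enter: advance one step
        return 1
    if key.startswith("s"):      # s<N>: set the automatic step counter
        state["auto_steps"] = int(key[1:])
        return state["auto_steps"]
    if key == "c":               # c: run for the previously set step count
        return state["auto_steps"]
    return 0                     # unrecognized key: do nothing
```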
To start the auto trial node for the Posner paradigm, or the overt attention trial, run the following 3 commands in separate terminals:
ros2 launch gazebo_ros gazebo.launch.py world:=worlds/static_test.world
ros2 run camera_orientation turn_cam
ros2 run aif_model auto_trial
You can edit the following arguments for the last command:
- trials: number of trials. Default is 1
- init: step duration of the initialization phase. Default is 10
- cue: step duration for the cue phase. Default is 50
- coa: step duration for the cue-target onset asynchrony phase. Default is 100
- max: maximum number of simulation steps after target onset. Default is 1000
- endo: boolean value indicating if trial is endogenous. Default is True, set False for exogenous
- valid: boolean value indicating if trial is valid. Default is True, set False for invalid
- act: boolean value indicating if action is enabled in the trial. Default is False, set True for overt attention
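With the default argument values above, one trial runs through four consecutive phases. The sketch below (names are illustrative, not repository code) shows how the phase durations compose into a per-trial step schedule.

```python
def trial_schedule(init=10, cue=50, coa=100, max_steps=1000):
    """Return (phase, start_step, end_step) tuples for one trial.

    Defaults mirror the auto_trial arguments: init, cue, coa, and the
    maximum number of steps after target onset. Illustrative sketch only.
    """
    phases = [("init", init), ("cue", cue), ("coa", coa), ("target", max_steps)]
    schedule, t = [], 0
    for name, duration in phases:
        schedule.append((name, t, t + duration))
        t += duration
    return schedule
```

With the defaults, the cue is shown from step 10 to 60, the cue-target onset asynchrony lasts until step 160, and the target phase runs for at most 1000 further steps.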
This research has been supported by the H2020 project AIFORS under Grant Agreement No. 952275.
BibTex Code Here