Robot Depth Estimation Inspired by Fixational Movements

Distance estimation is a challenge for robots, human beings, and other animals in their adaptation to changing environments. Different approaches have been proposed to tackle this problem based on classical vision algorithms or, more recently, deep learning. We present a novel approach inspired by mechanisms involved in fixational movements to estimate a depth image with a monocular camera. An algorithm based on microsaccades and head movements during visual fixation is presented. It combines the images generated by these micro-movements with the ego-motion signal, to compute the depth map. Systematic experiments using the Baxter robot in the Gazebo/ROS simulator are described to test the approach in two different scenarios and evaluate the influence of its parameters and its robustness in the presence of noise.


I. INTRODUCTION
The human visual-oculomotor system is a source of inspiration for solving problems pertaining to visual perception in robotics. Many species exhibit behaviors that require accurate depth estimation in their environments. In particular, primates solve this problem by the concurrent use of multiple estimators deriving from different visual cues [5]. Even so, the most popular sensors used in robotics nowadays to obtain this information are arguably RGB-D sensors, such as the Microsoft Kinect [13] and any of its variations [19]. They are typically based on the projection of a known infrared pattern; depending on the objects in front of the sensor, the deformation of this pattern is used to estimate the depth. In computer vision, a number of methods and algorithms have been established for determining the depth of a scene using a single camera or image. For instance, by fitting patches that determine the pose of planes in a single image, it is possible to generate the depth map from that image alone [30]. Also, from a stream of images, the depth map can be deduced if the velocity of the camera is known [22]. Recent results on obtaining structure from motion with a monocular camera are based on feature tracking and triangulation methods [24], [31]. In addition to these methods, novel deep learning approaches use complex neural network architectures to learn the correlation between an RGB image and its equivalent RGB-D image in an unsupervised way [8], [25]. All of these procedures have in common that they consider only the visual cues as inputs, ignoring the motion of the camera, and sometimes even computing it from the images. RGB-D sensors and deep learning techniques have certain drawbacks. In the case of the former, objects that absorb in the infrared range are not detected, and these sensors also have problems outdoors and with reflective and transparent objects.
From a more practical point of view, robotic manipulation in a confined space requires a streamlined design with a sensor in hand and, even though state-of-the-art RGB-D cameras are more compact, they are not an option compared with the fully integrated, built-in eye-in-hand camera we use in our experiments. In the case of deep learning, long training processes along with large and pertinent data sets are necessary; moreover, a number of specific problems arise when this technique is applied in robotics [9].
In humans, it has been suggested that the retinal image motion alone is not enough to determine the depth sign with reference to the fixation plane, and that the direction of the image movement relative to the observer's motion is decisive to obtain this depth sign [23]. A number of species use eye movements in coordination with small displacements of the head during the process of visual fixation to obtain depth information about the gazed scene [4]. The "fixation" process is anything but fixed: during maintained fixation, tiny intersaccadic eye movements around the gazed location are produced. A large fraction of this motion consists of smooth, seemingly random changes in eye position and of tiny fast oscillations: so-called ocular drift and ocular tremor, respectively [17]. Moreover, during maintained fixation, very small saccades (microsaccades) are generated with variable frequency and amplitude. Even though microsaccades and saccades exhibit similar motor characteristics and share a common neural substrate [16], there has been a long controversy over the visual functions of these movements. Recent studies show that these microsaccades are precisely directed and play a fundamental role in enhancing visual acuity [14].
These ideas from biology inspired earlier work in robotics on distance estimation based on the parallax produced by camera rotations [29] and on compensatory head/eye movements [18]. In later works, the concept was extended to depth estimation [1]. Although Antonelli et al. based their work on the coordination of the neck and the oculomotor system to maintain the fixation point, they did not consider microsaccadic movements [2], [3].
In this article, we build on our earlier work [7] on monocular depth estimation taking inspiration from mechanisms involved in fixational movements in humans and primates, namely, microdisplacements of the head and microsaccadic movements. The key idea is to consider the images after micro-movements as perturbations of the initial fixation image and use them, in combination with the ego-motion signal, to generate the depth map. Our preliminary results suggested that the approach was able to satisfactorily estimate the depth in a scene, and that microsaccades play an essential role in this process [7]. Here, we consolidate our procedures and expand our results. First, the mathematical model is presented in detail in Section II including its algorithmic implementation. Then, in Section III, a set of systematic experiments using the Baxter robot in the Gazebo/ROS simulator are thoroughly described, with the purpose of evaluating the approach in two different scenarios, and studying the influence of its parameters and its robustness in the presence of noise. The results are evaluated in comparison with the depth provided by the simulator as ground truth, and also with the detector algorithm for the Aruco visual markers located in the scenario. Finally, these results are discussed in Section IV.
II. MODEL

A. Model Hypothesis
The human fixation mechanisms are the source of inspiration for the proposed model, which rests on the following hypotheses.
1) During the fixation process, head-eye movements can be considered as perturbations around an initial pose. A complex set of coordinated movements implicating the head and the oculomotor system is generated in the fixation process [4]. The aim of these movements is to maintain the gaze point in spite of random displacements of the head and eyes during fixation.
2) So-called visual suppression occurs during microsaccadic movements [12], to the effect that visual information is accessible only in the intersaccadic gaps. In consequence, the fixation process can be regarded as a spatial sampling of images.
3) The main cue for the estimation of depth and 3-D perception is the optical flow produced by the observer's own motion; when the flow is not the result of external movements, optic flow and motion parallax provide consistent depth information even when other depth cues are not available [11].
4) The contribution of the ego-motion signal makes it possible to resolve the inherent ambiguity associated with the optical flow [15].

B. Mathematical Model
In the beginning, there is no information about the depth of the scene. We consider an initial gazed point. Then, the visual system moves to the initial fixation pose, and the image received by the visual system is taken as reference. After that, microsaccades and head movements start. These movements are produced only by the visual system, and the scene is considered static during the fixation process. Due to perceptual suppression, images are not considered during saccades. When a saccadic movement has just finished, the image received by the visual system is compared with the reference image, and the depth perception is updated with this new information. To simplify the problem, we consider a range of distances in the scene defined by a near plane Z_n and a far plane Z_f, both perpendicular to the Z axis of the visual system; depth perception takes place within this range. A schema of this behavior from a robotic point of view is shown in Fig. 2. When t = 0, the camera is posed to look at the gazed point. This is the starting point of the fixation process, with the initial reference image I_0. Given a point of the scene (P_0) that generates an intensity value in that image, its projection onto the image plane yields the pixel (x_i^0, y_i^0). The z-axis of the visual system is aligned with the gaze point, and Z_0 is the value of the depth at P_0, understanding depth here as the distance from the camera frame of reference {C_0} to the plane perpendicular to the camera z-axis that contains P_0.
After a head movement and a microsaccade (t = t), the gazed point has been displaced, and the new camera pose is aligned with this new gazed point. The depth with respect to (w.r.t.) the new camera frame {C_t} has changed. After the microsaccade, a new image I_t is obtained. The original P_0 is now P_t w.r.t. the new camera frame, and its projection on the image plane corresponds to a new pixel position (x_i^t, y_i^t). An optical displacement O_f = (S_x, S_y) has taken place in the image plane; this value can be estimated by computing the optical flow between both images.
Our aim is to determine the value of Z_0 that corresponds to the depth sensation at the fixation point. To reach this goal, we define several matrices and vectors in homogeneous coordinates. The pixel coordinates in I_t and I_0 are defined by the vectors

m_t = (u_t, v_t, 1)^T,   m_0 = (u_0, v_0, 1)^T   (1)

where u and v are expressed in the centered image coordinate system. We define a projection matrix K that is a function of the camera parameters, mainly of the focal lengths; to simplify the model,

K = diag(f, f, 1)   (2)

where f is the focal length of the camera. To work in homogeneous coordinates, we define matrices that depend on the depth value, H(Z) = diag(Z, Z, Z): one for the initial camera pose, H(Z_0), and one for the final pose, H(Z_t). Finally, regarding the roto-translation between the frames, we consider the angular variation small enough to approximate the rotation by I + M, where M is the skew matrix built from the angular variation W = (W_x, W_y, W_z) in each axis,

M = [ 0, -W_z, W_y; W_z, 0, -W_x; -W_y, W_x, 0 ]

and the translation T is given by the Cartesian difference between {C_0} and {C_t}. The roto-translation matrix RT is defined as the composition of I + M and T. If the ego-motion signal is known by means of T and M, the new pixel position m_t in the image plane can be computed from the reference image pixel position as

m_t = H(Z_t)^{-1} K [ (I + M) H(Z_0) K^{-1} m_0 + T ]   (3)

The value of Z_t can be obtained from the expression P_t = RT · P_0; taking the third (depth) component,

Z_t = [ (I + M) H(Z_0) K^{-1} m_0 + T ]_z   (4)

From (3) and (4), it can be concluded that m_t is only a function of the camera parameters (f), the ego-motion components {ΔX, ΔY, ΔZ, W_x, W_y, W_z}, and the initial depth Z_0.
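The projection chain of expressions (3) and (4) can be sanity-checked in a few lines of NumPy: back-project the pixel at the assumed depth, apply the approximate rigid motion, and re-project. The function names are ours, and the rotation uses the same small-angle approximation R ≈ I + M:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix M of the small rotation vector w = (Wx, Wy, Wz)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def predict_pixel(m0, Z0, f, T, w):
    """Predict the pixel m_t after a small camera motion, following (3):
    back-project m0 at depth Z0, apply rotation I + M and translation T,
    and re-project with focal length f. Returns (u_t, v_t, Z_t)."""
    u0, v0 = m0
    P0 = Z0 * np.array([u0 / f, v0 / f, 1.0])        # H(Z0) K^-1 m0
    Pt = (np.eye(3) + skew(w)) @ P0 + np.asarray(T, float)
    Zt = Pt[2]                                       # expression (4)
    return f * Pt[0] / Zt, f * Pt[1] / Zt, Zt
```

With zero motion the pixel is unchanged; moving the point 0.1 m closer along the optical axis magnifies its offset from the image center by Z_0/Z_t.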
When the scene is considered static, the apparent displacement of pixel m_0 in the image is originated only by ego-motion; therefore, denoting by m̂_t and m̂_0 the pixel (first two) components of m_t and m_0, the following must be satisfied:

m̂_t - m̂_0 = O_f   (5)

However, given that both the ego-motion and the optic flow O_f can have errors in their estimations, we can rewrite (5) as

m̂_t - m̂_0 = O_f + ε   (6)

where ε represents the accumulated error resulting from computing m_t using expression (3), and also includes the optic flow estimation error. m_0 is known, since it is the initial pixel position in the reference image, whereas m_t can be calculated from expressions (3) and (4). ε is a vectorial magnitude; thus, a cost function based on its module can be defined as

Γ(Z_0) = |ε|^2 = |m̂_t - m̂_0 - O_f|^2   (7)

If we assume that the value of Z_0 is not correct and that the errors in the optic flow components (S_x, S_y) and in the ego-motion estimation are approximately constant, the greatest contribution to the value of ε is the uncertain knowledge about Z_0. The optimum value Z_0* of the cost function defined in (7) could then be computed as

Z_0* = argmin_{Z_0} Γ(Z_0)   (8)

Deriving (7) w.r.t. Z_0, we obtain

dΓ/dZ_0 = 2 ε · (dm̂_t/dZ_0)   (9)

To implement (9), it is useful to write q = K^{-1} m_0 and P_t = (X_t, Y_t, Z_t)^T, so that the derivative of m̂_t w.r.t. Z_0 can be expressed as

dm̂_t/dZ_0 = (f/Z_t^2) ( Z_t [(I+M)q]_x - X_t [(I+M)q]_z , Z_t [(I+M)q]_y - Y_t [(I+M)q]_z )

From the above equations, it can be concluded that the derivative of the cost function depends only on the camera parameters, the ego-motion components, the measured optic flow, and the current estimate of Z_0.

[Algorithm 1 (listing not reproduced here): in each iteration, C ← HeadEyeMovement(); S, T ← getEgomotion(C_0, C); then, for i = 1 to h and j = 1 to w, the depth estimate of each pixel is updated; finally, t ← t + 1.]

Under these conditions, depth estimation has been converted into many independent optimization problems (one for each image pixel). This fact conditions the optimization method to use. Even though simple stochastic gradient descent (SGD) could solve it, it would be necessary to define a different learning rate for each optimization problem, since each pixel of the initial image is independent of the rest; this learning rate would likely depend on the real Z value corresponding to each pixel.
Consequently, gradient-based methods that work at a constant learning rate are discarded: the learning rate must be adapted in each iteration for each pixel. Another aspect to consider is the noise in the signals used for the gradient calculation. Due to the estimation method, the optical flow has an inherent variability, especially in areas lacking texture. In addition, the pose increment is estimated from proprioceptive data, which may also present some noise.
A gradient descent method that can deal with these two issues and successfully compute Z_0* is ADADELTA [32]. This algorithm is based on SGD with an adaptive learning rate, and it introduces several filters in the estimation of the gradient and of the second derivatives; these filters can reduce the influence of noise.
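As a sketch (not the authors' implementation), the per-pixel ADADELTA update of Zeiler [32] can be written as follows, with ρ and σ playing the roles the text assigns them (σ standing in for Zeiler's ε); the class name is ours:

```python
import numpy as np

class AdadeltaPixel:
    """Per-pixel ADADELTA update (Zeiler, 2012): one independent instance
    per optimization problem. rho is the exponential-decay coefficient of
    the running averages; sigma regularizes both square roots."""
    def __init__(self, rho=0.5, sigma=0.005):
        self.rho, self.sigma = rho, sigma
        self.Eg2 = 0.0    # running average of squared gradients
        self.Edx2 = 0.0   # running average of squared updates

    def step(self, grad):
        """Return the update dx to add to the optimized variable."""
        self.Eg2 = self.rho * self.Eg2 + (1.0 - self.rho) * grad ** 2
        dx = -np.sqrt(self.Edx2 + self.sigma) / np.sqrt(self.Eg2 + self.sigma) * grad
        self.Edx2 = self.rho * self.Edx2 + (1.0 - self.rho) * dx ** 2
        return dx
```

With ρ = 0.5 and σ = 0.005 (values used later in the experiments), this update drives a quadratic toy cost to its minimum in a few tens of iterations without any hand-tuned learning rate.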

C. Depth Estimation Algorithm
Algorithm 1 implements the above mathematical formulation, inspired by the fixation process. The starting point is the reference image (I_0) and camera pose (C_0) captured at the time of the initial fixation. Initially, no depth information is available; therefore, all pixels in the image are assigned the same value Z_n. Once the fixation process has begun, the movements of the head and the oculomotor system generate displacements in the image (I_t) and in the camera pose (C_t); these are the microsaccades used to carry out the sampling. The initial image I_0 is correlated with each newly obtained image I_t using the Lucas-Kanade method [21]. From there, the algorithm iterates over each image pixel, updating the gradient descent computation with the ADADELTA equations.
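The per-pixel principle of Algorithm 1 can be illustrated with a self-contained toy: given the measured flow at one pixel and the known camera translation, recover the depth that reconciles them. For brevity, this sketch replaces the ADADELTA descent with a brute-force search over the working range [Z_n, Z_f] and omits rotation; all names are ours:

```python
import numpy as np

def predicted_shift(Z0, m0, f, T):
    """Pixel displacement predicted by the ego-motion for a candidate depth
    Z0 (pure translation T, pinhole camera with focal length f)."""
    u0, v0 = m0
    P0 = Z0 * np.array([u0 / f, v0 / f, 1.0])   # back-projected 3-D point
    Pt = P0 + np.asarray(T, float)              # point in the moved frame
    return np.array([f * Pt[0] / Pt[2] - u0, f * Pt[1] / Pt[2] - v0])

def estimate_depth(m0, flow, f, T, Zn=0.1, Zf=2.0, steps=2000):
    """Depth minimizing the cost (predicted shift vs. measured flow) over
    [Zn, Zf]; a brute-force stand-in for the per-pixel ADADELTA descent."""
    Zs = np.linspace(Zn, Zf, steps)
    costs = [np.sum((predicted_shift(Z, m0, f, T) - flow) ** 2) for Z in Zs]
    return float(Zs[np.argmin(costs)])
```

Since the predicted shift is monotone in Z for a lateral translation, the cost has a single minimum at the true depth and the search recovers it up to the grid resolution.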
As the algorithm advances, the received information increases the sense of depth in the image corresponding to the initial fixation point. Ultimately, this increase in information is represented in the algorithm by the term G_t(i, j), which in turn depends on the cost function according to (9). Thus, if there is no optical shift between the current image and the reference image (I_t = I_0), there is no improvement in the depth estimation.
From the point of view of computational complexity, each pixel is visited once in each iteration, as shown in Algorithm 1, and the computations made on each pixel depend only on the state of that pixel in the previous step, the optical flow estimated at that point, and the camera displacement. Therefore, the asymptotic time cost of this part of the algorithm is O(N), where N is the total number of pixels in the image. Regarding the asymptotic spatial cost of the complete algorithm, it is necessary to store the resulting depth image, the optical flow components in each iteration, the initial image, and the current image; therefore, the spatial cost is 5N, i.e., also O(N). The algorithm is amenable to parallel computing, since each pixel is independent of the previous and current states of the rest of the pixels; this allows it to be implemented using parallel computing techniques on either GPUs or CPUs.

D. Algorithm Parameters
As can be seen in the description of Algorithm 1, a number of parameters must be set for its proper execution: Z_0 is the initial distance assigned to all image pixels; ρ acts as the coefficient of a low-pass filter for the gradient adaptation and its derivative; and σ regulates the gain of the gradient variation in each step. Because gradient descent techniques do not differentiate between local and global minima, the selection of these parameters is important to obtain good-quality results.
In addition, if the span of the work area is known, the limits of the search can be defined a priori; if the sought minimum lies outside these limits, the algorithm will not converge. Also, the noise factor affects its performance to the effect that the values it generates may be outside these limits. In such cases, it is necessary to define an action policy for the pixels in which this phenomenon occurs.

III. EXPERIMENTS

A. Experimental Setup
Evaluation tests are carried out with the Baxter robot in the Gazebo/ROS simulator. Given the degrees of freedom of Baxter's head, it is not possible to replicate the movements of the primate oculomotor system with it. Instead, we use the robot's 7-DOF arm with an eye-in-hand camera.
Although some robotic systems described in the literature could perform this task correctly [18], [28], the design of the experiments on this specific platform was developed in the context of the RoboPicker project [6], which calls for a low-cost robot and manipulation in a confined space; this will also allow for future experiments with the real system. The basic function of the arm in our experiments is to move the camera in such a way that it maintains its orientation and positions itself in the same way that a human eye would perform fixational movements. Baxter's wrist camera can be configured in several ways; we chose a resolution of 900 × 600 pixels and a focal length of 405.7, and we set up the same parameters for the camera simulation. In addition, we added white Gaussian noise to the image in order to introduce uncertainty into the optical flow computation; this value is common to all performed experiments and is equal to 0.01 pixels.
The space of movements for the camera is specified as a sphere defined by two parameters: 1) the central point and 2) the radius of movements r_m (see Fig. 1). r_m is considered constant in order to reduce the number of experiments, with a value of 0.015 m, in accordance with the order of magnitude in the experiments of Aytekin and Rucci [4] (these authors suggest that r_m is not uniform and that its value depends on the distance to the fixation point). A controversial point in the literature is the maximum radius of a microsaccade. Some studies set this value between 1° and 2°; however, most microsaccades have a magnitude smaller than 0.5° for many tasks [27]. The parameter r_g is defined by the microsaccade amplitude, as shown in Fig. 1. Taking an amplitude of 0.5°, r_g varies with the fixation point distance d as r_g = d · tan(0.5°) (14).
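Under the tangent relation the text suggests (our reading of (14); the paper's exact expression may differ), r_g can be computed as:

```python
import math

def gaze_radius(d, amplitude_deg=0.5):
    """Radius r_g of the gazed region at fixation distance d (metres) for a
    microsaccade amplitude in degrees: r_g = d * tan(amplitude)."""
    return d * math.tan(math.radians(amplitude_deg))
```

For the three fixation distances used later, d = 0.3, 0.6, and 0.9 m, this gives r_g of roughly 2.6, 5.2, and 7.9 mm, i.e., r_g grows linearly with the fixation distance.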

B. Experimental Procedure
Based on the fixation process, we define a procedure that is used in all experiments.
1) An artificial scenario is placed in front of the wrist camera of the Baxter robot [Fig. 3(a)], and a starting point of the camera for the fixation process is selected.
2) For simplicity, three distances are selected for the fixation point in the scenario, all of them on the same Z axis from the camera. In addition, the microsaccade radius r_g is computed as a function of that distance (14).
3) The initial image I_0 and pose C_0 are saved.
4) The camera starts to move randomly within a sphere of radius r_m, maintaining the fixation point projected onto the image plane within the circle of radius r_g.
5) The successive images (I_t) and poses (C_t) are compared with the initial image and pose by applying Algorithm 1.
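Step 4's random exploration can be sketched as uniform sampling inside the movement sphere, e.g., by rejection sampling (a sketch with our names; the simulator's actual trajectory generator is not described at this level of detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_camera_position(center, r_m):
    """Uniformly sample a camera position inside the sphere of radius r_m
    around the initial optical center (rejection sampling in the cube)."""
    while True:
        p = rng.uniform(-r_m, r_m, size=3)
        if np.linalg.norm(p) <= r_m:
            return np.asarray(center, float) + p
```

On top of this position, the camera orientation would be re-aimed so that the fixation point stays within the r_g circle on the image plane.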

C. Evaluation Methods
We consider two kinds of scenarios to test Algorithm 1. The first scenario is used to evaluate the accuracy of the depth estimation and the influence of the algorithm parameters. The performance of the algorithm is evaluated against two kinds of references. The first one is the depth image generated by the simulator [Fig. 3(c)], taken as the most accurate ground truth. The second one utilizes six square plates placed in the simulated scenario, on which six Aruco markers are printed [10]. The type of markers and their relative positions w.r.t. the initial camera location are shown in Table I and Fig. 3(b). The error is estimated from the standard deviation (STD) after 30 measures for each marker position using the Aruco marker detector algorithm. These error values provide information about the repeatability of the measurement, not its accuracy w.r.t. the ground truth. Using the Aruco markers, we can estimate the depth of each marker plane. In addition, each marker encloses an image area where the depth should be approximately the same; therefore, applying this mask to the obtained depth image and computing the mean and STD for each marker area, the result must be comparable to the distance estimated by the Aruco detector.
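The per-marker comparison can be sketched as a masked mean/STD over the estimated depth image (a sketch; in practice, the marker masks would come from the Aruco detector):

```python
import numpy as np

def marker_depth_stats(depth, mask):
    """Mean and standard deviation of the estimated depth inside one Aruco
    marker's image region; comparable to the marker-detector distance."""
    vals = depth[mask]
    return float(vals.mean()), float(vals.std())
```

A low STD inside a marker region indicates that the algorithm assigns a consistent depth to the (planar) marker surface, as expected.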
The second evaluation scenario is composed of a number of simulated objects with different shapes and textures in the same setup. Then, the obtained depth image is compared in each iteration with the real one using the mean square error between them. Several Aruco markers are also introduced in this scenario to be used as control points.

D. Experimental Tests
We present several experiments whose primary objective is the evaluation of the proposed algorithm. As secondary goals, we intend: 1) to study the influence of the choice of parameters on the performance and results of the adaptive process; 2) to evaluate the effect of a plausible Gaussian error in the inputs of the algorithm; and 3) to validate the algorithm in an environment with ordinary objects. To avoid shifts in the image due to changes in perspective, and to keep the set of control markers within the scene in all images, three virtual fixation points were selected at different distances from the initial position of the camera (which is the same for all experiments): d = {0.3, 0.6, 0.9} m. We try to avoid any interference produced by the choice of fixation points within the environment; in this way, it can be ensured that all Aruco markers appear in almost all images, and it is therefore possible to track them and compare against them in each iteration. Considering that the final objective is to obtain a depth estimation as similar as possible to the image generated by the simulated depth camera, two of the criteria used to evaluate the results are the structural similarity index (SSIM) [33] and the global mean square error (MSE), along with the STD between the depth images in each iteration. To check whether there are differences in the performance of the algorithm depending on the depth, we compare the distance estimation in the planes defined by the Aruco markers with the one produced by the algorithm in each iteration. Moreover, since the exact position of the plane corresponding to each marker is known, this value is compared with both the markers' estimation and the results of the algorithm.
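The two image-level criteria can be sketched as follows; note that the usual SSIM implementation (e.g., skimage.metrics.structural_similarity) averages over local windows, whereas this minimal version computes a single global window:

```python
import numpy as np

def mse(a, b):
    """Global mean square error between two images."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, L=1.0):
    """Single-window SSIM with the standard constants C1 = (0.01 L)^2 and
    C2 = (0.03 L)^2, where L is the data range."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + C1) * (2 * cov + C2)) /
                 ((mu_a ** 2 + mu_b ** 2 + C1) * (va + vb + C2)))
```

Identical images give an MSE of 0 and an SSIM of 1; structurally unrelated images push the SSIM toward (or below) 0.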
In addition, we defined a policy regarding how to proceed when the estimated value of the distance lies outside the defined limits of the work area, which can occur when there is an error in the optical flow estimation or in the position variation. One option was to reset the value to the initial distance; the alternative was not to adapt the value of Z_t* at all. After several tentative tests, the second policy was implemented.
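The adopted skip-update policy amounts to a one-line guard per pixel (a sketch with our names):

```python
def apply_update(Z, dZ, Zn, Zf):
    """Skip-update policy: if the adapted depth would leave the working
    range [Zn, Zf], keep the previous value instead of resetting it."""
    Z_new = Z + dZ
    return Z_new if Zn <= Z_new <= Zf else Z
```

Compared with resetting to the initial distance, this keeps the accumulated per-pixel estimate intact when a single noisy iteration produces an out-of-range step.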
1) Influence of the Choice of Parameters: It can be observed from the adaptive part of the proposed algorithm that ρ acts as the smoothing coefficient of an exponential mean filter, both for the squared gradient and for the G_t(i, j) adaptation. The choice of ρ must be modulated by the possible noise present in the estimation of the gradient and its derivative. It can be assumed that this noise has a similar effect on all depth-image pixels; therefore, the same value of ρ is taken for all of them.
The σ parameter, as defined by Zeiler [32], has a regularization function that prevents a zero value in the denominator of the τ_t estimate. Its importance changes depending on the relative value of the estimate of the squared gradient w.r.t. the σ value.
To study the influence of both parameters, we fix the rest of the system variables: the fixation point is placed at 0.6 m, and 0.1 m is assigned as the initial value of Z. The results are shown in Figs. 4 and 5. From the analysis of these plots, it is apparent that when σ is kept constant and the value of ρ is changed, the final MSE and STD are similar in all cases (in Fig. 4, the mean of the last 50 iterations is 0.0169 ± 0.0787 m). These results also suggest that, for a constant σ, the lower ρ is, the faster the algorithm converges (around 30 iterations for ρ = 0.4). In principle, it seems that the lower σ is, the better the obtained results are (Fig. 5). This trend, however, has a limit, and for a very low σ the results are poor.
In addition, we studied the influence of the initial value of Z on the algorithm results. Thus, we fixed the rest of the parameters and varied the value of Z_0. The obtained results are shown in Fig. 6: convergence to the final result appears to be faster the higher the value of Z_0.
2) Influence of Noisy Input Signals: Even though the gradient descent algorithm takes advantage of the parameters ρ and σ to filter, to some extent, the noise in the input signals, the estimation of the gradient and its derivative is still severely affected by this noise. To assess the impact of this issue, we introduced in our experiments a Gaussian error in the image that directly affects the precision of the optical flow computation. This error acts on each pixel individually, whereas an error in the estimation of the camera displacement affects the depth calculation in all pixels.
From this point of view, we used the same experimental conditions, that is, radius of movements r_m = 0.015 m, fixation point at a distance of 0.6 m, and the same captured RGB images and camera displacements. However, in each iteration we disturb the camera displacement computations with white Gaussian noise affecting its rotational and translational components, characterized by STDs φ_r and φ_t. The chosen values for φ_t are {0.0001, 0.0005, 0.0010} m, which represent {1.2%, 6.6%, 13.2%} of the maximum possible displacement, respectively. The selected values for φ_r are {0.005°, 0.1°, 0.3°}. The results obtained after the execution of the algorithm are shown in Fig. 7; gray-scale representations of the final depth images for the best and worst cases are shown in Fig. 8.
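The perturbation of the ego-motion signal can be sketched as additive zero-mean Gaussian noise on both components (the function name and the rotation-vector representation are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_egomotion(T, w, phi_t, phi_r_deg):
    """Add zero-mean Gaussian noise to the translational component T
    (STD phi_t, metres) and to the rotation vector w (STD phi_r_deg,
    degrees, converted to radians)."""
    T_noisy = np.asarray(T, float) + rng.normal(0.0, phi_t, size=3)
    w_noisy = np.asarray(w, float) + np.radians(rng.normal(0.0, phi_r_deg, size=3))
    return T_noisy, w_noisy
```

With the STDs set to zero, the ego-motion is returned unchanged, which makes the noiseless runs a special case of the same code path.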
3) Aruco Marker Comparison: The purpose of using Aruco markers in the simulation is twofold: first, to create surfaces whose distance to the camera is known, and second, to build a scenario that is easy to test when moving from simulation to reality. In simulation, it is straightforward to compare the results, since the distances from the markers to the camera frame are perfectly known; it is also possible to check the predictions that the Aruco algorithm makes for these known distances. In order to test these errors, we applied our algorithm in the same scenario, varying only the distance of the fixation point and keeping the rest of the parameters constant for all the tests. The obtained results are shown in Table II, for the six Aruco markers and the three fixation distances: the first column lists the relative errors between the estimation of the algorithm and the distance predicted from the Aruco markers, and the second column lists the relative errors between the estimation of the algorithm and the distance given by the simulator. A graphical example is shown in Fig. 9.

4) Environment With Ordinary Objects:
So far, only a simplified scenario has been considered, which has made it possible to evaluate the accuracy of the depth estimation as well as the influence of the algorithm parameters and of noise; all the surfaces involved were planes perpendicular to the camera. In order to test the effectiveness of the algorithm in more complex environments, models of several objects were chosen [26] and arranged in front of the camera in the same way as the markers. The RGB image for this scenario is shown in Fig. 10(a); it contains various types of objects in terms of shape, texture, and transparency. The corresponding depth image generated by the simulator is shown in Fig. 10(b); it is used as the reference ground-truth image.
To evaluate the results, we used MSE as before. Because MSE can yield misleading results in certain circumstances, SSIM was also used; it provides information about the structural similarity between the depth image generated by the simulator and the one estimated by the algorithm. For these tests, the parameters were given the values for which the best results in Fig. 8 were obtained without white-noise error, namely, ρ = 0.5 and σ = 0.005. The evolution of MSE and STD is shown in Fig. 11(b), with the expected behavior. Fig. 11(a) illustrates the evolution of the SSIM index along the whole adaptive process. The range of possible values for SSIM extends from 0 to 1, with values closer to 1 indicating greater similarity. In this case, the index oscillates between 0.75 and 0.85 at the end of the algorithm iterations for all selected fixation points, compared with the ideal depth image represented in Fig. 10(b).

IV. DISCUSSION
The results of the above experiments allow us to assess the robustness of the algorithm with respect to the addition of noise, as well as the influence of the parameters on its performance. In contrast to [7], here we evaluate the algorithm more thoroughly. For this reason, MSE and STD are used to measure the average error over the whole image. Using the simulator makes it possible to have a perfect ground truth against which to compute MSE and STD. We can also compare the results of our algorithm with those generated by the Aruco markers in the simulation.

A. Parameter Selection
In view of the results shown in Figs. 4 and 5, it is apparent that, as expected, the ρ parameter acts as a filter, stabilizing the final results at the cost of a larger number of iterations to reach them. On the other hand, the behavior of σ is more complex. As can be seen in Fig. 5, for a given ρ, the lower the value of σ, the better the overall result. However, if σ becomes too small, the algorithm gets frozen (purple line in Fig. 5); this is also the case for too high values of ρ (green line in Fig. 4). In any case, the choice of σ and ρ should be made jointly, since the closer ρ is to 1 (and, therefore, the more it filters), the higher the value of σ should be. Finally, the initial value Z_0 only conditions how soon a more or less stable result is reached, as shown in Fig. 6; it does not seem to have an effect on the final depth image.

B. Noise Addition
By adding a Gaussian error to the estimation of the camera position, which affects every pixel of the final depth image equally, we are pushing the application of the algorithm to the limit. Notwithstanding, it is in this case that the effects of σ and ρ become more apparent. These experiments also serve the purpose of establishing the error limits when applying the algorithm to a real robot. Our results raise some points for discussion: 1) the lower σ is, the more robust the performance seems to be in all cases; 2) increasing ρ tends to stabilize the algorithm results in some cases, depending on the value of σ, and with a limit (as in Fig. 4, where for ρ = 0.99 the algorithm hardly progresses); and 3) when the added noise error is too high, the resulting uncertainty generates nonvalid results.
As can be seen qualitatively in Fig. 8, the added error has a manifest influence on the quality of the results.

C. Aruco Markers Comparison
As suggested in [20], the accuracy of the Aruco markers decays with distance. Table II shows that the greater the distance to a given marker, the closer the results generated by the algorithm are to the simulator ground truth, compared with the values provided by the markers. This suggests that, for this particular case, the proposed algorithm is less sensitive than the Aruco markers to the error variation with distance.

D. Real Object Simulation
Both the numerical results obtained from the analysis of Fig. 11(a) and (b) and the qualitative results derived from Fig. 12 show that the proposed algorithm is able to determine the depth image of a more complex scenario. The evolution of SSIM and MSE is analogous, reaching the convergence value at iteration 40 in all cases. It is remarkable that SSIM reaches a value of about 0.80, which indicates a high structural similarity with the reference image. It is also important to highlight the behavior of the algorithm w.r.t. nontextured objects, such as the night lamp or the camera, for which the determination of the optic flow involves more difficulties. The behavior for semi-transparent objects, such as the wine bottle or the beer-mug handle, is also noteworthy: here the algorithm gives good results that could not be obtained, for instance, with standard depth sensors.

V. CONCLUSION
In this work, we have stated several hypotheses based on monocular human visual fixation. We have proposed a model that, from an initial image and camera pose, is able to estimate the corresponding depth map by considering the optical displacement induced in the images by the different camera poses, as a consequence of eye-head movements inspired by those involved in human fixation, namely, microdisplacements of the head and microsaccadic movements. It is important to highlight that our algorithm is agnostic to the specific robot hardware, as long as the robot is able to replicate the described 3-D camera fixation movements. In consequence, our conclusions extend to other robotic platforms, since it is only necessary to know the pose of the camera at each instant from the robot's proprioception and to compare it with the initial pose. We have studied the behavior of the proposed algorithm in several scenarios in order to quantify its stability w.r.t. noisy input signals and the influence of its parameters on its performance, both qualitatively and quantitatively. Our good results pave the way for an implementation on a real robot.