Validity and reliability of the International FItness Scale (IFIS) in preschool children

ABSTRACT
Objectives: To examine the validity and reliability of the parent-reported International FItness Scale (IFIS) in preschoolers. Method: A cross-sectional study of 3051 Spanish preschoolers (3–5 years). Fitness was measured with the PREFIT battery and reported by parents using an adapted version of the IFIS. Waist circumference was measured, and the waist-to-height ratio (WHtR) was calculated. Seventy-six parents of randomly selected schoolchildren completed the IFIS twice for a reliability assessment. Results: ANCOVA, adjusted for sex, age and WHtR, showed that preschoolers who were scored by their parents as having average-to-very good fitness had better levels of measured physical fitness than those who were classified as having “very poor/poor” fitness levels (18.1 to 22.1 laps vs 15.6 laps for cardiorespiratory fitness; 6.6 kg to 7.5 kg vs 5.3 kg for muscular fitness (handgrip); 71.7 cm to 76.4 cm vs 62.0 cm for muscular fitness (standing long jump); 17.2 s to 16.2 s vs 18.2 s for speed/agility; and 11.2 s to 15.6 s vs 8.7 s for balance; p < 0.001). The weighted kappa for concordance between parent-reported fitness levels and objective assessment was poor (κ ≤ 0.18 for all fitness measures). Overall, the mean values of the abdominal adiposity indicators were significantly lower in the high-level fitness categories reported by parents than in the low-level fitness categories (p < 0.05). The test-retest reliability ranged from 0.46 to 0.62. Conclusions: The reliability of the parent-reported IFIS is acceptable, but the concordance between parent-reported and objectively measured fitness levels is poor, suggesting that parents’ responses may not be able to correctly classify preschoolers according to their fitness level.

Highlights
The convergent validity and reliability (test-retest) values of the IFIS parent scale are moderately acceptable for assessing physical fitness in children aged 3–5 years.
However, the concordance results show that criterion validity is poor, suggesting that parents’ responses may not be able to correctly classify preschoolers according to their fitness level. Considering that the fitness level at these ages is fairly homogeneous, it seems difficult for parents to discriminate between the fitness levels of their children. Therefore, it seems necessary to recalibrate the scale in future work.


Introduction
Physical fitness is understood as the functional capability of body systems that allow performance of daily living activities and sports without effort according to age (Ortega et al., 2008). Good physical fitness level is considered an important marker of current and future health in youth (Ortega et al., 2008). In this regard, several studies have suggested that low levels of physical fitness in childhood are associated with an increased risk of cardiovascular disease and with musculoskeletal disorders and mental health problems in adulthood (García-Hermoso, Ramírez-Campillo, & Izquierdo, 2019;Lang et al., 2018;Ortega et al., 2008;Ruiz et al., 2009). Some anthropometric and socio-demographic factors (such as adiposity, physical activity, age or gender) are associated with fitness in childhood (Magnússon et al., 2008) and throughout life (Augste, Lämmle, & Künzell, 2015;Lämmle, Worth, & Bös, 2012), therefore these factors should be taken into account in studies examining children's fitness levels. Although studies focusing on preschool children (aged 3-5 years old) are scarce, research suggests that high levels of physical fitness at these early ages are associated with better body composition (Henriksson et al., 2016;Martinez-Tellez et al., 2016;Niederer et al., 2012), higher scores for cognitive functions (Lang et al., 2018;Latorre-Román, Mora-López, & García-Pinillos, 2016;Nieto-López et al., 2020) and, in general, higher health-related quality of life levels (Redondo-Tébar et al., 2019).
Given the positive relationship between physical fitness and health at early ages (García-Hermoso et al., 2019;García-Hermoso et al., 2020;Mintjens et al., 2018), the assessment of physical fitness in preschoolers has become highly relevant from clinical, educational, and public health perspectives. However, the assessment of physical fitness is not always feasible in large population-based studies in which time, equipment, facilities, and qualified personnel are very often limited.
The International FItness Scale (IFIS), a short and simple scale available in nine different languages, including Spanish, was originally developed for use in adolescents from nine European countries in the HELENA study. The IFIS provides a measure of fitness based on the answers to five basic questions about the perceived level of general physical fitness and of each fitness component (compared to friends), with answers given on a 5-point Likert scale (from very poor = 1 to very good = 5). This scale showed good validity and reliability in this population, as well as in a wide variety of other populations, such as young adults (Ortega et al., 2013), older adults (Merellano-Navarro et al., 2017), pregnant women (Romero-Gallardo et al., 2020), women with fibromyalgia (Álvarez-Gallardo et al., 2016), and children (aged 9-12 years) (Sánchez-López et al., 2015) from Spain and South America (De Moraes et al., 2019;Ramírez-Vélez et al., 2017). Moreover, fitness levels in children and adolescents measured with the IFIS have been shown to be strongly associated with adiposity and cardiovascular risk factors (De Moraes et al., 2019;Ortega et al., 2011;Ortega et al., 2013).
However, to accurately complete a questionnaire, the child must have cognitively reached the level of abstract thinking and be able to conceptualise frequency (Burrows, Martin, & Collins, 2010;Mindell, Coombs, & Stamatakis, 2014). This is not possible in children under 8 years of age (Livingstone & Robson, 2000); thus, it seems necessary to ask parents. However, parental reports also have limitations, as parents may be more prone to social desirability bias than children, as has been described in studies on health habits (De Bourdeaudhuij & Van Oost, 2000).
Although researchers quantify validity and reliability in a variety of ways, criterion validity concerns the agreement between the observed value and the true or criterion value of a measure, and retest reliability concerns the reproducibility of the observed value when the measurement is repeated; these have been considered the two most important aspects of measurement error in sports medicine and science (Hopkins, 2000). In addition, convergent validity, understood as the extent to which two measures of constructs that theoretically should be related are in fact related, may be another indicator of the robustness of the results provided by the IFIS and may enhance confidence that the construct is being captured (Kevin & Andrew, 2012).
Therefore, the aim of the present study was to examine the following: 1) the ability of the IFIS, scored by parents, to accurately classify Spanish children aged 3-5 years according to their objectively measured fitness levels (i.e. criterion validity); 2) the associations of the parent-reported IFIS with abdominal adiposity in preschool children (i.e. convergent validity); and 3) the test-retest reliability of the parent-reported IFIS.

Study design and participants
This study was conducted under the PREFIT project framework (http://profith.ugr.es/prefit). The main objective of this project was to assess physical fitness and anthropometric characteristics in preschoolers from 10 different cities across Spain. Data collection took place from January 2014 to November 2015. The study protocol was approved by the local Review Committee for Research Involving Human Subjects (no. 845), in accordance with the Declaration of Helsinki 1964 (and the 2013 revision) (Romero-Gallardo et al., 2020). Parents or legal guardians of all children included in the study provided written informed consent, and children gave their verbal consent to participate.
A total of 4,338 preschoolers and their parents were invited to participate in the PREFIT project. Finally, 3,179 parents agreed to participate in the study (73.7% participation rate). No differences were found between the age, sex and anthropometric variables of children who agreed to participate and those who did not. Finally, parent-reported complete data from 3,051 children (1,445 girls) were obtained.
For the reliability analysis, a subsample of 76 randomly recruited participants (45 girls and 31 boys) from a school in Granada city, not involved in the PREFIT study, was selected. They did not differ in age, sex, or anthropometric variables from children participating in the study.
The parents of these 76 participants successfully completed the IFIS twice (2 weeks apart). The questionnaires were sent to parents through their children in an open envelope. Once the questionnaire was completed at home, parents were asked to put it in the envelope, close it, and hand it to their child's teacher. After that, the teachers were responsible for sending the questionnaires to the members of the research team. The following instructions were sent to parents to answer the questionnaire: "Please mark with an X the option that best describes your child's fitness level (compared to his/her friends). Please answer all the questions and do not leave any blank. Mark only one answer per question".

Parent-reported fitness
Parent-reported fitness was assessed by the IFIS, which was originally validated in European adolescents. The original IFIS consists of a five-item Likert-type scale with five response options: very poor (1), poor (2), average (3), good (4) and very good (5). Four items each address a main self-perceived dimension of fitness (cardiorespiratory fitness, muscular fitness, speed-agility and flexibility), and one item addresses overall fitness (http://profith.ugr.es/IFIS). Taking into account a systematic review showing that in preschoolers flexibility is not associated with any health indicator and that balance may be a relevant component during earlier childhood, we decided to replace the item on flexibility with one on balance in the version of the IFIS for preschoolers.

Objectively measured physical fitness
The physical fitness variables were measured in the schools by experienced researchers under standardized conditions using the PREFIT battery (Ortega et al., 2015) as follows: Cardiorespiratory fitness (CRF) was assessed using the adapted version of the 20 m shuttle run test for preschoolers. Participants were required to run between two lines that were 20 m apart while keeping pace with audio signals emitted from a prerecorded CD. The initial speed was 6.5 km·h⁻¹ and was increased by 0.5 km·h⁻¹ per stage (1 min equals one stage). Children were encouraged to keep running as long as possible throughout the course of the test, and the test was finished when the child failed to reach the end lines concurrent with the audio signals on two consecutive occasions. The number of laps completed was recorded as an indicator of his or her CRF.
Muscular fitness (MF) was assessed using two tests: 1) the handgrip test (maximum handgrip strength assessment) using the analog version of a TKK dynamometer (TKK 5001, Grip-A, Takei, Tokyo, Japan) with the grip span fixed at 4.0 cm. The children squeezed gradually and continuously for at least 2-3 s, performing the test with the right and left hands in turn (Sanchez-Delgado et al., 2015). Children completed two trials (alternately with both hands) with a short rest period between them. The maximum score in kilograms for each hand was recorded, and the average (in kilograms) of both hands was used in the analysis; 2) the standing broad jump test (lower limb explosive strength assessment): from a starting position immediately behind a line, standing with feet approximately shoulder width apart, the schoolchildren jumped horizontally to achieve maximum distance. The best of three attempts was recorded in centimeters.
Speed/agility was measured using the 4 × 10 m shuttle run test, in which the child runs as fast as possible from the starting line to the line 10 m away and returns to the starting line, crossing each line with both feet every time. Two evaluators stood at each line, and the preschoolers had to touch the evaluator's hand and return to the starting line as fast as possible. Two attempts were made with an interval of at least five minutes, and only the best time was used for analysis. The time taken to complete the test was recorded to the nearest tenth of a second. For analyses, this variable was multiplied by −1, as less time represents a better result.
Static balance was assessed with the one-leg stance test. The test consisted of standing still on one leg while bending the other leg at approximately 90°. The test started when one of the legs was no longer in contact with the floor. The children had to maintain the balance position for as long as they could. In accordance with the original protocol, there were no upper-limb movement restrictions. The test finished when the child could not continue in the required position. The children had one attempt with each leg, and the average time was recorded in seconds.
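The scoring rules described above for the field tests can be summarized in a short sketch. This is an illustration only, not the code used in the study: the function names are hypothetical, and the inputs are assumed to be plain lists of trial values in the units stated in the text.

```python
from statistics import mean


def handgrip_score(right_trials_kg, left_trials_kg):
    """Maximum trial for each hand (kg), then the average of both hands."""
    return mean([max(right_trials_kg), max(left_trials_kg)])


def standing_broad_jump_score(attempts_cm):
    """Best (longest) of the attempts, in centimeters."""
    return max(attempts_cm)


def speed_agility_score(attempts_s):
    """Best (shortest) time, multiplied by -1 so that higher means better."""
    return -min(attempts_s)


def balance_score(right_leg_s, left_leg_s):
    """Average time of the single attempt made with each leg, in seconds."""
    return mean([right_leg_s, left_leg_s])
```

For example, a child with handgrip trials of 6.0 and 7.0 kg (right) and 5.0 and 6.0 kg (left) would receive the average of the two maxima, 7.0 and 6.0 kg.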

Abdominal adiposity variables
Experienced trained nurses and sports science graduates conducted the waist circumference (WC) and height measurements under standardized conditions.
Waist circumference was calculated as the average of two measurements taken with a measuring tape at the end of expiration, at the midpoint between the iliac crest and the costal margin, with the child standing upright. Thereafter, the waist-to-height ratio was calculated.
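As a minimal sketch of this calculation (the function name is hypothetical, and both quantities are assumed to be expressed in centimeters so that the ratio is unitless):

```python
def waist_to_height_ratio(wc_measurements_cm, height_cm):
    """Average the waist circumference measurements, then divide by height."""
    if not wc_measurements_cm or height_cm <= 0:
        raise ValueError("need at least one WC measurement and a positive height")
    mean_wc = sum(wc_measurements_cm) / len(wc_measurements_cm)
    return mean_wc / height_cm
```

With two measurements of 53.0 and 53.4 cm and a height of 106.4 cm, the ratio is approximately 0.50.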

Statistical analysis
Descriptive statistics included frequencies of each answer for the five questions on the IFIS by sex. The floor and ceiling effects of each item were evaluated by calculating the proportion of cases with minimum and maximum values, respectively.
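The floor and ceiling proportions for one IFIS item can be sketched as follows. The function name and the example responses are illustrative assumptions; the analyses in this study were run in SPSS.

```python
from collections import Counter


def floor_ceiling_effects(responses, lowest=1, highest=5):
    """Return the proportions of responses at the minimum and maximum values."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    n = len(responses)
    return counts[lowest] / n, counts[highest] / n


# Hypothetical answers to one item (1 = very poor ... 5 = very good).
floor, ceiling = floor_ceiling_effects([1, 3, 4, 4, 5, 5, 4, 3, 4, 4])
```

Here one of ten answers sits at the floor and two at the ceiling, so the two proportions are 0.1 and 0.2.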
Because of the small number of participants at the bottom extreme, the categories were merged as "very poor/poor" for the rest of the analyses, except for the reliability analyses, in which the raw data were used.
Criterion validity. To examine the ability of the IFIS to categorize children correctly into physical fitness levels, we performed analysis of covariance (ANCOVA), controlling for sex, age, and waist-to-height ratio. Objectively measured fitness variables were entered as dependent variables, and parent-reported fitness variables were entered as fixed factors. ANCOVA models were also used to test differences in the mean z-scores of each physical fitness component. In addition, to measure agreement between categories of parent-reported fitness levels (i.e. "very poor/poor", "average", "good", and "very good") and objective assessment (according to percentiles, i.e. <P25, P25-P50, P50-P75, >P75), a weighted kappa statistic (Cohen, 1968) was used to measure concordance beyond chance.
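The concordance analysis can be illustrated with a minimal sketch. Several details here are assumptions rather than the study's procedure: the text does not state whether linear or quadratic weights were used (linear is the default below), the handling of values that fall exactly on a percentile cut-off is arbitrary, and all function names are hypothetical. Categories are coded 0 to 3.

```python
from statistics import quantiles


def merge_ifis_category(score):
    """Map IFIS answers 1-5 to 0-3, merging 'very poor' and 'poor'."""
    return max(score - 2, 0)


def quartile_categories(values):
    """Code each measured value as 0 (<P25) to 3 (>P75) within the sample."""
    cuts = quantiles(values, n=4)  # P25, P50, P75
    return [sum(v > c for c in cuts) for v in values]


def weighted_kappa(rater_a, rater_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two lists of category codes 0..k-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k, n = n_categories, len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    power = 1 if weights == "linear" else 2
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance; disagreements between adjacent categories are penalized less than disagreements between distant ones.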
Convergent validity. Convergent validity was tested using abdominal obesity indicators (WC and waist-to-height ratio) as criteria, since abdominal obesity is one of the main predictors of cardiometabolic risk and has a close relationship with measured physical fitness in children (Henriksson et al., 2016;Martinez-Tellez et al., 2016). Thus, ANCOVA models controlling for sex and age were used to analyze the mean z-scores for WC and the waist-to-height ratio among categories of parent-reported fitness levels ("very poor/poor", "average", "good" and "very good").
In all ANCOVAs, pairwise posthoc hypotheses were tested using the Bonferroni correction for multiple comparisons.
Analyses were performed in SPSS v. 25 (IBM Corp, Armonk, NY, USA), and the level of significance was set at p < 0.05.

Results
Participants were 4.59 ± 0.88 years old; they had a mean BMI of 16.49 ± 1.77, and their mean WC was 53.18 ± 5.07 cm. Compared with girls, boys had higher values of/better performance in body weight, height, CRF, handgrip, standing broad jump, and speed-agility. In contrast, girls showed higher values of/better performance in WC, waist-to-height ratio, and balance. There were no differences in age and BMI (Table S1).
We observed a very low percentage (0.1-2.3%) of participants reported as having a "very poor/poor" fitness level. Approximately 60.0% of parents answered that their children have "good" fitness (Figure S1).
Criterion validity. Overall, compared with participants reported as having "very poor/poor" fitness levels, participants reported as having "average", "good", and "very good" CRF, MF, speed-agility and balance had better measured levels of CRF, MF, speed-agility and balance, respectively (p < 0.001) (Table 1). Figure S2 shows a dose-response association between parent-reported and measured physical fitness. In addition, the mean z-scores of each measured physical fitness component were significantly higher in preschoolers with a higher parent-reported fitness level. The number of children correctly and incorrectly classified by each method is presented in Table 2. The weighted kappa for the concordance between parent-reported and objective assessment was poor: κ = 0.11 (95% confidence interval, CI: 0.08-0.14) for cardiorespiratory fitness, κ = 0.13 (95% CI: 0.10-0.16) for handgrip strength, κ = 0.08 (95% CI: 0.05-0.10) for standing long jump, κ = 0.17 (95% CI: 0.14-0.20) for speed-agility and κ = 0.18 (95% CI: 0.15-0.21) for balance. The percentage of agreement ranged from 79.8 to 82.3%.
Table 1 notes. Bonferroni-adjusted pairwise comparisons: the symbol < in the column 1-2, for instance, indicates a significant difference (P < 0.05) in the direction 1 < 2; ns, non-significant. ‡ The lower the score (time in seconds), the better the performance.
Convergent validity. Figure 1 shows the association of parent-reported fitness with WC (panel A) and the waist-to-height ratio (panel B), controlling for age and sex. Overall, the mean scores of abdominal adiposity variables were significantly higher (p < 0.05) in those with lower parent-reported fitness, except for muscular fitness, for which mean values were higher in preschoolers classified as "good" or "very good" (p < 0.001).
Reliability. Table 3 displays the test-retest reliability statistics in children from Granada for the five items that compose the IFIS, i.e. overall fitness and the four main fitness components: CRF, MF, speed-agility, and balance. Weighted Kappa ranged from 0.46 (balance) to 0.62 (CRF), and the average weighted Kappa was 0.56.

Discussion
Since fitness at an early age predicts fitness levels through adolescence and adulthood (Janz, Dawson, & Mahoney, 2000;Shigaki et al., 2020), validating a short and easy-to-apply instrument seems to be a necessary task. To our knowledge, this is the first study to examine the validity and reliability of the parent-reported IFIS in children aged 3-5 years. These findings suggest that the reliability (test-retest) scores of the parent-reported IFIS are moderate. However, although the convergent validity values are acceptable, the concordance analysis shows that criterion validity is poor, which suggests that parents' responses may not be able to correctly classify preschoolers according to their fitness level.
Figure 1. Means of z-score values for waist circumference (A) and waist-to-height ratio (B) by parent-reported physical fitness categories in preschool children. * P < 0.05 between "Very poor/poor" vs "Good" and "Very good"; # P < 0.05 between "Average" vs "Good" and "Very good". All z-scores were computed specifically by sex and age.
As in other studies in children and adolescents (Sánchez-López et al., 2015), the distributions of responses to the IFIS questions suggest a "ceiling effect", since a high percentage of parents reported that their children had "good" or "very good" fitness levels. This is not surprising considering that at an early age, health problems are unlikely to have appeared, and parents think that their children are healthy. It is also interesting that in this study the highest percentage of responses was in the "good" category, while in a previous study in Spanish children aged 9-12 years (Sánchez-López et al., 2015) the highest percentage of responses was in the "very good" category, which suggests that children tend to overestimate their fitness relative to parental perception. However, more studies are necessary to examine this issue in depth.
Given the low number of parents who indicated "very poor" levels of physical fitness (0.1%), the IFIS does not allow the identification of preschoolers with very poor fitness levels. It is likely hard for parents to admit that their children have poor fitness, perhaps due to a social desirability bias (Kristiansen & Harding, 1984): when they rate their children's fitness level as very low, they may feel that they are indirectly recognizing that they are not doing enough to improve it. Although parents answered the questionnaire confidentially, it is likely that they felt the risk of being identified and judged. On the other hand, parents were informed that they were participating in a study on the importance of physical fitness in childhood, so it seems logical that fitness levels were overestimated in their responses, and this could be the reason why only a small percentage of parents marked the "very poor" option. Also, parents may not be fully aware of their children's fitness level, probably due to a lack of knowledge about what optimal or poor fitness means.

Consistent with previous studies (Ortega et al., 2013;Ramírez-Vélez et al., 2017;Sánchez-López et al., 2015) and with the original validation study of the IFIS, the current study observed acceptable agreement between parent-reported and measured fitness in preschoolers in the "average", "good" and "very good" categories using ANCOVA. However, the parent-reported IFIS was not a valid tool to detect those preschoolers who had a low or very low level of fitness. Since a low fitness level is not recognized by parents, it seems necessary to calibrate the scale in future research. A potential strategy to do this could be to reword the response options into the following categories: Very poor/poor (1), Average (2), and Good (3). In addition, special attention should be given to ensuring confidentiality, to ensuring that parents have the knowledge to discriminate among the fitness levels of their children, and to not giving out information about the researchers' stance on fitness status in children.
Three arguments can be put forward to explain the low agreement between the categories of fitness levels reported by parents and the objective assessment (concordance analysis). First, the categorization of the objective assessment by quartiles, without considering cutoffs based on clinical criteria, could have misclassified a non-negligible percentage of individuals. Therefore, the concordance would be higher than in other samples where parents would not report poor fitness levels but more children would be classified as <P25 in measured fitness, and likewise in the other categories. Second, the high homogeneity of the sample in terms of fitness levels, as can be seen in Table 1, where the mean ± SD intervals of the categories overlap to a large extent, makes it difficult for parents to discriminate among the different categories of fitness. Finally, the large number of response options could be another factor that makes it difficult for parents to correctly classify their children, so a smaller number of response options would help parents to identify the physical condition of their children.
In line with previous studies (Ortega et al., 2013;Ramírez-Vélez et al., 2017;Sánchez-López et al., 2015), which have reported strong associations of the IFIS with adiposity and cardiovascular risk factors, these results show that abdominal adiposity is higher in preschoolers with "very poor/poor" parent-reported fitness levels (CRF, speed/agility, balance, and overall fitness) than in those with "good/very good" fitness. These findings suggest that the IFIS has acceptable convergent validity for assessing physical fitness in this age group, which makes the scale more robust.
In the present study, abdominal obesity was lower in preschoolers with "very poor/poor" parent-reported MF than in preschoolers with "good/very good" MF. However, when WC is expressed relative to height (i.e. as the waist-to-height ratio), this association disappears. As in previous studies (Ortega et al., 2013;Sánchez-López et al., 2015), these results might suggest that when parents answer this item on the scale, they are thinking of absolute strength. Several studies have observed that children and adolescents with overweight/obesity scored higher on tests requiring strength without involvement of body weight (Artero et al., 2010;Gulías-González et al., 2014). Future researchers should consider the direct association between parent-reported MF and abdominal adiposity found in this study to properly interpret their results.
The test-retest reliability of the IFIS items ranged from 0.46 to 0.62 (average weighted Kappa = 0.56 for a two-week interval), which can be considered "moderate" to "good" agreement, supporting the reliability of the scale in preschoolers (Landis & Koch, 1977). Therefore, these findings suggest that this tool could provide similar measures in the same individuals at two different points in time, i.e. it has acceptable replicability, showing that it is only slightly affected by the memory, social desirability and learning biases that could have been sources of variation when parents filled in the questionnaires. The reliability of the scale was similar to that of the original version of the IFIS (average weighted Kappa = 0.58) but lower than that shown in other reliability studies in older children and adolescents (De Moraes et al., 2019;Ramírez-Vélez et al., 2017;Sánchez-López et al., 2015).

Limitations and strengths
The present study is of interest for public health since it provides a useful tool to assess physical fitness at a critical stage of life, when it is not possible to objectively evaluate it or when children have difficulties performing the tests correctly due to their level of cognitive and motor development. However, there are some limitations that should be highlighted: (1) the sample included preschool children from a single country, and it is unknown whether this scale would be appropriate for preschoolers from other countries with different characteristics; (2) children's physical fitness was evaluated by parent reports rather than by self-reports by the preschoolers. This fact may have affected the results since previous studies have shown low agreement between child self-reports and parent proxy reports when measuring health-related behaviours (Koning et al., 2018;Rebholz et al., 2014). Thus, it is debatable whether parents should answer about their children's fitness. Nevertheless, taking into account the cognitive level of children aged 3-5 years, it seems necessary to validate a questionnaire answered by parents for when it is not possible to assess the level of fitness objectively; (3) convergent validity was tested using indirect measurements (i.e. WC and waist-to-height ratio), and therefore more sophisticated modelling seems to be necessary to remove the influence of body mass and adiposity. Furthermore, other factors not assessed in this study, such as physical activity or energy intake, may have influenced the results; (4) although there have been some criticisms of the validity and reliability of the 20 m shuttle run test for estimating aerobic capacity because it is influenced by leg and stride length, it is also the most suitable field test for estimating CRF in epidemiological population-based studies, as evidenced by the fact that this test has been used in more than 177 studies, accumulating more than 1 million children and adolescents (Lang et al., 2019). Léger et al. (1988) also developed an equation to indirectly estimate the maximal oxygen consumption (VO2max) from the original 20 m shuttle run test. In this study we evaluated CRF using an adapted version of the 20 m shuttle run test, which has been suggested to be valid and reliable for assessing CRF in children under 6 years of age (Cadenas-Sánchez et al., 2014;Mora-Gonzalez et al., 2017); (5) the time interval between the two repeated measures for the reliability analysis is a debatable issue; an interval of two weeks was selected considering the previous literature of similar studies (Artero et al., 2011), and also taking into account that it is long enough for individuals not to remember their first responses and short enough for physical fitness not to have changed, both conditions that must be considered in test-retest reliability studies; and finally, although handgrip strength has known limitations for assessing strength as a single test, it is considered a practical, feasible and scalable functional measure of general strength for clinical and population-based screening and surveillance (Milliken et al., 2008).
In conclusion, the results of this study suggest that the reliability (test-retest) scores of the parent-reported IFIS are moderately acceptable. However, the agreement between the IFIS questionnaire and objectively measured fitness is low, suggesting that parents' perceptions do not seem to correctly classify preschoolers according to their fitness level.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The PREFIT project takes place thanks to the funding linked to the Ramón y Cajal grant held by Ortega FB (RYC-2011-09011).