Towards a cross-cultural assessment of binge-watching: Psychometric evaluation of the “watching TV series motives” and “binge-watching engagement and symptoms” questionnaires across nine languages

Abstract In view of the growing interest regarding binge-watching (i.e., watching multiple episodes of television (TV) series in a single sitting) research, two measures were developed and validated to assess binge-watching involvement (“Binge-Watching Engagement and Symptoms Questionnaire”, BWESQ) and related motivations (“Watching TV Series Motives Questionnaire”, WTSMQ). To promote international and cross-cultural binge-watching research, the present article reports on the validation of these questionnaires in nine languages (English, French, Spanish, Italian, German, Hungarian, Persian, Arabic, Chinese). Both questionnaires were disseminated, together with additional self-report measures of happiness, psychopathological symptoms, impulsivity and problematic internet use among TV series viewers from a college/university student population (N = 12,616) in 17 countries. Confirmatory factor, measurement invariance and correlational analyses were conducted to establish structural and construct validity. The two questionnaires had good psychometric properties and fit in each language. Equivalence across languages and gender was supported, while construct validity was evidenced by similar patterns of associations with complementary measures of happiness, psychopathological symptoms, impulsivity and problematic internet use. The results support the psychometric validity and utility of the WTSMQ and BWESQ for conducting cross-cultural research on binge-watching.

Viewers of television (TV) series are currently enjoying unprecedented levels of choice and convenience. No longer dependent on linear TV programming, they can now access as many TV series episodes as they want, regardless of time and place, due to the expansion of on-demand viewing services (e.g., Netflix, Hulu, Amazon Prime) widely available on internet-connected devices. In this context, online TV series watching is increasingly becoming a major part of many individuals' daily lives (Deloitte's digital media trends survey 2018, 2019). However, this major shift in TV series viewing patterns has also led to the emergence of binge-watching, which in the absence of a consensual definition, may be referred to as watching multiple episodes of TV series in a single sitting (Exelmans & Van den Bulck, 2017; Flayelle et al., 2020). Binge-watching has evolved into a common practice, especially among young viewers (Exelmans & Van den Bulck, 2017; Panda & Pandey, 2017; Spangler, 2016; YouGov Omnibus, 2017): recent market reports revealed binge-watching habits among 91% of 14- to 20-year-old and 86% of 21- to 34-year-old individuals (Deloitte's digital media trends survey, 2018).
Their construct validity was reflected in shared positive relationships, as well as associations with supplementary measures of affect and problematic internet use, attesting to the discriminatory ability of the BWESQ in distinguishing high (but healthy) involvement from problematic involvement in binge-watching. Building on the strength of this psychometric validation as well as a firm anchoring in prior phenomenological knowledge of binge-watching, the WTSMQ and BWESQ therefore appear to be valid and reliable assessment instruments that are particularly relevant for developing knowledge about binge-watching.
On the one hand, the WTSMQ may facilitate additional research into key determinants of and motives for binge-watching. On the other hand, by avoiding a priori consideration of binge-watching as an addictive disorder while acknowledging elevated involvement in itself, the BWESQ allows problem binge-watching research to move forward without passionate watching of TV series being inappropriately pathologized.
Nevertheless, given the widespread availability of on-demand viewing and online streaming technology (e.g., Netflix, the leading service in this area, currently reaches over 190 countries with 167 million subscribers worldwide; Netflix Media Center, 2020), the investigation of binge-watching should also consider cross-cultural factors, using measurement invariant assessment instruments to integrate and compare findings. The aim of the current study was, therefore, to test the psychometric properties of the WTSMQ and BWESQ across nine languages (i.e., Spanish, French, English, Hungarian, Italian, German, Arabic, Persian, and Chinese) in a large international sample of viewers of TV series, and to examine their measurement equivalence according to language and gender. The general assumption underlying this research effort was that both measures would operate similarly across cultures represented in this study. Additionally, drawing on the known correlates of binge-watching (i.e., diverse mental health issues, poor self-control) and the proposal that binge-watching may be problematic, relationships with relevant independent measures (e.g., self-reported happiness, psychopathological symptoms, impulsivity and problematic internet use) were investigated to assess construct validity in the nine translated versions.

Participants and procedure
An online survey was disseminated mainly among a college/university student population (N = 12,616) across seventeen countries and nine languages: Spanish (n = 3,312), French (n = 3,088), English (n = 2,580), Hungarian (n = 777), Italian (n = 673), German (n = 652), Arabic (n = 540), Persian (n = 512), and Chinese (n = 482). The respondents' countries of residence for each sub-sample are shown in Table 1, and their sociodemographic characteristics are reported in Table 2. Following an identical structure across languages, the online survey successively included: (1) a short demographic questionnaire and questions about TV series watching behaviors (i.e., viewing frequency, average time spent watching during a typical working day/day off, number of episodes usually watched in one viewing session); (2) the "Watching TV Series Motives Questionnaire" and the "Binge-Watching Engagement and Symptoms Questionnaire" (WTSMQ and BWESQ; Flayelle, Canale et al., 2019); (3) the "Subjective Happiness Scale" (SHS; Lyubomirsky & Lepper, 1999); (4) the "Brief Symptom Inventory" (BSI-18; Derogatis, 2001); (5) the "Short Impulsive Behavior Scale" (s-UPPS-P; Billieux et al., 2012); and (6) the "Compulsive Internet Use Scale" (CIUS; Meerkerk, Van Den Eijnden, Vermulst, & Garretsen, 2009). The original validated French versions of the WTSMQ and BWESQ were first translated into English, in accordance with the conventional translation and back-translation procedure (Beaton, Bombardier, Guillemin, & Ferraz, 2000), and all discrepancies 1 that emerged from the comparison between the back-translated and initial French versions were deliberated (between the first and last authors of this study and the French-English translator) until optimal agreement was found. The English versions of both scales were then shared with each national coordinator who replicated the same standardized process with the help of bilingual translators on site to adapt them into the remaining languages.
The majority of the additional validated questionnaires included in the survey were already available in all languages and, if not, another round of translation 2 was conducted by the local investigator.
All language-specific surveys were hosted on the same online platform (Qualtrics) and each national coordinator was responsible for distributing them in their respective academic environments (e.g., through advertisements during lectures, emails to students, announcements among university research participant pools and university social networks) 3 .
Data were collected between May 2018 and January 2019. Inclusion criteria were identical to those applied in the initial validation study (Flayelle, Canale et al., 2019): being at least 18 years of age, being fluent in the targeted language and having watched TV series episodes on a regular basis or more intensively (several episodes in one session) on DVD, computers, digital platforms or streaming devices, over the last six months. Participants provided informed consent before completing the survey with an average response time of 20 minutes.
Although the online survey participation was entirely voluntary, some study sites (Australia, South Africa, and the United States) provided participants with incentives (course credits or prize drawing) to boost participation rates. Anonymity and confidentiality were ensured throughout the survey completion as no data allowing the identification of participants were collected (e.g., internet protocol [IP] address), with the sole exception of email addresses when incentives were put in place. In such cases, the email contact list was only used for the draw purpose or the attribution of academic credits. This study obtained approval from the Ethics Review Panel 4 of the University of Luxembourg in addition to receiving clearance from the local Institutional Review Boards of some partner universities (those in Australia, Egypt, Hungary, South Africa, the United Kingdom, and the United States).

Watching TV Series Motives Questionnaire (WTSMQ)
The WTSMQ (Flayelle, Canale et al., 2019) is a 22-item scale assessing TV series watching motivations with four core dimensions: social (e.g., "I watch TV series to relate to others more easily, because TV series give me something to discuss."), emotional enhancement (e.g., "I watch TV series to be captivated and experience extraordinary adventures by proxy."), enrichment (e.g., "I watch TV series to develop my personality and broaden my views."), and coping/escapism (e.g., "I watch TV series to escape reality and seek shelter in fictional worlds."). Items are scored on a 4-point Likert scale ranging from 1 (not at all) to 4 (to a great extent), with an average score calculated for each subscale. The internal consistencies for all language-specific samples are presented in the following results section.

Binge-Watching Engagement and Symptoms Questionnaire (BWESQ)
The BWESQ (Flayelle, Canale et al., 2019) is a 40-item scale assessing binge-watching engagement and features of problematic binge-watching. The questionnaire consists of seven scales: engagement (e.g., "Watching TV series is one of my favorite hobbies."), positive emotions (e.g., "Watching TV series is a cause for joy and enthusiasm in my life."), pleasure preservation (e.g., "I worry about getting spoiled."), desire/savoring (e.g., "I look forward to the moment I will be able to see a new episode of my favorite TV series."), binge-watching (e.g., "When an episode comes to an end, and because I want to know what happens next, I often feel an irresistible tension that makes me push through the next episode."), dependency (e.g., "I get tense, irritated or agitated when I can't watch my favorite TV series."), and loss of control (e.g., "I sometimes try not to spend as much time watching TV series, but I fail every time."). Items are scored on a 4-point Likert scale ranging from 1 (strongly disagree) to 4 (strongly agree), with an average score calculated for each subscale.
The internal consistencies for all language-specific samples are presented in the following results section.

Subjective Happiness Scale (SHS)
The SHS (original English version; Lyubomirsky & Lepper, 1999) is a 4-item measure of global self-report happiness with respondents rating the extent to which they feel happy and unhappy (e.g., "In general, I consider myself a very happy person."). Participants evaluated each item on a 7-point rating scale, a mean total score (ranging from 1 to 7) being then computed. The internal consistency of the SHS ranged from .65 (Chinese version) to .88 (German version).

Brief Symptom Inventory-18 (BSI-18)
The BSI-18 (original English version; Derogatis, 2001) assesses general psychological distress with 18 descriptions of physical and emotional complaints distributed over three facets: depression (e.g., "Feeling no interest in things."), anxiety (e.g., "Feeling tense."), and somatization (e.g., "Trouble getting breath."). Respondents have to specify on a scale from 0 (not at all) to 4 (very much) to what extent they are troubled by such experiences. A total score is computed for each of the three subscales. The internal consistencies for all language-specific samples were high, ranging from .76 (Persian version; somatization) to .89 (Spanish version; depression).

Short Impulsive Behavior Scale (s-UPPS-P)
The s-UPPS-P (original French version; Billieux et al., 2012) is a 20-item scale evaluating five facets of impulsivity: negative urgency (e.g., "When I am upset I often act without thinking."), positive urgency (e.g., "When I am really excited, I tend not to think on the consequences of my actions."), lack of premeditation (e.g., "I usually think carefully before doing anything." − the item is reverse scored), lack of perseverance (e.g., "I generally like to see things through to the end."), and sensation-seeking (e.g., "I sometimes like doing things that are a bit frightening."). Items are scored on a 4-point Likert scale ranging from 1 (strongly agree) to 4 (strongly disagree). A total score is calculated for each of the five subscales. The internal consistencies of the s-UPPS-P subscales ranged from .60 (German version; positive urgency) to .92 (Italian version; lack of perseverance).

Compulsive Internet Use Scale (CIUS)
The CIUS (original English version; Meerkerk et al., 2009) is a 14-item scale assessing problematic internet use on five scales: loss of control (e.g., "Do you find it difficult to stop using the internet when you are online?"), preoccupation (e.g., "Do you think about the internet, even when not online?"), withdrawal symptoms (e.g., "Do you feel restless, frustrated, or irritated when you cannot use the internet?"), coping or mood modification (e.g., "Do you go on the internet when you are feeling down?"), and conflict [e.g., "Do you neglect your daily obligations (work, school, or family life) because you prefer to go on the internet?"]. Items are scored on a 5-point scale ranging from 0 (never) to 4 (very often), and are summed to yield a total single score. Internal consistencies were high across all language-specific samples, ranging between .86 (Arabic version) and .93 (Spanish version).

Statistical analyses
For data analyses, only full sets of responses 5 were explored, explaining sample size variations within the same language-based sample. In a first step, descriptive statistics concerning sociodemographic characteristics and TV series viewing patterns were computed to compile a profile of the whole and individual samples using the SPSS statistical package (version 24.0). Confirmatory factor analyses (CFAs) were then conducted for each language-specific sample, as well as for the overall sample, to examine the adequacy of fit of the 4-factor and 7-factor models derived from the initial WTSMQ and BWESQ validation (Flayelle, Canale et al., 2019). The software used to perform these analyses was EQS 6.4 (Bentler, 2006). [...] (Kline, 2015; Hooper, Coughlan, & Mullen, 2008). To respect the original factorial integrity of both scales and to ensure comparability between countries, we did not apply any modification to the models based on modification indices, even when minor changes (e.g., correlations between error terms) significantly increased the models' fit. Goodness of fit for the CFA models was assessed through the following indices: the root mean square error of approximation (RMSEA), the comparative and incremental fit indices (CFI and IFI, respectively), and the standardized root mean square residual (SRMR). An excellent model fit was identified when the CFI and the IFI were ≥ .95, the RMSEA ≤ .05, and the SRMR ≤ .05 (Bagozzi & Yi, 2011; Schermelleh-Engel & Müller, 2003). Using less restrictive criteria, values ≥ .90 for the CFI and the IFI, ≤ .08 for the RMSEA, and ≤ .10 for the SRMR were considered acceptable (Hooper et al., 2008).
For the sake of transparency, Satorra-Bentler chi-square (χ²), general model significance (p), and relative chi-square (χ²/df) were reported; however, given that χ² is highly sensitive to sample size (Jöreskog & Sörbom, 1993; Markland, 2007), which in our study far exceeds the standards required for conducting this type of analysis (Hair, Black, & Babin, 2010), these indices were not employed to assess the adequacy of the CFA models.
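The cut-offs described above can be expressed as a small lookup. The following is only an illustrative sketch: the function names and the "below threshold" label are assumptions introduced here, not part of the original analysis pipeline (which used EQS). The example values are the whole-sample WTSMQ indices reported in the Results.

```python
# Cut-offs as stated in the text: (direction, excellent, acceptable).
# "higher" means larger values indicate better fit; "lower" means the opposite.
CUTOFFS = {
    "CFI": ("higher", 0.95, 0.90),
    "IFI": ("higher", 0.95, 0.90),
    "RMSEA": ("lower", 0.05, 0.08),
    "SRMR": ("lower", 0.05, 0.10),
}


def classify_index(name, value):
    """Classify one fit index as 'excellent', 'acceptable' or 'below threshold'."""
    direction, excellent, acceptable = CUTOFFS[name]
    if direction == "higher":
        if value >= excellent:
            return "excellent"
        if value >= acceptable:
            return "acceptable"
    else:
        if value <= excellent:
            return "excellent"
        if value <= acceptable:
            return "acceptable"
    return "below threshold"  # hypothetical label, not used in the paper


def classify_fit(indices):
    """Classify every index in a {name: value} mapping."""
    return {name: classify_index(name, value) for name, value in indices.items()}


# Whole-sample WTSMQ values reported in the Results section.
print(classify_fit({"CFI": 0.89, "RMSEA": 0.060, "SRMR": 0.051}))
```

Classifying each index separately, rather than requiring all of them to pass jointly, mirrors the way the Results weigh the RMSEA and SRMR apart from the CFI and IFI.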
To assess whether the factor structures of the WTSMQ and BWESQ were valid for their use across different languages and in both genders 6 , multi-group CFAs according to language and gender were conducted. Specifically, we tested four levels of measurement invariance: 1) configural (test whether items load on the same factor across groups), 2) metric (test whether item factor loadings are equal across groups), 3) scalar (test whether item intercepts are equal across groups) and 4) error variance invariance (test whether item measurement errors are equal across groups). The adequacy of the increasingly constrained models was assessed through the difference between pairs of nested models (△) in the RMSEA, CFI and SRMR. A change ≥ .01 in the CFI, ≥ .015 in the RMSEA, and ≥ .03 in the SRMR indicates a significant decrease in the model fit when testing for measurement invariance (Chen, 2007). This procedure was also used to assess the adequacy of merging into a single dataset the data obtained in different countries for the same language (these results can be found in Supplemental Tables 2 and 3 at: https://osf.io/pxzw8/), a procedure that was performed before conducting the individual CFAs in each language-based dataset.
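As a rough illustration of the decision rule just described, the sketch below flags which fit indices change by at least the stated amounts between two nested models. The function name and its `indices` parameter are assumptions made for this example; the two sets of △ values are the language-based WTSMQ results reported in the Results section.

```python
# Change thresholds stated in the text (Chen, 2007), applied to absolute changes.
DELTA_THRESHOLDS = {"CFI": 0.01, "RMSEA": 0.015, "SRMR": 0.03}


def flag_noninvariance(deltas, indices=("CFI", "RMSEA", "SRMR")):
    """Return the indices whose absolute change meets or exceeds its threshold.

    An empty list means none of the inspected indices signals a significant
    loss of fit at this level of invariance.
    """
    return [
        name
        for name in indices
        if name in deltas and abs(deltas[name]) >= DELTA_THRESHOLDS[name]
    ]


# Metric invariance across languages (WTSMQ): no index is flagged.
print(flag_noninvariance({"RMSEA": 0.001, "SRMR": 0.010}))
# Scalar invariance across languages (WTSMQ): the SRMR change is flagged.
print(flag_noninvariance({"SRMR": 0.117}))
```

The `indices` argument makes it possible to restrict the check to a subset of indices, for instance the RMSEA and SRMR only.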
Reliability of the WTSMQ and BWESQ total scores and factors was assessed through the ordinal Cronbach's alpha (α) and the McDonald's omega (ω). Both indices were calculated using the R package "userfriendlyscience" (Peters, 2014). According to the criteria proposed by Hunsley and Mash (2008), reliability indices between .70 and .79 were considered appropriate, between .80 and .89 good, and ≥ .90 excellent. Finally, the construct validity of the WTSMQ and the BWESQ was appraised by investigating their relationships with age and SHS, BSI-18, s-UPPS-P and CIUS scores across all samples by means of Spearman's correlational analyses 7 , while Pearson point-biserial correlations were used to explore links with gender 8 . To account for multiple comparisons, the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) was also performed to hold the false discovery rate at 5% to mitigate against Type I errors.
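For readers unfamiliar with the Benjamini-Hochberg step-up procedure, a minimal sketch follows. It is a generic textbook implementation written for illustration, not the code used in this study, and the p-values in the example are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value lies under its rank-specific threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Invented p-values: only the two smallest survive the correction.
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30]))
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected whenever a larger p-value meets its threshold.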

Descriptive statistics
TV-series-watching characteristics and average scores for all questionnaire study variables are reported in Table 3.

Watching TV Series Motives Questionnaire (WTSMQ)

Structural analysis and measurement invariance across language and gender
The adequacy of the four-factor model from the preliminary WTSMQ validation was tested through CFA. This model proposes that the 22 items comprising this scale may be grouped into four correlated first-order factors (for a comprehensive description of the factorial structure and items distribution, see Flayelle, Canale et al., 2019). Given the confirmatory nature of this study, other competing models were not tested (e.g., unifactorial models, second-order factors). Results from individual CFAs for each language and across all samples are reported in Table 4. As expected, given the datasets' sample sizes, the p value of the Satorra-Bentler χ² did not exceed the .05 level required to consider the models' fit satisfactory. In addition, the CFI and IFI were consistently under the .90 threshold in all the assessed models, except for the Arabic sample and the whole dataset, in which both indices were near an acceptable value (.89). Like the χ², the CFI and IFI are sensitive to sample size (Rigdon, 1996), as well as to the item response scale (in particular, ordered categorical answer scales; Finney & DiStefano, 2013). As a result, Rigdon (1996) advised that the CFI is better suited to assess the adequacy of exploratory research designs (i.e., studies comprising small sample sizes) whereas alternative indices such as the RMSEA are better suited to confirmatory contexts (i.e., studies comprising large samples). Furthermore, Kenny and McCoach (2003) argue that the CFI tends to deteriorate in models comprising a large number of variables and indicators, especially for correctly specified models (note that the models described in this paper for the WTSMQ and BWESQ comprise 203 and 719 df, respectively).
In contrast, the RMSEA consistently demonstrates an opposite pattern: i.e., a systematic decrease in models comprising an increasing number of variables (Kenny & McCoach, 2003).
Given these limitations, we analysed the goodness of fit of our CFA models by relying on the recommendation made by Kenny and McCoach (2003), who suggest that complex models involving lower Tucker-Lewis index (TLI) and CFI values give no real cause for concern insofar as the RMSEA seems better. In our CFA models, the RMSEA and the SRMR were below the thresholds of .08 and .10 in all the language-based datasets as well as in the whole sample. The best adjustment according to these indices was obtained for the whole sample (RMSEA = .060; SRMR = .051) whereas the worst was obtained for the Persian dataset (RMSEA and SRMR of .079).
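The degrees of freedom mentioned above (203 and 719) can be recovered by simple counting. The sketch below assumes a covariance-structure CFA with simple structure (each item loads on exactly one factor), no correlated error terms, and freely correlated factors whose scale is fixed for identification; these assumptions and the function name are introduced here for illustration.

```python
def cfa_degrees_of_freedom(n_items, n_factors):
    """Degrees of freedom of a correlated-factors CFA with simple structure.

    Assumes no cross-loadings, no correlated errors, and one identification
    constraint per factor (for example, factor variances fixed to 1).
    """
    # Unique variances and covariances among the observed items.
    n_moments = n_items * (n_items + 1) // 2
    # One loading and one error variance per item, plus factor covariances.
    n_free = 2 * n_items + n_factors * (n_factors - 1) // 2
    return n_moments - n_free


print(cfa_degrees_of_freedom(22, 4))  # WTSMQ: 22 items, 4 factors
print(cfa_degrees_of_freedom(40, 7))  # BWESQ: 40 items, 7 factors
```

Under these assumptions the counts come out at 203 for the WTSMQ and 719 for the BWESQ, a difference of 516 df.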

[INSERT TABLE 4 HERE]
To test measurement invariance of the WTSMQ according to language and gender, we conducted a series of multi-group CFAs. As displayed in Table 5, language and gender configural invariance of the WTSMQ was supported (RMSEA = .065; SRMR = .067 [according to language]; RMSEA = .060; SRMR = .051 [according to gender]), so we subsequently estimated models with increasing levels of constraints to test higher levels of invariance. Regarding metric invariance, changes in the RMSEA and SRMR showed no significant worsening of model fit for either language (△RMSEA = .001; △SRMR = .010) or gender (△RMSEA = .001; △SRMR = .005). Similarly, the models' fit did not significantly decrease when subsequent levels of gender invariance were tested (△ in RMSEA and SRMR were always below .015 and .03, respectively), thus supporting a complete equivalence of the WTSMQ in males and females. However, the significant △ in SRMR when scalar and error invariance according to language was tested (.117 and .116) suggested the presence of differences at these levels of measurement according to the language of administration.
For language (not for gender) invariance, values for the △ in CFI exceeded the threshold of .01 (△CFI of .017, .012, and .022 for metric, scalar and error invariance).
However, following the same approach as individual CFAs, this CFI-based index was not considered to assess the adequacy of the invariance models.

Internal consistency
Reliability indices for the WTSMQ total score and factors are displayed in Table 6.
Few differences between ordinal Cronbach's alpha (α) and McDonald's omega (ω) were observed. Convergence between both indices was considered as a good indicator of scale reliability under different conditions (Zinbarg, Revelle, Yovel, & Li, 2005). For the whole sample as well as for the majority of the different language-based samples, both indices clearly exceed the criterion of .70 established by Hunsley and Mash (2008) to consider the reliability of a scale appropriate. The only exception was found in the Chinese dataset, where reliability for factor 4 was below .70 (α and ω of .60). Reliability for the other language-based datasets and for the whole sample ranged between .71-.92 and .82-.90 respectively, with most values indicating good to excellent scale reliability. Thus, the WTSMQ can be considered a reliable measure in each language-based sample.
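The reliability bands applied above can be written as a simple classifier. This is a hedged sketch: the function name and the "below criterion" label are assumptions made here, and the example values are taken from the ranges quoted in this paragraph.

```python
def reliability_band(value):
    """Map a reliability coefficient (alpha or omega) to the band used in the text."""
    if value >= 0.90:
        return "excellent"
    if value >= 0.80:
        return "good"
    if value >= 0.70:
        return "appropriate"
    return "below criterion"  # hypothetical label for values under .70


# Values quoted above: Chinese factor 4, then the lower and upper bounds
# of the range for the other language-based datasets.
for coefficient in (0.60, 0.71, 0.92):
    print(coefficient, reliability_band(coefficient))
```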

Binge-Watching Engagement and Symptoms Questionnaire (BWESQ)

Structural analysis and measurement invariance across language and gender
The adequacy of the seven-factor model from the preliminary BWESQ validation was tested through CFA (following a similar data-analytic approach to the one used for the WTSMQ). This model proposes that the 40 items comprising this scale may be grouped into seven correlated first-order factors. As displayed in Table 4, goodness of fit indices for the BWESQ individual CFAs were acceptable for all the language-based datasets (RMSEA ranging between .056-.062 and SRMR ranging between .057-.074) and in the whole sample (RMSEA = .059; SRMR = .063). Consistent with our expectations that the low CFI and IFI values were linked to the degree of complexity of our CFA models (in terms of number of indicators and latent variables) and not to a truly poor fitting factorial structure, we observed a significant decrease of these indices in the results for this scale (note that the BWESQ model has 516 more df than the WTSMQ model); conversely, results for the RMSEA are slightly better (the tendency documented by Kenny and McCoach in increasingly complex models; Kenny & McCoach, 2003).
Results from measurement invariance of the BWESQ across languages and gender are displayed in Table 5. Results are notably similar to those reported for the WTSMQ.
The small changes in the fit indices at the next steps also supported metric invariance according to language (△RMSEA < .000; △SRMR = .012) and gender (△RMSEA = .001; △SRMR = .006). Furthermore, the increase in the level of measurement constraints at the subsequent steps did not result in a significant deterioration of the models' fit (△RMSEA = .001; △SRMR < .000 [scalar invariance]; △RMSEA = .001; △SRMR = .006 [error invariance]) across gender groups, providing strong evidence that the BWESQ operates similarly in males and females. However, scalar invariance according to language was only partially supported (△RMSEA = .007 and △SRMR = .031; i.e., very close to the .03 threshold) and error variance invariance was rejected (△SRMR = .037). Although the △ in CFI was not used to assess the adequacy of the multi-group models, all the values except for the language error variance invariance (△CFI = .011) were below .01, thus supporting different levels of measurement equivalence between the language versions of the BWESQ and in both genders.

Internal consistency
Reliability indices for the BWESQ total score and factors are displayed in Table 6.
Once again, few differences between ordinal Cronbach's alpha (α) and McDonald's omega (ω) were observed, and the majority of reliability values were good to excellent (even better than for the WTSMQ). Apart from the Cronbach's alpha from factor 7 in the Chinese dataset (α = .68; ω = .71) and from factor 5 in the German dataset (α = .67; ω = .71), reliability was always above .70. In particular, reliability for the rest of the language-based datasets and for the whole sample ranged between .72-.97 and .75-.96 respectively, once again with a clear preponderance of values indicating excellent scale reliability. As a result, the BWESQ can be considered a reliable measure for each language-based sample, even more reliable than the WTSMQ (which might be due to the higher number of items comprising each scale as well as the whole scale).

Scale inter-correlations and convergent validity
The correlation ranges obtained among all samples between the WTSMQ and BWESQ with one another, and between each of them with additional measures (i.e., age, gender, and scores on the SHS, BSI-18, s-UPPS-P and CIUS) are reported in Tables 7, 8 and 9. The comprehensive review of language-specific correlations together with the nine language-versions of the WTSMQ and BWESQ can be found at: https://osf.io/pxzw8/.
On the whole, positive relationships emerged in all samples between the various subscales of the WTSMQ and BWESQ. In this regard, the emotional enhancement and coping-escapism motivations systematically encompassed the largest associations with all BWESQ-related dimensions, with non-problematic binge-watching factors (i.e., engagement, positive emotions, pleasure preservation, desire/savoring) being more strongly related to emotional enhancement, whereas problematic-binge-watching-related facets (i.e., dependency, loss of control) were more strongly connected to coping-escapism.
As for external correlates, although exhibiting a small effect size (Cohen, 1988), what particularly stands out across all languages is a stronger positive association between gender and the coping/escapism motivation. Coping/escapism also consistently presented the strongest small to moderate negative relationships with happiness (i.e., SHS total score), and a similar relationship was observed with dependency in the BWESQ. Similarly, all the BSI-18 domains (i.e., depression, anxiety, somatization) displayed more pronounced small to medium relationships with coping/escapism and dependency, followed by binge-watching and loss of control. In all samples, although small in magnitude, the association between impulsivity and motivations for viewing TV series was higher for coping/escapism with negative urgency, positive urgency, lack of premeditation and lack of perseverance, whereas sensation-seeking was more related to the enrichment motive. Among the BWESQ-related domains, the s-UPPS-P subscales' scores were repeatedly associated to a greater extent (small to medium effects) with problematic binge-watching factors (i.e., binge-watching, dependency, loss of control), with negative urgency and sensation-seeking being more specifically connected to dependency, positive urgency to binge-watching, and both lack of premeditation and lack of perseverance to loss of control. Finally, and concurrent with the aforementioned relationships, the CIUS total score was in all instances more strongly related to problematic binge-watching factors (i.e., binge-watching, dependency, loss of control), as well as to the coping/escapism motivation, involving mainly moderate to large positive associations.

Discussion
The present study investigated the psychometric properties of the "Watching TV [...]
Consistent with the initial validation study (Flayelle, Canale et al., 2019) and with our main hypothesis, the factorial structures of both scales were found to replicate appropriate adjustments across all languages in the light of the fit indices (e.g., RMSEA, SRMR) considered better suited in view of our confirmatory framework and the complexity of the assessed models (Kenny & McCoach, 2003; Rigdon, 1996). As such, the theoretical factor models underlying these two instruments hold across languages/cultures represented in this study. Additionally, overall measurement invariance according to language and gender was supported for both, thus implying that, whichever the language spoken, male and female TV series viewers interpreted the WTSMQ and BWESQ items in a conceptually similar manner.
[...] Whang, Lee, & Chang, 2003; Yee, 2007). In this respect, it is worth noting the stronger association identified across samples between coping/escapism and being female, which is somewhat reminiscent of the higher rates of depression in women (Albert, 2015; Cyranowski, Frank, Young, & Shear, 2000; Nolen-Hoeksema, 1990). Furthermore, other potentially addictive behaviors (e.g., gambling) are more strongly related to negative reinforcement motivations in females as compared to males (Zakiniaeiz & Potenza, 2018).
The current findings therefore suggest problematic binge-watching may involve maladaptive coping or emotion-regulation strategies, as in other potentially addictive behaviors (Flayelle, Maurage et al., 2019a, 2019b; Rubenking & Bracken, 2018; Tukachinsky & Eyal, 2018).
[...] (Vallerand, 2015; Vallerand et al., 2003), which has emphasized that harmonious passion (i.e., significant involvement performed in harmony with other aspects of one's life) is especially related to adaptive correlates of TV series watching, while obsessive passion (i.e., excessive involvement that generates conflict with other activities) is more specifically linked to maladaptive ones (Orosz, Vallerand, Bőthe, Tóth-Király, & Paskuj, 2016; Tóth-Király et al., 2019). Taken together, the current results emphasize the reliability and validity of the WTSMQ and BWESQ measurement instruments over the nine languages, and provide evidence of their utility for future cross-cultural research on problematic binge-watching that is able to avoid pathologizing such a popular leisure activity.
Several limitations should be underlined. First, from a methodological standpoint, the means employed to collect data varied between sites (notably with some relying on the use of incentives), thereby generating certain gaps in the local sample sizes obtained. Still, no major differences exist as for the models' goodness of fit between the samples where incentives were offered or not. Second, as the data are cross-sectional and self-reported, biases related to social desirability, lack of introspection or memory recall might be present, potentially

3 Note that the study was also advertised in the popular press in France.
4 Project identification code: ERP 18-008.
5 A total number of 14,672 respondents started to fill in the questionnaires, with 73% of them completing the entire survey.
6 Given the very low prevalence of participants having reported "transgender" and "other" about their gender identity, only male and female data were considered in such analyses.
7 Spearman's correlations were used to address non-normal distribution of data.
8 In line with the above-mentioned reason, only two categories of data (i.e., male and female) were included in the correlational analyses.

Note. CFA = confirmatory factor analysis; χ² = Satorra-Bentler chi-square; df = degrees of freedom; χ²/df = normed chi-square; RMSEA = root mean square error of approximation; CFI = comparative fit index; IFI = incremental fit index; SRMR = standardized root mean square residual. All models are significant at p < .001.

Note. CFA = confirmatory factor analysis; χ² = Satorra-Bentler chi-square; df = degrees of freedom; χ²/df = normed chi-square; RMSEA = root mean square error of approximation; CFI = comparative fit index; IFI = incremental fit index; SRMR = standardized root mean square residual; △RMSEA = change in RMSEA compared with the previous model (expressed in absolute values); △CFI = change in CFI compared with the previous model (expressed in absolute values); △SRMR = change in SRMR compared with the previous model (expressed in absolute values). All models are significant at p < .001.