Evaluating Different i*-based Approaches for Selecting Functional Requirements while Balancing and Optimizing Non-Functional Requirements: A Controlled Experiment

Context: A relevant question in requirements engineering is which set of functional requirements (FRs) to prioritize and implement, while keeping non-functional requirements (NFRs) balanced and optimized. Objective: We aim to provide empirical evidence that requirements engineers may perform better at the task of selecting FRs while optimizing and balancing NFRs using an alternative (automated) post-processed i* model, compared to the original i* model. Method: We performed a controlled experiment designed to compare the original i* graphical notation with our post-processed i* visualizations based on Pareto efficiency (a tabular and a radar chart visualization). Our experiment consisted of solving exercises of varying complexity for selecting FRs while balancing NFRs. We considered the efficiency (time spent to correctly answer exercises) and the effectiveness (regarding time: time spent to solve exercises, independent of correctness; and regarding correctness of the answer, independent of time). Results: The efficiency analysis shows that subjects were 3.51 times more likely to solve exercises correctly with our tabular and radar chart visualizations than with i*. Moreover, i* was the most time-consuming (effectiveness regarding time), had a lower number of correct answers (effectiveness regarding correctness), and was affected by complexity. The visual or textual preference of the subjects had no effect on the score. Beginners took more time to solve exercises than experts when i* was used (no difference was found when our Pareto-based visualizations were used). Conclusion: For complex model instances, the Pareto front-based tabular visualization results in more correct answers than the radar chart visualization. When we consider effectiveness regarding time, the i* graphical notation is the most time-consuming visualization, independent of the complexity of the exercise. Finally, regarding efficiency, subjects consume less time when


Introduction
Several studies have shown that effective requirements engineering (RE) is a critical success factor in software projects, e.g. (Verner et al. 2005; Nasir et al. 2011). Generally, two types of requirements are distinguished: functional requirements (FRs), which describe the system services, behaviour or functions to be provided, and non-functional requirements (NFRs), which include (quality) attributes or constraints on the application to build or on the development process (Glinz 2007). Taking NFRs into account while elaborating the FRs from the early design phases significantly improves end-user satisfaction (Ameller et al. 2010). Approaches that incorporate goal-oriented techniques, such as i* (Yu, 1997), used as a framework in this article, are ideal candidates for this purpose, as they explicitly represent FRs (as tasks) and NFRs (as softgoals), and allow the impact of FRs on NFRs to be denoted (using contribution links). Like other modelling languages, i* has an accompanying graphical notation, which enables modellers to easily create and communicate requirement models and to gain an overall insight into the requirements of the system. Despite the importance of the graphical notation and its effect on model interpretation (e.g., Nordbotten and Crosby 1999), very little empirical evidence is available regarding its efficiency and effectiveness (see Related Work), nor are there works comparing different visualizations that consider different purposes for interpreting a model (see footnote 1).
A pertinent challenge in requirements engineering is to identify an optimal subset of all identified FRs to implement, within the scope of available resources, that satisfies the demands of customers (Zhang et al. 2008). This requires requirements engineers to correctly interpret and compare models varying in the set of FRs they define, while maintaining a balanced satisfaction of NFRs according to priorities determined by the user. This is important in several contexts: during the requirements specification and negotiation phase, where requirements are agreed upon between developers and clients within time and budgetary constraints, while ensuring compliance with the clients' priorities in terms of NFRs; during the requirements elicitation and analysis process, when FR alternatives need to be compared and decided upon with respect to NFRs; in incremental development, such as the Scrum agile methodology (Schwaber and Beedle 2002), where a subset of requirements is chosen for implementation in each iteration ("sprint"); and when conceiving a new release of existing software, where a list of new features (FRs) needs to be chosen for inclusion in the next release.
Footnote 1: According to (Atkinson & Kuhne, 2003), models have two main characteristics: on the one hand, the concepts available for creating models and the rules governing their use (i.e., abstract syntax), and on the other hand, the notation used to depict models (i.e., concrete syntax). For the sake of clarity, from now on in this paper we use "model" when we refer to the abstract syntax of models, and "visualization" when we refer to their concrete syntax.
To tackle this challenge, we introduced in our previous work (Aguilar et al. 2011) a method based on Pareto efficiency (Szidarovszky et al. 1986), which post-processes an i* model in an automatable way and produces two Pareto front-based visualizations: a table and a radar chart. This helps designers make informed decisions by understanding the trade-offs necessary to obtain a well-balanced, optimized satisfaction of NFRs, and allows them to prioritize FRs more easily. Recognizing the importance of model visualization (Nordbotten and Crosby, 1999), and considering that the original i* graphical notation was not developed with comparing and balancing alternative requirements specifications in mind, we developed two custom visualizations to support our Pareto efficiency-based model: a tabular visualization (Aguilar et al. 2011) and a radar chart visualization (Aguilar et al. 2012), both of which capture NFR optimization while making it easy to compare different subsets of FRs to implement.
In this article, we briefly recap our Pareto efficiency-based RE approach for comparing subsets of FRs while balancing NFRs, and present an extensive experimental evaluation of the original i* graphical notation (based on the original i* model) and of two novel visualizations (tabular and radar chart, both based on our post-processed i* model). The objective is to determine which visualization is better (regarding efficiency and effectiveness) under which conditions, and to provide supporting empirical evidence. We focus on the efficiency (time spent to correctly solve a problem) and the effectiveness (regarding time: time spent to solve a problem, whether the answer is correct, partially correct or wrong; and regarding correctness of the solution, independent of time) of these visualizations when used by designers. The evaluation was set up as a controlled experiment consisting of a set of modelling exercises with two levels of complexity. The level of expertise of the designers participating in the experiment was heterogeneous (i.e., beginners and experts in goal-oriented modelling).
Moreover, due to the different types of visualizations (i.e., graphical and tabular), it is important to study the relation between the learning style of the subjects and the notation used. We have determined the learning style of the subjects by performing a Felder test (Felder and Silverman, 1988).
The remainder of this paper is structured as follows: Section 2 describes related work. Section 3 briefly describes the three types of visualization evaluated, by means of an example. Section 4 describes the experimental methodology used to validate our hypotheses. The analysis of the data obtained is presented in Section 5. Section 6 presents the threats to the validity of the experiments. In Section 7 the results obtained are discussed. Finally, the conclusion and future work are presented in Section 8.

Related work
Requirements Engineering (RE) approaches include mechanisms to help designers to understand the trade-offs that are necessary to prioritize FRs while optimizing NFRs satisfaction.
Regarding FR prioritization, in (Salado and Nilchiani, 2015), the authors propose an Adaptive Requirements Prioritization (ARP) method that improves decision making between conflicting functional requirements using the principles of multidimensionality. Its efficiency is evaluated using Monte Carlo simulations for a variety of priority dimensions and priority levels. Bridging simulation and real-world experiments, an interesting work is the one proposed in (Benestad and Hannay, 2012), where an artificial and a field experiment are performed using the same prioritization techniques to assess whether they affect stakeholders' selection of software product features. Furthermore, in (Duan et al. 2009), the authors present an approach for automating a significant part of the prioritization process. The method applies data mining and machine learning techniques to prioritize functional requirements according to stakeholders' interests, business goals, and cross-cutting concerns such as security or performance requirements.
Recent efforts to simplify the prioritization process include an algorithm for prioritizing FRs in the incremental software development model according to dependency relationships between requirements (Alzyoudi et al. 2015). In (Jackson, 1999), the author addresses the importance of theoretically sound and practical methods for classifying and prioritizing product requirements by developing a structured approach to gather, analyze, and aggregate stakeholder input. In (Shannin and Zairi, 2009), the authors show how the Kano model and a questionnaire are used for classifying and prioritizing customer requirements in the international airline industry.
There are also efforts that focus on NFRs, and more specifically on optimizing and balancing them. For example, in (Broster and Coombes, 2011) (Douglas, 2010) defined 11 major relationships between FRs and NFRs focused on cost, risk, schedule, communication, and quality perceptions. A dynamic decision-making infrastructure that supports both the representation and the monitoring of NFRs, and that reasons about their degree of satisfaction at runtime, is presented in (Almeida et al. 2015). The infrastructure is composed of an extended feature model aligned with a domain-specific language for representing the NFRs to be monitored at runtime, a monitoring infrastructure to continuously assess NFRs at runtime, and a flexible decision-making process to select the best available configuration based on the satisfaction degree of the NFRs. This work therefore makes it possible to quantify the level of satisfaction with respect to the NFR specification.
In several situations, post-processing of modeled requirements may be beneficial to better interpret the model instances for a particular purpose. This is important in order to know how best to visualize the stakeholders' needs so as to obtain an optimal final product. There are several approaches that post-process requirements models for a particular purpose. For example, (Buarque et al. 2013) introduce a new approach, called OOM-NFR, which processes initial requirements expressed as i* diagrams to obtain OO-Method conceptual models (Pastor et al. 2001) that allow designers to more easily consider both FRs and NFRs when defining the appropriate configuration of the application to be generated. Also, in (Abirami et al. 2015), the authors present a framework for post-processing RE models in order to automatically detect and segregate the FRs and NFRs, thus obtaining an improved conceptual model with more information about NFRs to be considered in later stages of development (e.g., for defining constraints).
For the sake of NFRs optimization, two important dimensions of requirement engineering are visualization of requirements, as stated by (Pohl, 2013); they form a relevant research topic (see e.g. the survey of (Cooper et al. 2009)). Some papers deal with approaches for visualizing NFR trade-offs. In (Rahimi et al. 2014), the authors focus on adequately capturing NFRs such as security, performance, and usability, and present a data mining approach for automating the extraction and subsequent modelling and visualization of NFRs from requirement documents. (Zhang et al. 2008) explain, in their position paper, the advantages of using the Pareto front to identify optimal choices and trade-offs for stakeholders once an initial set of requirements has been gathered. A survey on Pareto-optimal Search-Based Software Engineering is presented in (Sayyad and Ammar 2013).
Finally, (Zhang et al. 2011) highlight the problem for requirement engineers to find a set of requirements that reflect the needs of several different stakeholders, while remaining within budget. The authors introduce and evaluate two multiobjective evolutionary optimization algorithms for the automated analysis of requirements assignments when multiple stakeholders are to be satisfied by a single choice of requirements. In this paper, the authors use radar charts to illustrate the tensions between the stakeholders' competing requirements in the presence of increasing budgetary pressure, i.e., they are used as a visualization mechanism to easily understand trade-offs between budget and satisfaction of stakeholders.
However, although the literature provides a large number of works on techniques for visualizing requirements, less effort has been reported on the empirical evaluation of different requirements visualizations. Only a few works performed a limited theoretical analysis, using visual notation theory, of KAOS (Matulevičius et al. 2007) and i* (Moody et al. 2009), respectively. In (Horkoff et al. 2010), the authors developed an approach for visualizing reasoning through i* models in order to help analysts understand conflicts among alternatives (e.g., goals with conflicting paths).
These visualization mechanisms were tested with some case studies that suggest that further visualization mechanisms could support analysis. In (Ernst et al. 2006), the authors present graphical and textual annotations in i* diagrams to denote four quality attributes (NFRs): degree of certainty, feasibility, trustability and performance of a goal. For example, the degree of trust is denoted by thickness of delegation links. A theoretical evaluation of this visualization technique is presented. In the context of the object-oriented analysis (OOA) method, (Gemino 2004) performed an empirical validation to assess the effectiveness of animations and narrations to complement textual descriptions and static OOA diagrams when validating requirements, and concluded that in particular the latter might have a positive effect.
In conclusion, while several efforts have been done to improve requirement

Goal oriented Requirements models
In this section, we briefly describe the two i* models used in this article, the original i* model and our post-processed Pareto efficiency-based model, and focus, by means of an example, on the visualizations evaluated in the experiment: the original i* diagram for the former, and a tabular and a radar chart visualization for the latter.

i* modeling
The i* modeling framework is a goal-oriented requirements engineering (GORE) technique that incorporates social analysis by modeling the relationships between different actors. It consists of two models: the strategic dependency (SD) model to describe the dependency relationships among various actors in an organizational context, and the strategic rationale (SR) model, used to describe actor interests and concerns and how they are addressed. The SR model provides a detailed way of modeling internal intentional elements and relationships of each actor.
Intentional relationships are means-end links representing alternative ways of fulfilling goals; task-decomposition links representing the necessary subcomponents for a task to be performed; or contribution links, which model how an intentional element contributes to the satisfaction or fulfilment of a softgoal (see Figure 1 (B)). Possible labels for a contribution link are "Make", indicating complete satisfaction of a softgoal, "Some+", indicating a strong positive contribution, and "Help", indicating a smaller positive contribution; their negative counterparts are "Break", "Some-" and "Hurt". Finally, the label "Unknown" indicates an unknown (positive or negative) strength of the contribution.
Figure 2 shows an example of a simple i* diagram from our experiment. This i* diagram describes a system for managing surveys, having a main goal "Survey to be performed", which can be achieved by means of three FRs ("tasks" in i* terminology), in this case navigational requirements: "Interactive interview", "Perform interview", "Private questionnaire". Each of these tasks is further decomposed into sub-tasks and required resources, and affects one or more NFRs. For example, "Interactive interview" is decomposed into the sub-tasks "Establish chat connection" and "Perform interview", it requires the resource "SurveyRepository", and it helps usability and hurts reliability.

Tabular visualization of Pareto Front
Our Pareto-based approach assists requirements engineers in evaluating and prioritizing FRs while trading off NFRs so as to improve the application's quality. To do so, our approach builds upon the goal-oriented RE approach i*, with the aim of supporting and improving it using Pareto efficiency. The Pareto efficiency algorithm is based on computing the Pareto front, which is useful when there are multiple competing and conflicting objectives that need to be balanced (Sayyad & Ammar 2013). Pareto efficiency is a notion from economics widely applied in engineering, which can be described as follows: "given a set of individuals, a set of alternative allocations, and a set of allocation-dependent valuations, an allocation A is an improvement over allocation B only if A can make at least one particular valuation better than B, without making any other worse". Intuitively, the Pareto front is the set of allocations that cannot uniformly be improved.
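The dominance relation in the definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by the approach: the function names `dominates` and `pareto_front`, the configuration names and all numeric values are hypothetical, and we assume that each configuration is summarized by one aggregated value per NFR, where higher is better.

```python
from typing import Dict, Tuple

Scores = Tuple[int, ...]  # one aggregated value per NFR, higher is better (assumption)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every NFR and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(configs: Dict[str, Scores]) -> Dict[str, Scores]:
    """Keep only the configurations that no other configuration dominates."""
    return {
        name: scores
        for name, scores in configs.items()
        if not any(dominates(other, scores) for other in configs.values())
    }


# Hypothetical configurations scored on (reliability, anonymity, usability).
example = {
    "C1": (1, 0, 2),
    "C2": (0, 2, 1),
    "C3": (0, 0, 1),   # dominated by C1 and by C2
    "C4": (-1, 2, 1),  # dominated by C2
}
front = pareto_front(example)
print(sorted(front))
```

In this toy input, C1 and C2 remain because each is better than the other on at least one NFR, which is exactly the kind of trade-off the designer is then asked to resolve.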
Applying this principle to our setting, the set of individuals refers to the set of  shown as rows. For each configuration, an "I" in a cell means the corresponding FR was implemented, and an "N" means it was not implemented. The cells containing numerical values correspond to the sum of all the contribution links for that NFR in this configuration, where "make" contributes +4, "some+" contributes +2 and "help" contributes +1. Negative contribution links are graded with the corresponding negative values. The complete specification of the experiment is available at https://github.com/josezubcoff/soft_expt to allow its replication.
Table 1. An example of a tabular visualization of a Pareto front model: complexity "simple" and type "tabular" (corresponding to the i* diagram in Fig. 2).
Config. | R0 | R1 | R2 | Reliability | Anonymity | Usability
Furthermore, the tool allows interactively adding/removing Pareto front configurations to/from the radar chart.
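The scoring rule behind the numerical cells of the tabular visualization can be illustrated as follows. This is a sketch under stated assumptions: the positive weights are those given in the text, the negative weights are our reading of "correspondingly negatively graded", the "unknown" label is left out because the text assigns it no value, and the contribution links of R0, R1 and R2 are hypothetical rather than taken from Fig. 2.

```python
from itertools import product
from typing import Dict, List, Tuple

# Positive weights as stated in the text; negative ones mirror them (assumption).
WEIGHTS = {"make": 4, "some+": 2, "help": 1, "break": -4, "some-": -2, "hurt": -1}

NFRS = ["reliability", "anonymity", "usability"]

# Hypothetical contribution links: FR -> [(NFR, label), ...].
LINKS: Dict[str, List[Tuple[str, str]]] = {
    "R0": [("usability", "help"), ("reliability", "hurt")],
    "R1": [("reliability", "some+"), ("anonymity", "hurt")],
    "R2": [("anonymity", "make"), ("usability", "hurt")],
}


def score(implemented: Tuple[bool, ...]) -> Dict[str, int]:
    """Sum the weighted contribution links of the implemented FRs per NFR."""
    totals = {nfr: 0 for nfr in NFRS}
    for fr, chosen in zip(LINKS, implemented):
        if chosen:
            for nfr, label in LINKS[fr]:
                totals[nfr] += WEIGHTS[label]
    return totals


def row(implemented: Tuple[bool, ...]) -> str:
    """Render one table row: I/N marks per FR followed by the NFR sums."""
    marks = " ".join("I" if chosen else "N" for chosen in implemented)
    totals = score(implemented)
    return marks + " " + " ".join(f"{totals[nfr]:+d}" for nfr in NFRS)


# One row per possible configuration (implemented / not implemented per FR).
rows = [row(config) for config in product([True, False], repeat=len(LINKS))]
for line in rows:
    print(line)
```

Enumerating every configuration is feasible here only because the example has three FRs; the rows that survive a Pareto front filter would be the ones shown to the designer.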

Experiments
In this section, we describe the definition, design, and settings of the controlled experiment we conducted. The context of the controlled experiment is described following the guidelines described in (Kitchenham et al. 2002) and the subjects as proposed by (Höst et al. 2000) to perform empirical studies in Software Engineering.

Experiment definition
The overall aim of our experiments was to compare three different visualization techniques used within goal-oriented modeling to select which FRs to implement, while optimizing NFRs. We considered and compared the following techniques: (i) the i* graphical notation, based on the original i* model, (ii) a tabular visualization of FR and NFR configurations, based on the post-processed i* Pareto front model, and (iii) a radar chart visualization of these configurations, based on the post-processed i* Pareto front model. For these three techniques, we tested the efficiency (time to correctly solve a problem) and the effectiveness of the results according to time (independent of the score) and according to correctness (independent of time), under two different levels of complexity (i.e., simple and complex). In particular, we evaluated the following main hypotheses:
• Is there a relation between the complexity of an i* model, the type of visualization, and the correctness of selected configurations? We studied this hypothesis separately by classifying correctness into three levels (i.e., correct, partially correct and incorrect).
• Is there a relation between the complexity of an i* model, the type of visualization and the time required to select an FR configuration? We studied this hypothesis separately for the three different visualizations (i* diagram, tabular and radar chart) and for the different complexities (simple and complex), and subsequently determined whether there is any interaction between the type of visualization and the complexity of model instances.
After obtaining the results of the analysis, and due to the different nature of each visualization of the i* model instances (textual or visual), we performed a follow-up study to verify an additional hypothesis:
• Is there a relation between the subject's learning style (textual or visual) and the time required to select an FR configuration? We also determined the subject's learning style to see its possible influence on the time required to select an FR configuration within our experiments.
To this aim, we set up a controlled experiment with two fixed factors: the three types of visualizations and two kinds of complexity levels. Specifically, simple and complex model instances differ by the number of intentional elements that directly influence the decision for NFR optimization: softgoals, contribution links, and tasks with outgoing contribution links. All simple model instances follow the same pattern: 3 softgoals, 7 contribution links and 3 tasks with outgoing contribution links (as well as 1 goal, 7 other tasks and 1 resource). For complex model instances, we considered instances that have 5 softgoals, 14 contribution links and 6 tasks with outgoing contribution links (as well as 3 goals, 3 other tasks and 2 resources). The complex model instances of this experiment represent partial cases of real world scenarios.
Consequently, we developed six exercises to be solved by the subjects of the experiments (one per type of visualization and per level of complexity). For each exercise, we observed two variables: time (measured in seconds) and score (ranging from 0 to 2). Time is measured using the subjects' response time in accomplishing the required tasks. The time variable provides a measure for assessing efficiency, as spending more time to correctly solve an exercise indicates a less efficient notation. Score is measured by expert judgment of the subjects' answers: 0 if the result is wrong, 2 if the result is right, and 1 when the given solution contains the right result but is incomplete. Score thus serves as a measure for analyzing the effectiveness regarding correctness. Finally, we analyzed the time spent for all scores to assess the effectiveness regarding time, and the time spent for correct answers only to assess the efficiency. Consequently, we can assess the efficiency and effectiveness of each visualization by statistically analyzing the time and score variables.
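To make the distinction between the three measures concrete, the following sketch derives them from a list of answers. The `Answer` record, the function names and all observations are hypothetical and serve only to illustrate the definitions; they are not data from the experiment.

```python
from statistics import mean
from typing import List, NamedTuple


class Answer(NamedTuple):
    visualization: str  # "i*", "tabular" or "radar"
    time_s: int         # response time in seconds
    score: int          # 0 = wrong, 1 = right but incomplete, 2 = right


# Hypothetical observations, not data from the experiment.
answers: List[Answer] = [
    Answer("i*", 300, 2), Answer("i*", 420, 0), Answer("i*", 360, 1),
    Answer("tabular", 90, 2), Answer("tabular", 110, 2), Answer("tabular", 100, 1),
]


def mean_time_all(vis: str) -> float:
    """Effectiveness regarding time: mean time over all answers."""
    return mean([a.time_s for a in answers if a.visualization == vis])


def mean_time_correct(vis: str) -> float:
    """Efficiency: mean time over fully correct answers only (score == 2)."""
    return mean([a.time_s for a in answers
                 if a.visualization == vis and a.score == 2])


def share_correct(vis: str) -> float:
    """Effectiveness regarding correctness: fraction of answers with score == 2."""
    scores = [a.score for a in answers if a.visualization == vis]
    return sum(s == 2 for s in scores) / len(scores)


for vis in ("i*", "tabular"):
    print(vis, mean_time_all(vis), mean_time_correct(vis), share_correct(vis))
```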
In addition, due to the different nature of the visualization techniques used (textual and visual), we studied the relation of the learning style (textual or visual) of the subject and the type of visualization. To this aim, we performed a Felder test (Felder and Silverman, 1988) to determine the learning style of each subject.
Finally, the experience of the subject dealing with FRs was considered when studying the different types of visualization. As some of the subjects have extensive previous i* experience, and others have little i* experience, the efficiency could show different behavior according to the experience of the subject.

Experiment context
The context of the controlled experiment is described next, following the guidelines described in (Kitchenham et al. 2002).

Subjects
In order to ease the generalization of the results, the subjects are identified. The

Objects
As previously stated, in our experiment we defined six exercises: three having a "simple" complexity level, and three having a "complex" level. Each exercise of each complexity level uses one of the following types of visualization: (i) i* diagram, (ii) tabular visualization of Pareto front, and (iii) radar chart visualization of Pareto front.
Each exercise contains three questions that ask the subject which set of FRs is best to implement while optimizing certain NFRs. For each exercise, the subject is asked to write down the answers to the three questions: the first question asks for the best configuration for satisfying one NFR, the second aims to satisfy two NFRs (according to a specific priority), and the third asks for satisfying three NFRs (according to a specific priority). For example, in the survey system exercise represented in Figure 2, the following three questions are asked:
Question 1: which tasks do you need to implement to maximize usability?
Question 2: which tasks do you need to implement to maximize usability and reliability at the same time (equal priority)?
Question 3: which tasks do you need to implement to maximize usability (1st priority), and then reliability and anonymity (both 2nd priority)?
The model instances were different for each type of visualization (i* diagram, tabular Pareto front or radar chart) and level of complexity (simple or complex).
Furthermore, we randomized the order in which the exercises were solved. This was necessary to avoid a repeated-measures effect (i.e., subjects learning from a previous model/visualization); in addition, we ensured that for each level of complexity, the underlying model instances for every type of visualization had exactly the same difficulty. This was done by ensuring that each model had the same number of FRs, NFRs, tasks, resources, means-end and contribution links, and the same hierarchy of tasks. Furthermore, we had an equal number of positive and negative outgoing contribution links from tasks, and an equal number of positive and negative incoming contribution links for each softgoal.
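A randomized solving sequence of this kind could be generated as sketched below. The text does not state how the randomization was actually carried out, so the per-subject seeding and the function name `solving_sequence` are assumptions made purely for illustration.

```python
import random
from typing import List, Tuple

VISUALIZATIONS = ["i* diagram", "tabular", "radar chart"]
COMPLEXITIES = ["simple", "complex"]

# The six exercises: one per type of visualization and level of complexity.
EXERCISES: List[Tuple[str, str]] = [
    (vis, level) for vis in VISUALIZATIONS for level in COMPLEXITIES
]


def solving_sequence(subject_id: int) -> List[Tuple[str, str]]:
    """Return the six exercises in a random order, reproducible per subject."""
    rng = random.Random(subject_id)  # seeding by subject id is an assumption
    order = EXERCISES.copy()
    rng.shuffle(order)
    return order


for subject in (1, 2):
    print(subject, solving_sequence(subject))
```

Seeding a separate generator per subject keeps each sequence reproducible, which helps when the assignment has to be documented for replication.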

Hypothesis formulation
The main objective of our experiments was comparing three different visualizations for optimizing NFRs and how they respond to the complexity.
Therefore, we tested the relation between the visualization used (i.e., i* notation, tabular and radar chart visualization) and the complexity in order to assess the effectiveness regarding correctness of the answers. The null hypotheses were formulated for correct answers:
-H01: There is no interaction between the type of visualization and the complexity level, for correct answers (score=2).
-H02: There are no differences between the types of visualization, for correct answers (score=2).
-H03: There are no differences between the complexity levels, for correct answers (score=2).
We also compared the visualizations according to effectiveness regarding time. The null hypotheses were:
-H04: There is no difference in the time spent to identify a set of FRs to implement while optimizing NFRs between types of visualization.
-H05: There is no difference in the time spent to identify a set of FRs to implement while optimizing NFRs between simple and complex levels of complexity.
-H06: There is no interaction between type of visualization and complexity levels, when measuring the time spent.
-H07: There is no interaction between type of visualization, complexity levels and score when measuring the time spent.
Furthermore, we performed a similar analysis of the efficiency, where we considered the time spent for correct answers. The null hypotheses for this variable were:
-H08: There is no difference in the time spent to correctly identify a set of FRs to implement while optimizing NFRs between different types of visualization.
-H09: There is no difference in the time spent to correctly identify a set of FRs to implement while optimizing NFRs between simple and complex levels of complexity.
-H010: There is no interaction between type of visualization and complexity level, when measuring the time spent for the correct answers.
In addition, two further analyses were done to study the impact of the subjects' experience in modeling FRs and of the learning style (visual or textual) identified by the Felder test (Felder and Silverman, 1988). To analyze whether a relation exists between the learning style of subjects and their performance, measured as time spent to solve the exercise, we tested the following hypotheses (addressing both effectiveness regarding time and efficiency):
-H011: There is no difference in the time spent to identify a set of FRs to implement while optimizing NFRs, for different learning styles of subjects.
-H012: There is no difference in the time spent to correctly identify a set of FRs to implement while optimizing NFRs, for different types of visualization.
-H013: There is no interaction between type of visualization and learning style, when measuring the time required to correctly identify a set of FRs to implement while optimizing NFRs.
Finally, we compared the results obtained from experts and beginners to evaluate the influence of the subjects' experience when testing the effectiveness regarding time of the type of visualization. The null hypothesis was:
-H014: There is no difference in time required between experts and beginners when using i*, tabular and radar chart visualization to identify a set of FRs to implement while optimizing NFRs.
If the null hypothesis can be rejected with a low margin of error, we may accept an alternative hypothesis, which admits a positive effect of the type of visualizations and/or complexity/learning-style/experience on the effectiveness/efficiency of the model.

Identification of Main factors and cofactors
In our experiment the main factors were the type of visualization used to optimize NFRs (i* diagram, tabular and radar chart visualizations) and the complexity of the represented model (simple or complex). We also assessed whether there is an interaction between the main factors, that is, a combined effect of both factors on the dependent variable (time or score). In addition, to better assess the effect of the type of visualization, it was necessary to control other factors (called co-factors) that may have an effect on the dependent variables. Those co-factors were the subjects' previous i* experience and learning style.

Measurement of dependent variables
We considered two dependent variables, score and time, as follows:
• Time is measured in seconds, using the subjects' response time in accomplishing the required tasks. To do so, the subjects' starting and end times for each exercise are recorded. Therefore, time is a continuous variable.
• Score ranges from 0 to 2 in discrete values (categorical variable). Score is measured by expert judgment of the correctness of the subjects' answers:
o A score of 0 is obtained if the result is wrong
o A score of 2 is obtained if the result is right
o A score of 1 is obtained when the result contains the right solution but is incomplete

Experimental trials
The experiments were performed with the test subjects. There are six exercises in total: two for each type of visualization (i* diagram, Pareto front-based tabular and radar chart), one with complexity level simple and the other complex. All subjects solved all exercises. In order to avoid learning effects, each exercise is related to a different case study (for more details see Section 6 Threats to Validity).
Before the experiments, subjects were trained on the different approaches.
Specifically, all individuals had previous knowledge of requirements engineering with i*. For the other visualizations (table and radar chart), we held an ad hoc 30-minute training session before conducting the experiments. Because these visualizations are based on the i* diagrams, this was enough time for the subjects to understand them (since all of them know i*). We also gave the subjects detailed instructions related to the tasks to be performed.
Each exercise was done individually. Subjects could take as much time as was needed to solve each exercise (i.e., no fixed time). The subject recorded the starting and ending time for each answer in hours, minutes and seconds, and the answer to the questions, i.e., one or more configurations of FRs.
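To make the notion of an answer as "one or more configurations of FRs" more concrete, the following sketch filters a set of candidate configurations down to the Pareto-efficient ones. It is a generic illustration of Pareto efficiency, not the post-processing used to build the tabular and radar chart visualizations; the configuration names and NFR satisfaction values are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical FR configurations with satisfaction levels for two NFRs
# (higher is better). These values are illustrative only.
configs: Dict[str, Tuple[float, ...]] = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.5, 0.5),
    "D": (0.2, 0.9),
}


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every NFR and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(items: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of configurations that no other configuration dominates."""
    return sorted(
        name
        for name, scores in items.items()
        if not any(dominates(other, scores) for other in items.values())
    )


print(pareto_front(configs))
```

In this toy data set, configuration C is dominated by B, so only A, B and D remain as candidate answers.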

Data analysis
The diagram in figure 4 presents an overview of the analysis strategy followed.
We have divided the data analysis results into four subsections, with results for score, time, visual preference and experience. A descriptive analysis is presented in each of the four sections as a first step. Afterwards, we include the results of the relevant statistical tests to verify the formulated hypotheses. To analyze effectiveness regarding correctness, after a descriptive analysis, we tested the effect of type of visualization and complexity on the score (Fig. 4). We analyzed effectiveness regarding time by testing the time spent to solve the exercise, with type and complexity as fixed factors in a multifactorial ANOVA. Here, we used all answers, to understand the behavior across the types of visualization between the two levels of complexity. Because the variability around the mean time was heterogeneous across these factors, a logarithmic transformation was needed to ensure homogeneity of variances. After that analysis, we selected only the correct answers (score = 2) to go deeper into the efficiency analysis. We used the Tukey HSD test (Jaccard et al. 1984) for the post hoc analysis when needed.
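For reference, a two-factor fixed-effects ANOVA on log-transformed time is conventionally written as below. The notation is ours and is given only as a reading aid; the exact model specification used in the analysis is not stated in the text.

```latex
\log(t_{ijk}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here $t_{ijk}$ is the time of the $k$-th answer for visualization type $i$ (i*, tabular, radar chart) and complexity level $j$ (easy, complex), $\alpha_i$ and $\beta_j$ are the main effects, and $(\alpha\beta)_{ij}$ is the interaction term whose significance is tested.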
The analysis of the time spent, considering only those answers with the maximum score, shows heteroscedasticity (Fig. 5). The time does not show homogeneity of variances across the type, score and complexity levels (p-value under 0.05 even with square root or logarithmic transformations). In this case, we proceeded with the ANOVA, setting the significance level to 0.01.
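The effect of a logarithmic transformation on unequal variances can be illustrated with a rough check that compares the largest and smallest group variances before and after the transformation. This is a simplified sketch on invented response times; the text does not name the homogeneity test that was used, and a formal test (for example Levene's) would normally replace the plain ratio computed here.

```python
import math
from statistics import variance

# Hypothetical response times in seconds for two groups; NOT data from the experiment.
times = {
    "i*": [60, 120, 240, 480, 960],
    "radar": [30, 45, 60, 90, 120],
}


def variance_ratio(groups: dict) -> float:
    """Largest sample variance divided by the smallest (1.0 means equal variances)."""
    variances = [variance(values) for values in groups.values()]
    return max(variances) / min(variances)


raw_ratio = variance_ratio(times)
log_ratio = variance_ratio(
    {group: [math.log(t) for t in values] for group, values in times.items()}
)
print(f"variance ratio, raw times: {raw_ratio:.1f}")
print(f"variance ratio, log times: {log_ratio:.1f}")
```

A ratio much closer to 1 after the transformation indicates that the spread has become more comparable across groups; when it stays large, as reported above for the correct answers, a stricter significance level is one way to proceed.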

Analysis of subject learning style preference
To analyze the subject's textual or visual learning preference, we performed a Felder test, with which we obtain the subject's learning style preference. Then, ANOVA was used to detect if there is any evidence that the relationship between the preference of the subjects and the type of visualization has any influence on the time spent.

Analysis of experience
We tested whether the experience of the subjects (previous knowledge of i*) has any effect on time. We used experience and type of visualization as fixed factors and time as the dependent variable. We then analyzed the effect of experience with ANOVA, followed by the Tukey HSD test when a significant difference was found.

Results
We have divided the results into four subsections, focusing respectively on effectiveness, efficiency, visual or textual preference and influence of experience analysis.

Analysis of score
The results for score, which represent effectiveness regarding correctness, are summarized in Table 2, which shows the total numbers of correct, incorrect and partially correct answers by type and complexity level. For example, each participant had to solve 3 questions on easy (complexity) i* graphical notation (type), so the total number of answers for that type of exercise should be 96. However, not all participants answered all questions, and we got 84 answers.
Considering the score results, we applied an analysis strategy separating the score levels into correct (score = 2), incorrect (score = 0) and partially correct answers (score = 1). From Table 2, we highlight some results: (i) overall, there is a similar number of correct (64) versus incorrect and partially correct answers (68) for the i* graphical notation; more correct answers (80 versus 52) for the tabular visualization; and fewer correct answers for the radar chart visualization (52 versus 79); (ii) the i* graphical notation has a larger number of wrong answers than the tabular and radar chart visualizations, even for easy model instances; (iii) as the model becomes complex, the number of correct answers for the i* graphical notation decreases sharply (-44%), while it decreases only mildly for the radar chart visualization (~ -15%) and stays nearly constant for the tabular visualization (less than a 5% decrease); (iv) the total number of partially correct answers is much lower than that of correct answers for the i* graphical notation and the tabular visualization, whereas for the radar chart the correct and partially correct answers are more evenly spread (52 and 48 respectively); and (v) the Pareto tabular visualization has similar numbers of correct answers for easy and complex model instances. The dependent variable in the ANOVA was the number of correct answers (score = 2) per exercise. ANOVA was not able to detect a significant interaction between Type and Complexity (F2,12=0.597; p=0.566); hence, H01 cannot be rejected.
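The overall counts quoted in point (i) can be turned into shares of correct answers with a few lines of Python. Only the six counts come from the text; the helper name `share_correct` is ours, and the per-complexity breakdown of Table 2 is not reproduced here.

```python
# (correct, incorrect + partially correct) counts quoted in the text from Table 2.
counts = {
    "i* graphical notation": (64, 68),
    "tabular": (80, 52),
    "radar chart": (52, 79),
}


def share_correct(correct: int, other: int) -> float:
    """Fraction of answers that obtained the maximum score."""
    return correct / (correct + other)


for name, (correct, other) in counts.items():
    print(f"{name}: {share_correct(correct, other):.1%} correct")
```

Roughly half of the i* answers, about three fifths of the tabular answers and about two fifths of the radar chart answers were fully correct.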
Subsequently, the differences in the combined effect were not significant. This is mainly due to the high variability in the number of correct answers with the i* graphical notation (type of visualization). This behavior may be hiding the tendency shown in Figure 5, where the number of correct answers with the i* graphical notation decreases sharply on complex models, while there is no clear pattern for the other two types of visualization.
We found no significant differences in the mean number of correct answers by [...] Table 3. [...] Table 3 is -1.4264, and the OR indicates that it is almost 4 times more probable to solve an easy Pareto exercise than an i* complex model, an evident result. Time, considered as a covariate in the analysis of effectiveness, has a significant negative effect, which is interpreted as follows: as time increases, the probability of getting correct answers diminishes. As the estimate is close to zero (estimate=-0.0070), this is only a minor effect; however, it is significant (p-value=0.00049). This behavior is further confirmed in the efficiency analysis.

Figure 6: Estimation of probability for score=2 by Complexity and Type
The interaction plot in figure 6 shows that all three types of model visualization obtain similar probabilities for solving the exercises when the model is easy.
However, when modeling becomes complex, the probability to correctly solve the problem when using the i* graphical notation is halved. Although the radar chart visualization seems to have similar behavior for complex model instances, the probability is much lower and is not significantly different from the probability of solving easy exercises. Finally, the tabular visualization of the Pareto front seems not to be affected by the complexity, because its probability remains high for easy and complex model instances (around 0.73 of probability to correctly solve the exercise).
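The link between a logit-scale estimate and an odds ratio (OR), as used for the Table 3 results, is a simple exponentiation. The sketch below applies it to the two estimates quoted in the text. How the -1.4264 estimate maps onto the "almost 4 times" statement is our reading, since the surrounding passage is incomplete, and the per-second unit of the time covariate is an assumption.

```python
import math


def odds_ratio(estimate: float) -> float:
    """Convert a logit-scale coefficient into an odds ratio."""
    return math.exp(estimate)


# Estimates quoted in the text.
contrast = -1.4264     # estimate reported for Table 3
time_effect = -0.0070  # time covariate (assumed to be per second)

# Reversing the sign of the contrast gives the odds ratio in the opposite direction.
print(f"OR for the reversed contrast: {odds_ratio(-contrast):.2f}")
print(f"OR per additional second:     {odds_ratio(time_effect):.4f}")
print(f"OR per additional minute:     {odds_ratio(60 * time_effect):.2f}")
```

Note that an odds ratio compares odds, p / (1 - p), rather than probabilities directly, so it should be read alongside the estimated probabilities in Figure 6.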

Analysis of time
The strategy for the analysis of time starts with the descriptive analysis of the time variable, followed by the analysis of the effect of type and complexity (fixed factors) on time. We consider all responses (effectiveness regarding time) and only correct responses (efficiency). Considering the effectiveness regarding time (see Table 4), the i* graphical notation obtained the worst results, both for easy and complex models. On the other hand, the radar chart overall performs best regarding time (lowest time), yet for easy models the tabular visualization obtains similar results (the difference between tabular and radar chart visualization for easy models is not statistically significant, even though the mean for the former is slightly lower). Interestingly, the complexity of models does not seem to significantly affect the time needed to solve exercises for the i* graphical notation and the radar chart visualization (only minor differences); for the tabular visualization, on the other hand, the time spent seems to increase significantly as models become more complex.
Finally, as a caveat, we must mention that there is high variability in the time to solve the exercises for all visualizations (see standard deviation in Table 4). This variability is higher for the i* graphical notation than for the tabular and radar chart visualizations. The variability in time was mainly produced by the time required to solve the exercises using the i* graphical notation (see Table 4 and Figures 7 and 8). The lack of normality and homoscedasticity (large dispersion for score=0 and 1, and more concentrated times for score=2) discourages the use of parametric tests.
However, ANOVA is robust to the lack of normality, even in the presence of heteroscedasticity (Lix et al. 1996), when applied to balanced datasets where the sample size is large enough (n>30), although it is recommended to reduce the significance level to 0.01. The alternative tests do not address the main problem, namely the inequality of variances. Moreover, the heterogeneity can result in a lack of effect detection, while it is important to deal with heterogeneity observed over time, and explain its effects. Given these restrictions, the significance level was set to 0.01 and we proceeded with the multifactorial ANOVA analysis.
The ANOVA results (Table 5) hint at an interaction effect between type and complexity (F2,377=5.005; p=0.00716); hence, H06 can be rejected. The lines of the interaction plot (Fig. 9) show different trends in time by type and complexity levels. The more complex the model, the more time is required to solve the exercise when using the Pareto table (see Fig. 9). When using the other notations, complexity seems to have no effect on the time spent solving the exercise. [...] 9b). In addition, for any score, the i* graphical notation requires more time to solve the same exercise than any other visualization (Fig. 9). Based on the results of the Tukey test, we can thus reject H08: subjects using the i* graphical notation spent significantly more time (p<0.00001) than those using the radar chart or the Pareto table, and thus H09 cannot be rejected. The interaction between type and complexity has a significant effect on time (F2,377=5.005; p=0.00716) (H010 is rejected). Figure 9 shows the time spent considering all data (Fig. 9a) and by score (Fig. 9b, c and d).

Analysis of subject learning style preference
As stated, we also performed a Felder test (Felder and Silverman, 1988) to assess whether there is any evidence of a relationship between the preferred learning style of the subjects (visual, textual or balanced), by complexity, and the time spent to solve the exercises. Table 6 contains the total number of correct versus incorrect or partially correct answers by subject learning style preference. We note that not all the subjects performed the Felder test and not all the participants answered all the questions. Note also that, for our experiment, the Felder test did not reveal any subjects with a "textual" learning style preference, only "visual" and "balanced".
Considering the ratio of visual to balanced subject style preference, both for correct and for incorrect or partially correct answers (Table 6), we observe that for the i* graphical notation and the radar chart visualization the ratio is close to 1 in both cases, while for the tabular visualization the ratio differs slightly between correct answers (ratio=1.2) and incorrect or partially correct answers (ratio=0.7). The interaction plot in Figure 10 shows similar slopes for visual and balanced subject learning style preference for any type and score. Thus, the subject learning style preference does not seem to have any effect on the efficiency.

Analysis of experience
We analyzed effectiveness regarding time and efficiency based on the subjects' experience. The subjects' experience was not balanced between the beginner and expert levels, mainly because few experts were present among the students, and almost all were beginners. Nevertheless, using the i* graphical notation requires more time to interpret model instances and formulate an answer for beginners than for experts (p-value=4.09e-14) (H014 can be rejected). Radar chart and tabular visualizations of the Pareto front have a similar effect on time for experts and beginners (Fig. 11A). Assessing the efficiency, the ANOVA test did not find significant differences for any source of variation. Also, the trend to increase the time to correctly solve the exercise observed for Pareto [...] There is a high number of outliers for the beginners (Fig. 12). This means that some beginners take a lot of time to solve some questions. Most of those outliers come from the i* graphical notation. This is different from the experts' behavior, where the maximum (correct) solving time was lower than 200 seconds. The GLM model (Poisson family) did not find any significant difference in the count of correct answers by experience or type of visualization. The variability in effectiveness regarding correctness was high across all levels (Fig. 13), masking any behavior (the relatively lower effectiveness regarding correctness of experts was not significant).

Threats to Validity
In this section, we analyze the threats to validity related to the experiment. We can categorize them as internal, external, construct and conclusion validity threats.

Internal Validity
In this type of experiment, the main internal validity threat is the learning effect.
This effect appears when experiments are performed consecutively and subjects learn how to improve their results because they always solve the same problem, or always start with the same type of visualization and complexity. In this work, we deal with this effect by using a different assignment for each exercise (one for each combination of type and complexity) and, in addition, by randomly distributing the order in which the exercises are solved. Furthermore, subjects were asked to optimize different NFRs, with similar difficulty for each assignment (within easy/complex).
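The random distribution of the solving sequence can be sketched as follows. The text does not describe the actual randomization procedure, so the per-subject shuffle and the use of the subject identifier as a seed are illustrative assumptions only.

```python
import random
from itertools import product

TYPES = ["i* diagram", "tabular", "radar chart"]
COMPLEXITIES = ["easy", "complex"]


def exercise_order(subject_id: int) -> list:
    """Return the six type/complexity exercises in a random order for one subject.

    Seeding with the subject id is an illustrative choice that makes each
    subject's order reproducible.
    """
    exercises = list(product(TYPES, COMPLEXITIES))
    rng = random.Random(subject_id)
    rng.shuffle(exercises)
    return exercises


for subject in range(1, 4):
    print(subject, exercise_order(subject))
```

Every subject still solves all six combinations; only the order varies, so that no type or complexity level systematically benefits from being solved last.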
To prevent communication between subjects, which could distort the results, subjects performed all experiments in a single session under supervision, which ensured that there was no communication between them.

External Validity
As described in the experiment's methodology, all the subjects were students from a master's program in computing engineering, with the same short training in the modelling instruments used in the experiments. Hence, the results can be generalized to all graduates. No modelling experience was required. The short training period establishes a common basis suitable for the analysis.
The main external validity threat is the relatively small size of the model instances used in the exercises compared to business scenarios, which can result in more complex model instances. However, the complexity level was restricted by the duration of the experiments, which was limited to a maximum of two hours, not days as in real-world cases. Nevertheless, the complex level was always selected with a sufficient level of difficulty to represent partial cases of real-world scenarios. In addition, to deal with this external validity threat and to allow detecting the differences in effectiveness or accuracy related to complexity, complexity was taken as a factor in the analysis.

Construct Validity
The validity threats related to the design mainly affect discrete or categorized variables. The relation between factors when considering discrete variables can be analyzed with a Chi-square test. However, it is known that this analysis cannot deal with interaction and/or multifactorial analysis, and that it rejects the null hypothesis increasingly often in some circumstances (Bull et al. 1992). To deal with this validity threat, a logistic regression was used to explain relationships and possible hidden effects behind multifactorial models on qualitative variables.
Construct validity threats that may be present in this experiment, i.e., interactions between different treatments, were mitigated by a proper design that allowed separating the analysis of the different factors and their interactions. In particular, in each set of exercises, subjects worked on two different model instances to avoid learning effects and to ensure that the differences in instances' complexity would not bias the results (we selected two data instances of comparable complexity at each level, as described in section 4.2.2 Objects).
An important issue is the diversity of the subjects. They were selected from master's students in software engineering, researchers and PhD students from three universities. These subjects share an interest in acquiring more specialized knowledge on the topic of the experiment. Nevertheless, they come from different universities and have different educational backgrounds and business experience. This diversity represents the real-world scenario in software engineering (Briand et al. 2005).

Conclusion Validity
We have presented the results classified by four aspects: effectiveness, efficiency, preference and experience of subjects. We have used the appropriate tests considering the analytical strategy. We presented the suitable model (and its p-value) for testing differences between means for quantitative dependent variables, and, in the case of qualitative dependent variables, we used a proper analytical tool (GLMM) for the experiment. We checked the validity of the model instances and the assumptions required in each statistical analysis. In addition, a descriptive analysis preceded each statistical one. The figures included in this work were selected to improve the comprehension of the experiment results related to each analysis. They thereby help in demonstrating and reaching a conclusion for each aspect.

Discussion
It is well known that research in a particular field passes through several phases, depending on its maturity (we consider the classification scheme suggested by Wieringa et al. (2006)). Our results shed light on some important issues. Regarding the original i* graphical notation, with respect to selecting functional requirements while balancing and optimizing non-functional requirements:
1. Considering the effectiveness regarding correctness, we have observed a large number of wrong answers. Additionally, for complex model instances, the number of answers in general, and the number of correct answers in particular, decreases.
2. Considering the effectiveness regarding time and efficiency, in general, using the i* graphical notation takes more time to obtain an answer (and also produces more extreme values). Surprisingly, the average time spent decreases as models become more complex. We can explain this unexpected result by considering that mostly experts answered the complex model exercises, while beginners, who are generally slower, were not always able to complete the complex exercises. We observed that some beginners take a lot of time to solve some exercises of the experiment.
Both the above observations discourage the use of i* graphical notation in real world cases: these models tend to be rather complex, whereby fewer modelers are able to formulate an answer; modelers on average spend more time formulating an answer; and there generally is a larger amount of incorrect answers.
Regarding the tabular and radar chart visualizations:
1. Considering the effectiveness regarding correctness, complexity does not seem to affect our tabular and radar chart visualizations, as each (separately) obtains similar numbers of correct answers for easy and complex model instances. However, the tabular visualization got significantly more correct answers than the radar chart.
2. Considering the effectiveness regarding time, the tabular and radar chart visualizations emerge as the most efficient approaches. The tabular visualization seems to be slightly affected in its efficiency on complex model instances, but the post-hoc test was not able to detect that trend.
Nevertheless, for our experiment, we emphasize that, compared to the i* graphical notation, modelers using the radar chart perform ~30% faster (around 30 seconds) on average for complex models, and modelers using the tabular visualization around 10% faster.
Bearing in mind the above facts, on complex model instances, which are especially relevant as real-world cases are indeed complex by nature, the visualization with the best effectiveness regarding correctness was the tabular visualization. We can conclude, then, that regarding correctness, the original i* graphical notation is not suitable for complex model instances. Nevertheless, although the tabular visualization is the one that obtains the most correct answers, it performs worse regarding time on complex model instances.
An important observation in this experiment is the high variability of the i* graphical notation results across all the variables (time and correctness). This variability denotes a large number of outliers: participants spending much more or less time than the average, and a large spread between correct and incorrect answers. Together with the sharp drop in correct answers on complex models, it suggests a lack of confidence among users of the i* graphical notation, and a larger possibility that a particular modeler produces a bad result (in time and/or correctness). We also could not find an improvement with experience: even with experience in the i* graphical notation, the results are still highly variable. In other words, training in i* does not remedy the problem of high variability.
Clearly, the high variability detected for the i* graphical notation, and the consequences it implies, is an undesirable property in a real-world context (i.e., in industry). On the other hand, with the Pareto-based visualizations the results were more consistent, showing less variability, particularly in correctness. Furthermore, the Pareto front based alternatives tend to perform better even without any previous experience.

Conclusions
Requirements Engineering (RE) methods and frameworks were developed to understand, elaborate, reason about and document requirements. RE methods feature graphical notations (visualizations), which were primarily designed for ease of model construction and interpretability, and model readability. In this article, we argue that different visualizations for different purposes in the RE process might be useful. In particular, we focus on a crucial step in RE: optimizing NFRs when selecting FRs to implement. For this purpose, we chose the goal-oriented RE approach i*, and compared its original graphical notation with two visualizations of our custom-developed i* post-processed Pareto front model. [...] performs better (or worse). From this work, we can conclude that, when selecting FRs to implement while optimizing NFRs, the original i* graphical notation is the least adequate visualization method. It performs worse regarding correctness, and scales worse when dealing with complex model instances, compared to the Pareto efficiency-based radar chart and tabular visualizations. The latter two have a higher probability than the i* graphical notation of obtaining a correct configuration of FRs while balancing NFRs, while spending less time. Among them, and for complex model instances, the tabular visualization resulted in more correct answers than the radar chart visualization.
When considering the time required to solve an exercise, the radar chart visualization scales better with complexity compared to the tabular visualization.
According to our experiments, subjects perform better when using our Pareto-efficiency-based visualizations (tabular and radar chart) in efficiency and effectiveness regarding time. Moreover, there is no difference with regard to the visualization preference of the subjects, nor to their previous experience, when considering the time spent to solve an exercise.
As future work, we will build on these results to improve the visualization mechanisms of our approach, with the aim of increasing the efficiency and effectiveness of NFR optimization. We would also like to replicate these experiments to gain further insights.