Comparing candidates’ beliefs and exam performance in speaking tests

The development of a language exam is not a linear process but rather a round cycle in which, by using the test, we obtain information that will in turn be applied to improve each of the steps in the cycle. The goal of our study was to analyse students’ beliefs about their performance in the speaking section of a language proficiency exam and compare them with their actual results in the exam, in order to determine whether their beliefs were based on their actual level of competence or if they were based on other factors, such as anxiety or stress due to the particular characteristics of this section of the exam.


I. INTRODUCTION
The development of a language test is a process that involves several stages, from designing the test and endowing it with adequate contents, to administering the test and analysing the results obtained. However, developing a test is not a linear process but instead a round cycle in which, by using the test, we obtain information that will in turn be applied to improve each of the steps in the cycle.
The goal of our study was to analyse students' beliefs about their performance in the speaking section of a language proficiency exam in relation to other skills, and compare them with their actual results in the exam in order to determine whether their beliefs were based on their actual level of competence or if they were based on other factors, such as anxiety or stress due to the particular characteristics of this section of the exam.
Since the examination of reliability depends upon our ability to distinguish the effects (on test scores) of the abilities we want to measure from the effects of other factors (Bachman 1990, p. 163), being able to differentiate external factors from the actual level of competence of the candidate would necessarily improve test reliability.
Consequently, determining the basis for their beliefs, either factual or self-perceived, would allow us to determine which aspects of the process could be modified to improve the reliability and thus the quality of our exam.

II. STATE OF THE ART
The process of foreign language acquisition has been examined from different points of view (cognitive, psychological, linguistic, pragmatic and cultural, to mention just a few), and the exact nature of the process is still unknown.

Traditional language learning theories focus mainly on the study of what is learned and what is not learned in a language, explaining both processes by means of the strategies used to acquire knowledge and the reasons for success or failure in acquisition. This approach focuses on learning itself, on specific objectives and on the means used to achieve them and the results obtained. However, it fails to pay attention to the factors surrounding this process, the factors that add complexity to the process and that include not only objective components, but also subjective or external components, which will largely contribute to the final outcome. The starting point is therefore to consider language learning as a broad field in which external factors play an important role and, amongst them, those characteristics that are individual to each student and make their learning unique.
When the process of second language acquisition is examined from this broad point of view, a number of subjective variables emerge that belong to the students' individual field and that have a significant effect on their learning. This explains the different degrees of success in learning a language achieved by different subjects who follow the same programme and have a comparable intellectual ability. However, it is worth mentioning here that, although social and affective strategies are mentioned in the literature (Dörnyei 1994; Gardner and Lambert 1959; Hardison 2014; Horwitz 1995; Jee 2014; Sparks et al. 2011), many of the authors reviewed (Bachman 1990; Bachman and Palmer 1996; Cohen 2003; O'Malley and Chamot 1990) focus primarily on cognitive and [...] However, some studies (Baddeley 2007; Carroll and Sapon 1959; Conway et al. 2007; Pimsleur 1966) attribute this difference in the degree of success to the students' ability, that is, to the cognitive aspect, leaving aside the emotional aspect, i.e. their attitude, motivation and beliefs about their own learning process. Yet "if we were to devise theories of second language acquisition or teaching methodologies that were based only on cognitive considerations, we would be omitting the most fundamental side of human behaviour" (Brown 2000: 142). The affective domain is difficult to describe scientifically since it refers to emotion or feeling, yet the emotional side of human behaviour is intrinsically related to the cognitive side and needs to be taken into consideration.
Sustained by developments in the field of foreign language teaching towards student-centred learning, the study of these factors has become increasingly important together with, more specifically, the study of how students' perceptions and beliefs are a fundamental aspect of their path to learning a language. In fact, Foreign Language Anxiety (FLA) has been defined as a particular type of anxiety occurring specifically in foreign language learning situations, "a distinct complex construct of self-perceptions, beliefs, feelings and behaviours related to classroom language learning arising from the uniqueness of the language learning process", "a phenomenon related to but distinguishable from other specific anxieties" (Horwitz and Cope 1986: 128-129). It is made up of three principal components: (a) communication apprehension, (b) test anxiety, and (c) fear of negative evaluation.
Motivation has also been considered influential in the degree of success of foreign language students, and this includes both instrumental motivation, the desire to obtain something from studying a second language, and integrative motivation, the desire to integrate into the culture of the second language (Gardner and Lambert 1959). In fact, although both types of motivation contribute to second language learning success, the most successful students are those who are interested in the culture of origin and in native speakers and have a desire to integrate into the society in which the language is used (Falk 1978).
Bachman and Palmer (1996) observed two types of variability in students' performance in language tests: (1) variability due to differences between individuals in terms of the language skills, strategies and processes used, as well as personal characteristics such as cultural and emotional differences, etc., and (2) variability due to the different characteristics of the method or tasks used in the test, such as the assessment modes or types of tasks used. According to Dörnyei (2009), individual differences should be considered as higher level amalgams or constellations of cognition, affect and motivation that act as "wholes".
As a consequence of this approach, the subjective variables of the acquisition process mentioned above also play an important role in language testing, since the design of a test needs to take into consideration not only the characteristics of the tasks (test format, input provided, time allotted) but also the individual characteristics of the users (the positive or negative emotions or feelings they may have about their learning process, the examiner, the subject or context, the presence or absence of excessive anxiety when faced with the task, their motivation, etc.). Accordingly, two aspects need to be taken into consideration simultaneously: (1) the characteristics of the task, which need to reflect the construct of the test and mirror target language use, and (2) the individual characteristics of the learner, which will affect their learning process and therefore their performance in a test situation.
As can be seen from the aforementioned arguments, although it would be desirable that the primary factor in the outcome of a language test were the ability of the test-taker or the adequacy of the test construct and structure of the tasks, in actual fact there are many other variables coming into play, ranging from the context to the individual characteristics of each test-taker.
Such factors become even more relevant in assessing speaking, since speaking in a foreign language is perhaps the most difficult skill to master as it involves a complex process of constructing meaning (Celce-Murcia and Olshtain 2000) which is performed at the same time as the act of speaking and therefore requires the planning and simultaneous monitoring of utterances. In fact, the ability to express oneself orally in a foreign language is a fundamental part of mastering language use and, as mentioned by Luoma (2004), reflects not only our personality but also our self-image and ability to reason:
"Speaking is also the most difficult language skill to assess reliably. A person's speaking ability is usually judged during a face-to-face interaction, in real time, between an interlocutor and a candidate. The assessor has to make instantaneous judgements about a range of aspects of what is being said, as it is being said. This means that the assessment might depend not only upon which particular features of speech (e.g. pronunciation, accuracy, fluency) the interlocutor pays attention to at any point in time, but upon a host of other factors such as the language level, gender, and status of the interlocutor and the personal characteristics of the interlocutor and candidate." (Luoma 2004: ix)
Furthermore, speaking a language is especially difficult for foreign language learners because effective oral communication requires the ability to use the language appropriately in social interactions (Fulcher 2003).
Traditionally, most students sitting official exams show high levels of stress when dealing with the speaking section of the test and explain their reaction by expressing their doubts about their own speaking ability (Phillips 1992; Stephenson and Hewitt 2001). However, and in light of the above, we believe that the fact that the speaking section of the test causes more stress in students is, in many cases, not because of their ability or lack of it, but because of the construct of speaking mentioned above. As Bandura (1997: 37) states: "perceived self-efficacy is not a measure of the skills that one possesses, rather it is a belief about what one can do in the future, and under different conditions, with the skills that one has". Consequently, perceived self-efficacy, the extent of one's belief in one's own ability to reach goals, will probably influence the way people will react in the face of difficulties and, therefore, a more positive perception of one's skills will influence performance in the real world. Real performance in the real world is in turn what performance in an exam situation should be expected to mirror and what exam tasks and context should be expected to elicit.
It is interesting to note that the higher the level of language exam, the higher the participation of candidates: 55% of B1 candidates participated in the study, compared to 58% of B2 candidates and 62% of C1 candidates.
CertACLES exams are proficiency exams developed by the Language Centre of the UPV in accordance with the model developed by the Spanish Association of Language Centres in Higher Education (ACLES 2011a, b). ACLES introduced a model for a language examination, the CertACLES model (ACLES 2011b), that would be followed by all higher education institutions belonging to the organisation and that was intended to allow for the assessment of communicative competence with a standard and comparable framework which all member institutions needed to adhere to. This framework is solid enough to provide for a standard tool for measuring language ability while allowing each individual university to adjust their exam to meet the needs of their environment. Each university is therefore in charge of designing its own individual exams, which have to comply with the framework but have to take into consideration each particular context, not only in terms of test construct and specifications, but also in terms of administration dates and frequencies. CertACLES exams measure the four skills (reading, writing, listening and speaking) and give equal weight to each section. The profile of the candidates was expected to consist mainly of a student population, although students and staff from other universities in the area that do not offer their own language proficiency tests were also expected to take part. To further specify the

III.1.2. Education
The results were as expected given the type of examination and the examining body (a higher education institution): the large majority of participants hold a university degree, and 15% of them have doctoral degrees. [...] (Gardner and Lambert 1959), and would thus help predict their degree of success. As we can see from the results, illustrated in Table 3, in 75% of the cases the motivation for taking the exam was instrumental. Only 25% of the candidates showed an integrative motivation and stated that the reason for taking the exam was personal satisfaction, which was assumed to mean travelling to other countries and meeting native-speaking people as well as learning the culture of native-speaking countries.
Other 10 5%

III.2. Materials
The main goal of our study, as stated in our introduction, was to analyse students' beliefs about their performance in the speaking section of a language proficiency exam and compare them with their actual results in the exam, in order to determine whether their beliefs were based on their actual level of competence or if they were based on other factors, such as anxiety or stress due to the particular characteristics of this section of the exam. In order to do this, we needed to examine, on the one hand, their feelings with respect to the different sections of the exam in terms of perceived difficulty and candidate anxiety and, on the other hand, the results obtained by the candidates in the actual examination. By looking at the results obtained in the speaking section of the [...] Accordingly, our study would initially be divided into two different steps: (1) analysing students' perceptions of their performance in the exam, and (2) analysing students' actual results in the exam.

III.2.1. Analysing students' perceptions of their performance in the exam:
For the sake of practicality, we decided to create our survey with Google Forms, the free survey tool provided by Google. This tool allowed us to design a relatively simple survey with automatic data processing and charting, and with an Excel-compatible table to allow further modification or alternative processing of the data obtained. Likewise, the system allowed the creation of a link to the survey that could be sent to the students' email addresses from the Language Centre's email account. The fact that the tool involved no additional costs and that it was user-friendly, requiring only a few minutes to start using it, was also a key factor in our decision. Google Forms requires the individual who is designing and administering the survey to have a Gmail account. This email account does not need to be the one used to send the survey to the participants, which was a concern for us since we did not want to use an account not belonging to the university, but it will be visible in the link sent and therefore needed to appear reputable. To achieve this, we set up a Gmail account for the Language Centre in which not only the name of the Language Centre was specified, but also the initials of the university, to make the sender easily identifiable.
Before designing our survey, we had to take into consideration the characteristics of the [...] Checkboxes: controlled answers where users select as many options as they like; Choose from a list: controlled answers in which users select one option from a dropdown menu; Scale: controlled answer in which users rank something on a scale of numbers; Grid: controlled answer in which users select a point from a two-dimensional grid; Date: controlled answer in which users pick a date on a calendar; Time: controlled answer in which users select a time of day or a length of time. Our initial intention was to use either the Text format or the Paragraph Text format, preferring the short-answer questions for the sake of conciseness. However, we also wanted to favour easy processing of the information and we realised that using this type of format would not allow the data to be processed automatically. In the end, and after much consideration, we decided to use a Multiple Choice format since it limited the respondents' production and allowed for easier processing by automatically generating charts and summaries of results. In fact, Google Forms can be connected to spreadsheets in Google Sheets, and if a spreadsheet is linked to the form, responses are automatically sent to the spreadsheet, from where the information is taken, automatically summarised and presented in a summary of results. For those questions in which we intended to measure a level (level of difficulty, anxiety, etc.), we used the Scale format, since the processing of results was similar to that of the Multiple Choice format.

III.2.2. Analysing students' actual results in the exam:
The candidates' marks that were analysed belonged to the speaking exam of the CertACLES Certification paper administered in July 2014. This exam aims to evaluate the communicative competence of the candidates; its contents, construct and marking criteria are based on the CEFR descriptors. To that end, the exam evaluates the four main communicative macro-skills, i.e. speaking, listening, writing and reading, each with a specific weight of 25% of the total score of the exam.
A candidate is considered to have reached the corresponding language level if the final mark is equal to or higher than 60% of the total possible points, provided that a [...] awarded on a scale of 0 to 10 points (100%) expressed to one decimal place:
• Between 6.0 and 6.9 points (60%-69% of total marks possible) = PASS
• Between 7.0 and 8.9 points (70%-89% of total marks possible) = MERIT
• Between 9.0 and 10 points (90%-100% of total marks possible) = DISTINCTION
The speaking test is conducted by two oral examiners, an interlocutor and an assessor, with paired candidates. The reason for choosing this task format had to do with our goal of mirroring real-life communication, while minimising anxiety and tension in the candidate. As Heaton (1988) states, interviews are adequate attempts to assess oral skills but students are not placed in "natural" speech situations and they are therefore subject to psychological tensions which will necessarily affect their performances.
CertACLES exams attempt to minimise this effect by having both an interlocutor and an assessor present in the interview, which allows the interlocutor to focus on candidates while they speak and avoids the interruptions that would occur if the interlocutor had to take notes.
In this way the interlocutor is responsible for conducting the interview and for giving a global impression of the overall communicative ability of the candidate, but it is the assessor who is responsible for providing an analytical assessment of each candidate's performance. The assessor does not take part in the interaction with the candidates and is thus able to apply a detailed analytical scale with four criteria: (1) grammar, which refers to appropriate use of grammatical forms; (2) vocabulary, which measures the accuracy and the use of lexical forms; (3) discourse management, which focuses on relevant discourse and coherence; and (4) pronunciation and interactive communication, where the focus is not only on the ability to be understood but also on the candidate's ability to take an active part in the development of the discourse. Moreover, having two candidates taking the exam together allows for equal interaction where there is no power relationship (interlocutor/candidate), but instead a conversation between two members of the same peer group.
Since the exam aims to obtain different types of oral production in a single interview, the interview is divided into three parts:
Part One. Conversation between the interlocutor and each candidate. There is a set of standard questions on personal details and preferences grouped by topic (country of [...]
Part Two. Simple standardised rubric with minimal language input. The candidates are each given one or two photographs (B1 candidates have one picture to describe and B2 and C1 candidates are given two pictures to allow them to use more complex vocabulary for comparison and contrast). The objective is therefore to compare and contrast during an individual long turn. After each candidate has spoken, their partner is asked one question related to the topic.
Part Three. Conversation between candidates. The interlocutor gives some pictures to the candidates. They are asked to speak for a set amount of time and justify their opinions, speculate, express preferences and draw conclusions within the target language use defined for each level of examination. At the end of the interaction the interlocutor may ask the candidates further questions on the topic.

III.3.1. Analysing students' perceptions of their performance in the exam:
A survey was designed with multiple choice questions on the candidates' profile and their opinion on the difficulty of the different sections (from 0 to 5, 0 being the easiest and 5 being the most difficult). Once the survey had been designed and implemented in Google Forms, we generated the link and sent it out to all the candidates participating in the June 2014 exam sessions (B1, B2 and C1 candidates). The email was sent after they had taken the examination so that their opinions were based on the same exam from which their marks were going to be analysed in our second step. Moreover, and to avoid bias in their responses, the link was sent before results were published and a deadline was established for the collection of responses, no responses being accepted if received after the publication of exam results.
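As a rough illustration of how the scale responses could be summarised once exported from the survey tool, the following Python sketch tallies the 0 to 5 difficulty ratings for one section and computes the percentage of respondents who chose a given rating or higher. The function names and the sample ratings are hypothetical and are not the data of this study; the automatic summaries generated by Google Forms would serve the same purpose.

```python
from collections import Counter


def tally_ratings(ratings):
    """Count how many respondents chose each point on the 0-5 difficulty scale."""
    for r in ratings:
        if r not in range(6):
            raise ValueError(f"rating out of range: {r!r}")
    counts = Counter(ratings)
    # Report every scale point, including those nobody selected.
    return {point: counts.get(point, 0) for point in range(6)}


def share_at_or_above(ratings, threshold):
    """Return the percentage of respondents whose rating is >= threshold."""
    if not ratings:
        raise ValueError("no responses to summarise")
    hits = sum(1 for r in ratings if r >= threshold)
    return 100 * hits / len(ratings)


# Made-up ratings for one exam section, for illustration only.
sample = [3, 4, 5, 2, 4, 3, 1, 5, 3, 4]
print(tally_ratings(sample))
print(share_at_or_above(sample, 4))
```

Applied to each section in turn, a summary of this kind would yield figures comparable across reading, listening, writing and speaking.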

III.3.2. Analysing students' actual results in the exam:
An Excel spreadsheet was designed to enter the candidates' results for the different parts of the exam, that is, listening, reading, writing and speaking. This would facilitate the analysis of the results, and allow for an analysis of the weaknesses and strengths of the different candidates.
A spreadsheet was designed for each of the examinations (B1, B2, C1) and the structure was as follows: Candidate number is the number assigned to each candidate for easier identification; ID, Name, Surname, are fields needed to issue the official accreditation certificate; listening mark, reading mark, writing mark, speaking mark are individual marks per skill, and overall mark is the mark obtained from the weighting of the different skills. Finally, a register number is provided for the certificate issued.
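The structure of one spreadsheet row, together with the equal weighting of the four skills and the grade bands described earlier, can be sketched in Python as follows. This is only an illustrative sketch: the class and function names are hypothetical, the identification fields (ID, name, surname, register number) are left out for brevity, the label used for marks below 6.0 is our own placeholder, and any additional condition for passing beyond the 60% overall mark is not modelled here.

```python
from dataclasses import dataclass

SKILL_WEIGHT = 0.25  # each of the four skills carries 25% of the total score


@dataclass
class CandidateResult:
    candidate_number: int
    listening: float
    reading: float
    writing: float
    speaking: float

    def overall(self) -> float:
        """Weighted overall mark on the 0-10 scale, to one decimal place."""
        total = SKILL_WEIGHT * (
            self.listening + self.reading + self.writing + self.speaking
        )
        return round(total, 1)


def grade_band(overall_mark: float) -> str:
    """Map an overall mark (0-10) to the band given in the grading scale."""
    if not 0 <= overall_mark <= 10:
        raise ValueError("mark must be between 0 and 10")
    if overall_mark >= 9.0:
        return "DISTINCTION"
    if overall_mark >= 7.0:
        return "MERIT"
    if overall_mark >= 6.0:
        return "PASS"
    return "BELOW PASS"  # placeholder label, not taken from the exam model


# Hypothetical candidate, for illustration only.
result = CandidateResult(1, listening=6.0, reading=7.0, writing=5.5, speaking=7.5)
print(result.overall(), grade_band(result.overall()))
```

Keeping the marks per skill alongside the overall mark is what makes it possible to look at the strengths and weaknesses of each candidate separately.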

IV. RESULTS
After collecting the data from the survey and analysing the results obtained by the students in the different parts of the exam that they had rated as regards difficulty, the results were as follows.
As we can see in Figure 4, for the speaking section, most of the candidates gave a rank of 3 or higher, indicating higher difficulty; in fact, 45% of the candidates ranked the level of difficulty of the speaking section as 4 or 5.
As for writing, as shown in Figure 7, it was considered a medium-difficulty section, with only 30% of the candidates ranking its difficulty above 3.
In light of these results, candidates considered the listening section to be the most difficult, closely followed by the speaking section, with the writing and reading sections far behind in terms of perceived difficulty. This contradicted our initial expectations: judging by their reactions in the classroom and their reluctance to complete speaking and writing tasks, our students usually express more anxiety towards productive skills, and there is a higher demand for writing and speaking preparation courses, whereas reading and listening are not specifically prepared by students but are learnt or practised in general English courses.
As for the candidates' marks in the examination, which are translated in the table below, the results obtained were as follows. As we can see in Figure 8, the vast majority of the candidates (71%) obtained a mark that allowed them to pass the exam.
To further illustrate Table 10, in Table 5 we can see the number of candidates with one or more failed skills and a specification of the skills failed. It can also be observed how the results further indicate that candidates predominantly fail because of the writing paper, particularly at higher levels. The lower figures at C1 are due to the small number of candidates who failed (only 5 candidates failed the C1 examination).
S stands for candidates failing the speaking paper in the three levels
R stands for candidates failing the reading paper in the three levels
W stands for candidates failing the writing paper in the three levels
L stands for candidates failing the listening paper in the three levels
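A count of failed skills using the codes above could be obtained along the following lines. This Python sketch is illustrative only: the function names and the sample marks are hypothetical, and because the pass mark applied to each individual skill is not restated here, it is passed in as an explicit parameter (the value 6.0 in the example is an assumption).

```python
from collections import Counter

# Codes from the legend above mapped to the skill they stand for.
SKILLS = {"S": "speaking", "R": "reading", "W": "writing", "L": "listening"}


def failed_skills(marks, pass_mark):
    """Return the codes (S, R, W, L) of the skills whose mark is below pass_mark."""
    return [code for code, skill in SKILLS.items() if marks[skill] < pass_mark]


def tally_failures(candidates, pass_mark):
    """Count, per skill code, how many candidates failed that skill."""
    counts = Counter({code: 0 for code in SKILLS})
    for marks in candidates:
        counts.update(failed_skills(marks, pass_mark))
    return dict(counts)


# Made-up marks for two candidates, for illustration only.
sample = [
    {"speaking": 7.0, "reading": 5.0, "writing": 4.5, "listening": 6.0},
    {"speaking": 5.5, "reading": 8.0, "writing": 5.0, "listening": 7.0},
]
print(tally_failures(sample, pass_mark=6.0))  # pass mark assumed for the example
```

Running the tally separately for each level (B1, B2, C1) would give one row per level, with a candidate who fails several skills counted once under each of them.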

V. CONCLUSIONS
The results of our study show a mismatch between self-perceived efficacy and actual performance, particularly for higher levels of proficiency. In fact, although according to our survey most candidates ranked speaking as the second most difficult skill, the results of the exam show that the number of candidates who fail the exam because of the speaking section is comparatively lower. This is not the case, however, for B1 candidates, who showed a more accurate level of self-perceived efficacy, as seen in the results in Table 10. This is in line with Bandura's (1997) statement about the perception of self-efficacy, in that candidates produce less accurate assessments as they progress through higher levels of language study.
However, as stated in our results, candidates' perceived efficacy is accurate in the case of reading and listening, since they are perceived as the easiest and the most difficult sections respectively, and the candidates' marks reflect this. The greatest mismatch is therefore in the productive skills, since writing is considered a medium-difficulty skill and it is in fact the skill in which the candidates' performance is ranked lowest. Speaking is perceived as a high-difficulty skill but this difficulty is not reflected in the candidates' results, as few of them fail because of the speaking section. Therefore, candidates' perception of their efficacy in the speaking section does not seem to correspond to their actual ability, which would indicate that their perceptions are indeed influenced by factors that are external to their actual performance. FLA comes into play and, consequently, modifications in the test process should be arranged to reduce anxiety for candidates. Some of the modifications suggested would be the following:
-Organising exams with paired interviews whenever possible in order to avoid relationships of power with the examiner and thus reduce stress.
-Facilitate the presence of an assessor whenever possible in order to allow the examiner to act only as the interlocutor.
-Individual arrangements for candidates to facilitate schedules and allow them to choose the time of day at which they would feel more comfortable taking the test.
-Flexible examination dates, to eliminate stress in candidates who have conflicting commitments (academic, professional, family-related, etc.).

Cristina Pérez-Guillot and Julia Zabala-Delgado
Language Value 7, 22-45 http://www.e-revistes.uji.es/languagevalue
-Preparation time and warm-up questions to allow them to feel more at ease with the topic, as well as get to know both the interviewer and the other candidate (in cases when the interview is paired).
-Additional prompts to facilitate discussion topics during the exam and prevent candidates from relying on their resourcefulness or imagination.
-Start and finish the interview on a positive note to improve confidence and self-image, which could then be mirrored in real-life performance.
We consider that the results of our study call for further research on factors outside the content of the exam, factors related to administration and organisation, as well as those related to individual characteristics of the candidate (personality, background, etc.), which will undoubtedly explain the difference between perceptions and actual results.

Notes
1 The survey was carried out in Spanish to allow all participants to fully understand the questions; the titles and legends in the graphs are therefore in Spanish. Under each graph there is a representation of the information in table format and translated into English.