Abstract (English)

Theoretical background: The English language is one of the most obvious characteristics of globalization, taught to non-native speakers worldwide with the aim of helping them achieve communicative competence (cf. Tulasi & Murthy, 2022). English for Specific Purposes (ESP) is only one branch of English as a Foreign Language or English as a Second Language, and it is seen as an approach that puts the learner’s needs at the centre of the teaching process (cf. Hutchinson & Waters, 1987; Robinson, 1991; Paltridge & Starfield, 2013). ESP developed at a different speed in different countries and in several stages (cf. Johns, 2013; Upton & Connor, 2013), and today it includes cooperation and collaboration between teachers and experts from the relevant domains, and even team teaching (Dudley-Evans & St. John, 1998). Language abilities in ESP have to be assessed, and language assessment “involves obtaining evidence to inform inferences about a person’s language-related knowledge, skills or abilities” (Green, 2014: 5). Assessment is thus seen in its narrower and its wider sense – in the former, it overlaps with testing (Kramer, 2013), as it includes standardized tests and/or scales/rubrics (Bagarić Medve & Škarica, 2023), portfolios, observations, etc. (McNamara, 2000; Council of Europe, 2001); in the latter, it overlaps with evaluation, as both include giving judgements on the learner’s language knowledge and proficiency, but evaluation goes even further and encompasses the monitoring of the development and progress of learning, as well as the goals and outcomes of teaching programmes (Council of Europe, 2001; Jelaska & Cvikić, 2008). Technology has been part of language assessment since the first half of the 20th century (cf. Fulcher, 2010; Brooks, 2017), but computers started being used in foreign language teaching in the 1980s, primarily in the USA, the UK and the Netherlands (Dunkel, 1999). Computer-based tests have many advantages in comparison with paper-based tests, but also some potential downsides (cf. Gruba & Corbel, 1997; Chalhoub-Deville, 2001; Roever, 2001; Chapelle & Douglas, 2006; Kramer, 2013; Chapelle & Voss, 2017), and the same is true for online or web-based tests, which proliferated with the more widespread use of the internet from the beginning of the 21st century onwards (cf. Dunkel, 1999; Chapelle, Jamieson & Hegelheimer, 2003; García Laborda, 2007; Lim & Kahng, 2012; Lesiak-Bielawska, 2015; Chapelle & Voss, 2016; Isbell & Kremmel, 2021; Li et al., 2021; Turnbull et al., 2021). Though the mode of test administration can differ, all tests have to follow certain development stages – planning (which includes writing test specifications), writing test items, trialling, validation, and post-testing activities, such as writing test handbooks, training staff and test maintenance (cf. Davies, 1984; McNamara, 2000; Udier & Jelaska, 2008; Fulcher, 2010; Green, 2014; Hughes & Hughes, 2020; Green & Fulcher, 2021). As every well-designed test needs to undergo the validation process, validity is one of the key qualities of any test. Even though validity is seen by some researchers as a unitary concept, which is also our standpoint, or as comprising different types (cf. 
Anastasi, 1976; Messick, 1980, 1988; Angoff, 1988; Cronbach, 1988; Douglas, 2001; Urbina, 2004; Davies & Elder, 2005; Weir, 2005; Fulcher & Davidson, 2007; Milas, 2009), it is never complete, but rather holds to the degree to which the gathered data support the intended interpretation of the test results (cf. Anastasi, 1976; Angoff, 1988; Messick, 1989; Urbina, 2004; Bachman, 2004; Weir, 2005; Fulcher & Davidson, 2007; Green, 2014; Green & Fulcher, 2021). During the process of validation, it is not enough to claim that a certain assessment is valid; these claims have to be supported by the corresponding backing with its warrants, and all the possible rebuttals have to be rejected as much as possible, using empirical evidence (cf. Messick, 1980; Davies, 1984; Angoff, 1988; Cronbach, 1988; Messick, 1989; Kane, 1992, 2010; Bachman & Palmer, 1996; Kane, Crooks & Cohen, 1999; McNamara, 2000; Chapelle, Jamieson & Hegelheimer, 2003; Toulmin, 2003; Bachman, 2004, 2005, 2015; Urbina, 2004; Davies & Elder, 2005, 2010; Weir, 2005; Chapelle & Douglas, 2006; McNamara & Roever, 2006; Milas, 2009; Bachman & Palmer, 2010; Shepard, 2016). A conceptual framework that allows for such validation is the Assessment Use Argument (AUA), developed by Bachman (2005, 2015) and Bachman & Palmer (2010), based on Kane's interpretative argument (Kane, 1992, 2001, 2004; Kane, Crooks & Cohen, 1999) and Toulmin’s argument structure (Toulmin, 2003). The AUA is divided into the Assessment Validity Argument (AVA) and the Assessment Utilization Argument – the former linking test-takers’ performance with the interpretations of their results, and the latter linking those interpretations with the possible decisions to be made based on them. There are numerous possible threats to validity (cf. Chapelle & Douglas, 2006), of which we single out test-takers’ attitudes and their level of test anxiety. Attitudes towards language, country, people, tests, teachers, learning, etc. can have an impact on test-takers’ performance (cf. Gardner, 1985; Eiser, 1986; Brunfaut & Clapham, 2013; Lasagabaster, 2013), as can the level of their test anxiety (cf. Alderson & Wall, 1993; Zeidner, 1998; Cassady & Johnson, 2002; Živčić-Bećirević, 2003; Juretić, 2008; Erceg Jugović & Lauri Korajlija, 2012). These threats have to be taken into consideration during any validation process. The AUA was used in several studies to analyse the validity of assessments used in different educational contexts (cf. Llosa, 2008; Wang et al., 2012; Long et al., 2018; Jun, 2021; Park, 2021), which proved beneficial for the present study in terms of comparison of results and the methodology applied. The other groups of selected studies that follow were also used for comparing results and as sources of some of the instruments, as described later in this summary. The second group of studies compared the validity of paper-based tests with computer-based and online tests (cf. Al-Amri, 2007; Hewson et al., 2007; Khoshsima et al., 2017; Öz & Özturan, 2018; Hewson & Charlton, 2019); the third group investigated whether test-takers’ performance is impacted by their attitudes towards tests and computers (cf. Gorsuch, 2004; Fan & Ji, 2014; Dizon, 2016; Hartshorn & McMurry, 2020; Chung & Choi, 2021; Hoang et al., 2021); the fourth group analysed the attitudes of other test users, such as teachers, principals and administrators (cf. 
Winke, 2011; Abduh, 2021; Ghanbari & Nowroozi, 2021; Yulianto & Mujtahin, 2021; Zhang et al., 2021; Lučev et al., 2022); the last selected group of studies investigated the potential effect of test anxiety on test-takers’ performance (cf. Cassady & Johnson, 2002; Chapell et al., 2005; Juretić, 2008; Aliakbari & Gheitasi, 2017).

The present study: Owing to technological advancements and their advantages, online tests are becoming more frequent, and their administration has become indispensable in online classes. A considerable body of research analyses online tests and compares them to paper-based tests, and many researchers investigate the impact of learners’ attitudes and test anxiety on their performance in online tests, but analysing the validity of online assessment with a systematic and methodological approach remains an underexplored field. In the Croatian context, to our knowledge, there are no extensive studies focusing on the analysis of the validity of online assessment. Therefore, the purpose of our research is to offer empirical evidence in this domain. The main aim of this thesis is to analyse the validity of online assessment in English for Specific Purposes (ESP), and the individual aims are as follows: 1) to compare the online test and paper-based test scores; 2) to examine the participants’ attitudes towards online tests, tests in general and computers; 3) to examine the participants’ test anxiety level.

The research questions (RQ) guiding us in reaching the abovementioned aims are the following:
RQ1: Do online tests test reading comprehension, comprehension and use of vocabulary and written production abilities in ESP?
RQ2: Are scores in online tests connected with the participants’ attitudes towards online tests, tests in general and computers?
RQ3: Are scores in online tests connected with the participants’ test anxiety level?
On the basis of the stated aims and previous research, the following hypotheses (H) are formed:
H1A: Online tests test the reading comprehension ability in ESP.
H1B: Online tests test the comprehension and use of vocabulary ability in ESP.
H1C: Online tests test the written production ability in ESP.
H2A: There is no correlation between online test scores and the participants' attitudes towards online tests.
H2B: There is no correlation between online test scores and the participants' attitudes towards tests in general.
H2C: There is no correlation between online test scores and the participants' attitudes towards computers.
H3: There is no correlation between online test scores and the participants' test anxiety level.

Participants, instruments and method: To achieve the abovementioned aims and test the hypotheses, we conducted our study with 122 participants, who were first-year graduate students of Economics at a Croatian university, enrolled in a Business English (BE) course at B1-B2 level according to the CEFRL. They were taught by ten different teachers, one of whom was also the author of this study and of the test that was used as one of the instruments, both in its online and paper-based versions. The test comprised reading comprehension, comprehension and use of vocabulary, and written production tasks. The participants’ performance in the latter was assessed by two ESP experts, who were the two other participants in this study. 
The other instruments were four structured questionnaires: the Attitudes towards online BE tests questionnaire (adapted from Fan & Ji, 2014), the Attitudes towards tests in general questionnaire (a translation of the Attitudes Toward Tests Scale, created by Dodeen (2008), as cited in Muñoz (2017)), the Attitudes towards computers questionnaire (a translation of the attitudes towards computers instrument used in the studies by Hewson et al. (2007) and Hewson & Charlton (2019)), and the Test anxiety questionnaire (a translation of the Westside Test Anxiety Scale, created by Driscoll (2007), as cited in Aliakbari & Gheitasi (2017)). The data for our study were gathered in 2021 in two stages – the participants first took the online test and completed two questionnaires in Google Forms, and three and a half months later they took the paper-based test and completed the remaining questionnaires on paper. These data were then analysed using a mixed methods approach, starting with the qualitative analysis, which included the Assessment Use Argument (AUA). The quantitative portion of the study comprised descriptive and inferential statistics – correlation analysis and analysis of variance. The aim was to analyse the interaction between the two factors (testing mode, i.e. online and paper-based, and test sections, i.e. reading comprehension, comprehension and use of vocabulary, written production) and to see whether the impact of one factor depends on the levels of the other. Furthermore, we wanted to examine the correlations between variables, i.e. the participants’ test results and their responses in the questionnaires. Ethical principles were adhered to while conducting this study (cf. Shuster, 1997; Steneck, 2006; Resnik & Shamoo, 2011; Kraš & Miličević, 2015; Truog et al., 2015; OUZP, 2016; Cergol, 2021), as well as certain testing rituals (Fulcher, 2010). In addition, a formal request was sent to the Head of the Foreign Languages Department and to the BE teachers, who gave their consent for this study, and an informed consent form was sent to the participants, which explained the present study, guaranteed their anonymity and allowed them to withdraw from the study without any prejudice. Finally, the participants’ motivation to take part in this study was boosted by offering them intellectual and material rewards, i.e. opportunities for further practice and gift cards (Cergol, 2021).

Results and discussion: In line with the first aim of this study, the quantitative analysis of the data has shown that the mean value of the participants’ scores in the online test was higher than in the paper-based test overall, as well as in the reading comprehension and written production sections; the opposite is true for the comprehension and use of vocabulary section. Additionally, the participants’ responses in the questionnaires (Table 7) imply that their attitudes towards online BE tests are generally positive (M = 4.78, min = 2.75, max = 6.00) and that their attitudes towards tests in general are somewhat positive (M = 3.2, min = 2.00, max = 4.59). Next, their attitudes towards computers were grouped according to the factor analysis of this questionnaire, and they show a relatively low level of computer anxiety (M = 2.12, min = 0.96, max = 4.47) and computer addiction (M = 2.15, min = 1.05, max = 4.47). Finally, the participants demonstrated a moderately low level of test anxiety (M = 3.5, min = 1.00, max = 5.00). 
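For illustration only, the descriptive part of this quantitative analysis could be reproduced with a few lines of Python. This is a minimal sketch under assumptions: the file name and the column names for the questionnaire scale scores are hypothetical placeholders, not the actual data files or software used in the study.

```python
# Minimal sketch of the reported questionnaire descriptives (mean, min, max per scale).
# File and column names are hypothetical; one row per participant is assumed.
import pandas as pd

quest = pd.read_csv("questionnaires.csv")
scales = ["attitudes_online_tests", "attitudes_tests_general",
          "computer_anxiety", "computer_addiction", "test_anxiety"]

# Mean, minimum and maximum of each scale score, in the spirit of Table 7
print(quest[scales].agg(["mean", "min", "max"]).round(2).T)
```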
After the two ESP experts assessed the participants’ written production in both the online and the paper-based tests in three categories (i.e. form, content, language) and in total, a t-test was performed (Table 9) and the ICC was calculated (Table 10) to analyse the interrater reliability. The results have shown that the interrater reliability was excellent or good (ICC values between 0.78 and 0.97) in all variables except in “paper-based test - language”, where it was moderate (ICC = 0.73), following the guidelines of Koo & Lee (2016). Furthermore, to examine hypotheses H1A, H1B and H1C, a two-way repeated measures analysis of variance was conducted, which has shown that the first effect, i.e. the testing mode, is not statistically significant, meaning that the online and the paper-based tests test the same abilities of the participants, i.e. their reading comprehension, comprehension and use of vocabulary, and written production. The other effect, i.e. the test content, was statistically significant, and the interaction between the two main effects can be seen in Figure 22. Further t-tests show that there are differences in the participants’ scores in the categories “written production - total” (t = 2.996; df = 119; p < 0.01) and “comprehension and use of vocabulary” (t = -2.589; df = 119; p < 0.05) – in the former, the participants achieve better results in the online test (M = 10.837; SD = 3.305), and in the latter, their results are better in the paper-based test (M = 12.1; SD = 2.317). To examine hypotheses H2A, H2B, H2C and H3, correlation analyses were conducted (Table 12), and they demonstrate that the participants’ attitudes towards online BE tests, tests in general and computers do not show a statistically significant correlation with any section of the test; the only correlation established was a weak negative correlation between the participants’ test anxiety level and the reading comprehension section of the test (r = -.205; p < .05).

The qualitative analysis applied the part of the Assessment Use Argument which connects the participants’ test performance to the interpretation of these results, i.e. the Assessment Validity Argument (AVA), as shown in Figure 22. The first claim in our AVA is that the participants’ online test results are consistent, for which the backing is that the test administration procedures were followed consistently, that the assessment procedures are sound and were also followed consistently, and that the results are consistent across testing modes. The warrants for these backings are an expert author of the test, an expert examiner and invigilator, expert BE assessors for one section of the test, the use of Google Forms for the automatic scoring of the other two sections of the test, and finally the abovementioned two-way repeated measures analysis of variance. The rebuttals for this first claim were rejected by the rebuttal data, which included evidence of a quality test design, administration and assessment, as well as the earlier mentioned interrater reliability analysis and the abovementioned correlation analyses. The second claim in our AVA is that the interpretation that the online test is a valid indicator of the reading comprehension, comprehension and use of vocabulary and written production abilities is meaningful, impartial, generalizable, relevant and sufficient. 
The backings for this claim are a well-defined construct, clear test specifications, the best possible performance of the participants, impartial test tasks, inoffensive test content, clear scoring criteria, impartial test administration, an online test that corresponds to the target language use, online test assessment criteria and procedures that correspond to the ESP context, and finally, scores that are relevant and sufficient indicators of the level of the participants’ abilities. The warrants for these backings are the online test, which is in accordance with the course outcomes, syllabus and teaching materials, the previously mentioned experts (i.e. the author, examiner, invigilator and assessors) and Google Forms, the well-explained scoring criteria that are also written on the test itself, as well as the well-explained procedures and conditions of the test administration. The rebuttals for this second claim were rejected by the rebuttal data, which showed that the course outcomes, syllabus and teaching materials had been used in class and were constantly available to the participants in Google Classroom; that the participants’ attitudes towards online BE tests, tests in general and computers are not in a statistically significant correlation with any section of the test; that the scoring criteria were clear and explained to the participants; that all participants were able to access the test equally and demonstrate their language proficiency; and that the participants’ scores were a relevant and sufficient indicator of the level of their abilities. The only data that weakened one of the rebuttals were the results of the test anxiety questionnaire, because of the previously mentioned weak negative correlation established between the participants’ test anxiety level and the reading comprehension section of the test.

Therefore, we can say that hypotheses H1A, H1B and H1C have been confirmed because of the weak positive correlation between the online and paper-based test scores and the statistically significant effect of the test content in all three test sections. Our conclusion that the testing mode has no effect on the participants’ performance is in line with the previous studies by Al-Amri (2007), Hewson et al. (2007), Mohammadi & Barzgaran (2012), and Öz & Özturan (2018). Equally, we can say that hypotheses H2A, H2B and H2C have also been confirmed because the participants’ attitudes towards online BE tests, tests in general and computers are not statistically correlated with their results in any of the sections of the exam. Our results are in accordance with the studies showing that computer familiarity and attitudes towards computers do not have a significant impact on computer-based tests (Al-Amri, 2007; Khoshsima et al., 2017) or on online tests (Hewson et al., 2007; Hewson & Charlton, 2019), and that the level of computer familiarity does not have an impact on an online writing test (Mohammadi & Barzgaran, 2012). Furthermore, our results are partly in accordance with those of Fan and Ji (2014), which could be due to the fact that they analysed only the attitudes towards a certain paper-based English test. On the other hand, our results are completely the opposite of those of Gorsuch (2004), possibly because she analysed only a listening test with a sample of six participants, and of those of Hartshorn and McMurry (2020), who focused on the participants’ attitudes towards online English classes during the COVID-19 pandemic. 
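For readers who want a concrete picture of the inferential analyses reported above – the interrater reliability (ICC), the two-way repeated measures analysis of variance, the follow-up paired t-tests and the correlation analyses – the following is a hedged Python sketch. The data files, long-format layout and column names (participant, rater, mode, section, score, test_anxiety) are assumptions made for illustration, and the use of the pandas and pingouin libraries is our choice for the sketch, not a description of the statistical software actually used in the study.

```python
# Illustrative sketch of the reported analyses; data files and column names are assumed.
import pandas as pd
import pingouin as pg

scores = pd.read_csv("scores_long.csv")        # one row per participant x mode x section
ratings = pd.read_csv("writing_ratings.csv")   # the two raters' written production scores
quest = pd.read_csv("questionnaires.csv")      # one row per participant

# 1) Interrater reliability of the written production scores (ICC, cf. Koo & Lee, 2016)
icc = pg.intraclass_corr(data=ratings, targets="participant",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# 2) Two-way repeated measures ANOVA: testing mode (online/paper) x test section
aov = pg.rm_anova(data=scores, dv="score", within=["mode", "section"],
                  subject="participant", detailed=True)
print(aov)

# 3) Follow-up paired t-test for one section (assumes every participant has both modes)
vocab = scores[scores["section"] == "vocabulary"].sort_values("participant")
print(pg.ttest(vocab.loc[vocab["mode"] == "online", "score"].to_numpy(),
               vocab.loc[vocab["mode"] == "paper", "score"].to_numpy(),
               paired=True))

# 4) Correlation between an online test section and a questionnaire scale
online = (scores[scores["mode"] == "online"]
          .pivot(index="participant", columns="section", values="score")
          .reset_index())
merged = quest.merge(online, on="participant")
print(pg.corr(merged["test_anxiety"], merged["reading"]))
```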
Hypothesis H3 has mostly been confirmed because the correlation analysis results indicate that the participants’ levels of test anxiety are statistically correlated only with the reading comprehension section of the test, as previously mentioned. Therefore, we can say that the participants with a lower test anxiety level achieve better results in this section, which in part corresponds to the studies conducted by Cassady & Johnson (2002), Chapell et al. (2005), Juretić (2008), and Erceg Jugović & Lauri Korajlija (2012).

Conclusion, implications and limitations: The validity of online assessment is still an insufficiently researched area in which new insights appear every day. It is important to note that, whatever the reason for its use, online assessment needs to be valid, irrespective of whether validity is seen as a unitary concept, which is also our standpoint, or as comprising different types. Validity thus needs to be demonstrated, bearing in mind that it is never complete, but a matter of degree, and in this study the Assessment Use Argument (AUA) – more precisely, its Assessment Validity Argument (AVA) part – was used as one possible validation method. Other researchers have also used the AUA in their studies (Chapelle, Jamieson & Hegelheimer, 2003; Llosa, 2008; Wang et al., 2012; Long et al., 2018; Jun, 2021; Park, 2021), which enabled us to compare their methodology with ours in analysing the validity of online assessment in English for Specific Purposes (ESP). Our results show that the scores our participants achieved in an online Business English test are consistent, and that the interpretation of these results – namely that the online test is a valid indicator of the reading comprehension, comprehension and use of vocabulary and written production abilities – is meaningful, impartial, generalizable, relevant and sufficient. Therefore, because the empirical evidence used in our research rejected all possible rebuttals and only weakened one of them, we can conclude that our AVA is strong (cf. Fitzpatrick & Clenton, 2010). Taking into consideration the fact that validity “is a matter of degree, not all or none” (Messick, 1989: 33), we can also conclude that our results increase the degree of validity of online ESP tests (cf. Anastasi, 1976; Angoff, 1988; Messick, 1989; Alderson & Banerjee, 2001; Urbina, 2004; Bachman, 2004; Davies & Elder, 2005; Weir, 2005; McNamara & Roever, 2006; Fulcher & Davidson, 2007; Milas, 2009; Fitzpatrick & Clenton, 2010; Green, 2014; Shepard, 2016; Chapelle, 2021; Green & Fulcher, 2021). Consequently, the results of this research, which was conducted in the Croatian context, are significant because they are comparable with the results of the studies previously mentioned in this summary, which were conducted in different countries and used various instruments. This places our research in a global context, to which it contributes with the applied analytical procedures and methods. What is more, being the first systematic study of validity in Croatia to use the AUA, it has also contributed to Croatian terminology by offering translations of some crucial notions in the area of assessment, as shown in the Glossary (Appendix 7). 
Furthermore, the results of the present study can be used by all test users – from curriculum, syllabus and test authors to teachers and students or test-takers – who are thus given an argument for using an online assessment mode. This is especially important in situations when paper-based assessment cannot be administered for a variety of reasons, such as pandemics or wars, but also when online assessment is the preferred mode because of institutional or individual requirements or the desire for digitalization of learning, teaching and assessment. Moreover, our results can be applied not only in ESP university courses (cf. Long et al., 2018) but also in other tertiary, secondary and primary education institutions, i.e. wherever test users need to validate the applied instrument. This will improve the quality of online assessment because the results achieved by test-takers will be valid indicators of the abilities that the test in question actually purports to test. This applies to medium-stakes assessment, such as the one used in the present study, i.e. assessment of test-takers’ achievement in a certain area, but also to low-stakes assessment, i.e. assessment used formatively, providing feedback on their performance (cf. Roever, 2001). Finally, the AVA can be beneficial for teachers, too, because it can point them towards the areas they could improve when assessing the learners’ mastery of certain standards (cf. Llosa, 2008).

As far as the limitations of the present study are concerned, the first one is the sample. It included only first-year students, who were taught by different teachers (one of whom was the author of the present study) and were unevenly distributed in this regard. The second one is the language test – the construct of the BE test was harmonized with the syllabus and with the university’s test administration rules, and the researcher was at the same time its author. The third limitation was the assessment of the written production, which was performed by two LSP experts, one of whom was the researcher. The fourth limitation was the other applied instruments, i.e. the four questionnaires, which were selected after an examination of the available scientific literature, in accordance with the aims of our study. Taking the abovementioned into consideration, the following is suggested as possible further research: include more participants from different years of study and in equal numbers from each respective teacher; include other test users in the research, such as teachers, examiners and administrators, to examine their attitudes towards the test that is being validated (Winke, 2011); expand the test so that it includes more test items and/or more sections, such as listening comprehension and spoken production and interaction; use more assessors; use automatic writing evaluation applications if technological advancements allow, making sure that the assessment is not led by technology, but that technology only enhances it (Brunfaut, 2023); use more questionnaires that examine the participants’ attitudes; compare the syllabi of other courses, not only of foreign languages, and develop an instrument that would test those abilities that are common to all relevant courses; and ensure that the author of the test is not simultaneously its validator (cf. Davies & Elder, 2005; Winke, 2011).