Please note that the scoring key may be requested by sending the signed Data Use Agreement to zis@gesis.org or to the original test authors (Thomas Gatzka and Judith Volmer; see Scoring Key Authorization). The following instruction describes how the test was administered in a computer-assisted format in this study. In contrast to the German test version, we asked participants to pick only the best out of four response options; the German version additionally asks for the worst response option. We chose this shortened instruction to reduce the test duration in an initial pilot study. The complete instruction is presented within square brackets and in grey font colour.
We recommend a computer-assisted format to administer the test. A computer-assisted format usually forces test-takers to give the correct number of responses, whereas in a paper-pencil format this may not be obvious to test-takers. Thus, if a paper-pencil format is used instead of a computer-assisted one, the following sentence should be included in the test instruction after the sentence “Please always select the best and the worst option for each situation”:
“Please select exactly two response options for each situation. Mark (+) for the best solution and (−) for the worst solution.”
When asking only for one option (here: the best) in a paper-pencil format, please insert the following sentence in the test instruction after the sentence “Please always select the best option for each situation”:
“Please select exactly one response option for each situation.”
Below, 12 situations are described as they may occur in the occupational daily routine of teams or working groups. For each situation, four different behavioural options are presented.
Please pick the most [and least] suitable behaviour for each situation.
For some situations, it may be difficult for you to decide, as certain details are not specified, you may not have experienced a similar situation before, or you may consider some options very similar. However, please choose the alternative[s] that you generally consider the best [and worst] solution.
Please always select the best [and the worst] option for each situation. [Please do not indicate the same answer as the best and worst solution.]
Please do not skip any situation.
Example
Your team has a task that is fundamentally different to previous tasks and covers completely new aspects. In addition, it is very likely that aspects of the task will change in the medium term.
What should your team do [and not do] in such a situation?
a) Some members of the team do not assist with the task to stay flexible.            ( [+] )   [( X )]
b) All aspects of the task are assigned to several competent members of the team.    ( [+] )   [( − )]
c) The team asks a supervisor to assign task aspects.                                 ( [+] )   [( − )]
d) Task aspects are assigned as needed in regular meetings.                           ( X )     [( − )]
Items
Table 1
Items of the English Version of the SJT-TW
The answers are given in a forced-choice format: the best and the worst solution have to be identified. The respondent’s task is to indicate how they (or the entire team) should behave.
Scoring
For each scenario, there is a predefined best and worst solution, which can be taken from the scoring key. The scoring key may be requested by sending the signed Data Use Agreement to zis@gesis.org or to the original test authors (Thomas Gatzka and Judith Volmer; see Scoring Key Authorization). If test-takers correctly choose the best solution, the response is coded as “1”; if they correctly choose the worst solution, the response is also coded as “1”. If test-takers select the best solution as their worst solution or vice versa, the response is scored as “-1”. All remaining responses are scored as “0”. Item scores are obtained by summing the scores for the best and the worst response of each scenario; thus, each scenario score can range from -2 to +2. To obtain a score for the total test, values across scenarios are added up to an unweighted sum score. The total test score may range from -24 to +24. Test scores may also be obtained separately for best and worst responses across scenarios. When analysing best and worst responses separately (or only one of them, as we did in this pilot study to reduce participation time), item scores for each scenario can range from -1 to +1 and test scores across scenarios can range from -12 to +12.
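The scoring rule can be expressed in a few lines of R. The following is a minimal sketch, not the authors’ scoring code; the objects best, worst, key_best and key_worst are hypothetical, and the actual keyed solutions must be taken from the scoring key:

    # Sketch of the scoring rule described above.
    # best, worst:         options (e.g., "a" to "d") a test-taker picked,
    #                      one element per scenario
    # key_best, key_worst: keyed solutions from the scoring key
    score_scenario <- function(best, worst, key_best, key_worst) {
      s_best  <- ifelse(best == key_best, 1, ifelse(best == key_worst, -1, 0))
      s_worst <- ifelse(worst == key_worst, 1, ifelse(worst == key_best, -1, 0))
      s_best + s_worst  # scenario score, ranging from -2 to +2
    }
    # sum(score_scenario(best, worst, key_best, key_worst)) yields the
    # unweighted sum score across the 12 scenarios (-24 to +24)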
Adequate methods may be applied to deal with missing values (e.g., multiple imputation or full information maximum likelihood estimation).
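For instance, multiple imputation could be carried out with the R package mice. This is a sketch only (mice was not part of the analyses reported here); sjt_items is a hypothetical data frame of item scores containing missing values:

    library(mice)
    imp <- mice(sjt_items, m = 5, seed = 42)  # five multiply imputed data sets
    sjt_complete <- complete(imp, 1)          # extract, e.g., the first data set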
Application field
This test should be applied to assess knowledge about teamwork effectiveness in research settings (given the lack of validity evidence for the translated version, we do not encourage its use beyond research settings). The test can be applied independently of actual teams or team tasks. For instance, the original development study (Gatzka & Volmer, 2017) validated the test with a student sample as well as a sample of employees. It may be particularly useful for teamwork research (see Gatzka & Volmer, 2017). The test may be administered in a computer-assisted or a paper-pencil self-administered questionnaire format. For this study, we chose a computer-assisted questionnaire format. On average, participants took 6.57 minutes (SD = 1.63) to complete the shortened test asking only for the best option. Hence, participation should take approximately 10 minutes if test-takers are asked to pick both the best and the worst response option.
The reported test is a translation of the German Situational Judgment Test for Teamwork (SJT-TW; Gatzka & Volmer, 2017). SJTs are popular tools in personnel selection and are traditionally defined as low-fidelity simulations (Motowidlo et al., 1990). Most SJTs consist of written situation descriptions and several behavioural response options, of which test-takers choose the one most similar to how they should or would behave in the given situation (McDaniel & Nguyen, 2001). As such, SJTs sample knowledge about effective behaviours in situations relevant for work-related criteria (Motowidlo et al., 1990; Weekley et al., 2015). Meta-analyses have confirmed the predictive power of SJTs for job performance criteria (Christian et al., 2010; McDaniel et al., 2001, 2007).
Effective teamwork is best described as a set of various behaviours rather than a single, narrow construct (Salas et al., 2005; Rousseau et al., 2006). Gatzka and Volmer (2017) integrated results from two reviews on teamwork to develop a working model and to identify core elements of team effectiveness (Salas et al., 2005; Rousseau et al., 2006). Furthermore, they considered two models that had already been implemented in test procedures (O'Neil et al., 1997; Stevens & Campion, 1994) as well as further reviews on team processes and team efficacy (Kozlowski & Ilgen, 2006; Marks et al., 2001; Mathieu et al., 2008). The resulting working model (Gatzka & Volmer, 2017) consists of 30 behaviours particularly relevant for teamwork success, which can be categorized into seven dimensions: (1) evaluation of the operational framework, (2) planning and organisation, (3) cooperation, (4) communication, (5) monitoring and adaptation, (6) help behaviour and support, and (7) motivation and cohesion.
Gatzka and Volmer (2017) identified SJTs as a suitable tool for the assessment of teamwork effectiveness. These authors demonstrated that the SJT-TW correlated with measures of teamwork skills and even predicted supervisor-rated contextual and teamwork performance. Overall, the original version of the test was well in line with contemporary conceptualizations of teamwork effectiveness and thus a valuable tool for teamwork research and personnel selection (Gatzka & Volmer, 2017).
Beyond its intended use for teamwork research, the test may be useful for research on the underlying psychological processes of SJTs. Despite the well-established criterion-related validity of SJTs, it remains unclear why SJTs work as assessment methods (e.g., Freudenstein et al., 2020; Lievens & Motowidlo, 2016; McDaniel et al., 2016; Schäpers et al., 2019). For instance, the role of situation descriptions for test-takers’ responses to SJT items is subject to debate. Some argue in favour of processes that are similar to those underlying behaviour in real-life situations, while others advocate context-independent constructs (e.g., Freudenstein et al., 2020; Lievens & Motowidlo, 2016; Schäpers et al., 2019). However, the number of SJTs available for research is limited. Thus, an English translation of the SJT-TW further enables research on the underlying processes of SJTs.
Item generation and translation
Gatzka and Volmer (2017) used the 30 behaviours from their working model on team effectiveness to develop a Situational Judgment Test. The final test consists of 12 hypothetical situations, or scenarios, each reflecting a problem concerning teamwork, with four behavioural response options per situation. Test-takers are asked to indicate the best and the worst solution for each situation. The SJT showed substantial correlations with related constructs and job-related criteria.
To translate the SJT-TW into English, we utilized the TRAPD procedure (Harkness, 2003). TRAPD is an acronym for the steps needed to produce high-quality translations of questionnaires: translation, review, adjudication, pretesting and documentation. We created two independent translations of the SJT-TW. The overall aim was to retain as much of the original item content and structure as possible. Both translators were fluent in spoken and written English and had expertise in SJT research; however, neither was a native speaker or a professional translator. The first author reviewed both translations and merged them into a single version. Afterwards, the translators revised this version with regard to word flow and completeness of the original item content. Two independent native speakers then additionally reviewed this revised test version, and all changes were adopted accordingly. Finally, a senior researcher with high expertise in psychological assessment and SJT research made final changes to the translation.
In this study, we pilot-tested the translated SJT with a small sample to gauge whether test-takers understood all items and to inspect preliminary response patterns. We instructed participants to pick the response option that best resembles what they should do in each of the 12 scenarios. To reduce participation time, we did not instruct participants to also pick the response option that resembles the worst solution; this deviates from the original test format. We scored responses with “1” if they reflected the most effective response, with “-1” if they reflected the most ineffective response, and all remaining responses with “0”. Please note that interpretations of these results are only preliminary and should be made with caution due to the small sample size. Data were analysed with R (version 3.6.1; R Core Team, 2019) and the R package psych (version 1.8.12; Revelle, 2018).
Sample
Data for the English version of the SJT-TW were collected in 2019 from the following convenience sample from the United States: N = 20 native speakers of American English recruited via Amazon MTurk; sex: 40% female; age: M [min; max] = 35.25 [25; 53], SD = 9.21. Most participants (75%) were gainfully employed at the time of data collection. Participants held either an undergraduate (45%) or a graduate degree (20%) or had received vocational training (5%); the remaining 30% of the sample had graduated from high school. Test-takers received $1 for participation. No a priori power analysis was conducted, as this was a pilot study. No missing values occurred.
Item parameters
Table 2 presents item parameters for the 12 SJT items. Item distributions were somewhat similar to those of the German version (Gatzka & Volmer, 2017). The range of item-total correlations was also comparable between the German and the English version of the SJT-TW, with a slightly higher mean item-total correlation for the English version (rit = .22 vs. .17). The internal consistency of SJTs is typically low (Catano et al., 2012; Kasten & Freund, 2016); thus, small item-total correlations were to be expected. However, item 11 showed a negative item-total correlation. This may be due to the small sample size of this pilot study. Nevertheless, if a negative item-total correlation persists in future applications, this item should be excluded from further analyses.
The reported item-total correlations presume a single-factor structure of the SJT. This is in line with recommendations by Gatzka and Volmer (2017). However, these authors also proposed a two-factor structure of the SJT-TW (Factor 1: Items 2, 3, 5, 6, 7, 9, 10, 12; Factor 2: Items 1, 4, 8, 11). Gatzka and Volmer (2017) argued that this factor structure can only be interpreted as preliminary evidence due to the small number of items and the low internal consistencies of the two factors. They concluded that only a total test score should be calculated. An investigation of the factor structure of the translated SJT-TW was not feasible given the small sample size of N = 20.
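Should larger samples become available, the proposed two-factor structure could, for example, be examined with an exploratory factor analysis in R. This is a sketch only; sjt_items is a hypothetical N × 12 data frame of item scores:

    library(psych)
    # Two-factor solution with maximum likelihood estimation; the default
    # oblique rotation (oblimin) requires the GPArotation package
    fa(sjt_items, nfactors = 2, fm = "ml")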
Table 2
Means, Standard Deviations, Skew, Kurtosis and Item-Total-Correlations of the Manifest Items
            M       SD      Skew    Kurtosis   rit
Item 1      0.70    0.47   -0.81     -1.41     0.43
Item 2      0.00    0.65    0.00     -0.74     0.39
Item 3      0.25    0.55    0.11     -0.60     0.04
Item 4     -0.20    0.41   -1.39     -0.07     0.13
Item 5      0.40    0.60   -0.34     -0.95     0.49
Item 6      0.40    0.50    0.38     -1.95     0.25
Item 7      0.25    0.91   -0.47     -1.68    -0.02
Item 8      0.20    0.70   -0.25     -1.06     0.38
Item 9      0.25    0.72   -0.36     -1.12     0.11
Item 10     0.15    0.67   -0.15     -0.93     0.32
Item 11     0.25    0.44    1.07     -0.89    -0.22
Item 12     0.15    0.75   -0.22     -1.27     0.29
Note. Scale ranging from -1 to +1 as test-takers were only asked to pick the best response option; N = 20.
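The item parameters in Table 2 can be reproduced with the R package psych (version 1.8.12; Revelle, 2018), which was used in this study. A minimal sketch, assuming the item scores are stored in a hypothetical N × 12 data frame sjt_items; whether the reported rit values are corrected or uncorrected item-total correlations is not stated, so the sketch uses the corrected variant:

    library(psych)
    describe(sjt_items)  # M, SD, skew and kurtosis per item (cf. Table 2)
    # corrected item-total correlations (cf. the rit column of Table 2)
    alpha(sjt_items)$item.stats$r.drop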
Objectivity
The English translation of the SJT-TW is a standardised psychological instrument, like the German original. Each test-taker receives the same instruction, items and response options, and the answers are evaluated by means of a solution key. Hence, objectivity of application and evaluation is assured. Due to the ambiguous factor structure of the SJT-TW, individual test scores should be interpreted with care. Rather than allocating psychological meaning to individual test scores, sum scores of the SJT-TW should be interpreted as indicators that correlate with various constructs such as job performance and team skills. This is not unique to this specific test but rather representative of most SJTs (see McDaniel et al., 2016).
Reliability
The reliability of the scale was estimated with the internal consistency coefficient Cronbach’s alpha. Coefficient alpha of the 12 SJT items was α = .52. Although this internal consistency is insufficient, it is similar to the value obtained in the employee sample of the original validation study (α = .44; Gatzka & Volmer, 2017). Moreover, the internal consistency of the SJT-TW is in line with meta-analyses on the internal consistency of SJTs (Catano et al., 2012; Kasten & Freund, 2016). These low values generally reflect the ambiguous factor structure of SJTs.
Validity
Based on results from this very small sample, we tentatively conclude that the overall scale worked very similarly to the German version and did not show any major inconsistencies. Still, a proper validation study is needed before the English SJT-TW is used beyond research settings. We consider the current version a research version that should not be used in high-stakes settings.
Descriptive statistics (scaling)
The test sum score had a mean of 2.80 (SD = 3.02), with a skewness of -1.04 and a kurtosis of 0.37. Thus, participants on average chose more correct than incorrect response options. This result is in line with the German test version (Gatzka & Volmer, 2017).
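These values correspond to simple descriptive statistics of the unweighted sum score, sketched here under the same assumption of a hypothetical data frame sjt_items of item scores:

    total <- rowSums(sjt_items)  # unweighted sum score per test-taker
    psych::describe(total)       # mean, SD, skewness and kurtosis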
Further quality criteria
Completing the test takes about 10 minutes (or, on average, 6.57 minutes [SD = 1.63] for the shortened version asking only for the best option), which makes the test a very economical instrument. Research also suggests that SJTs are less susceptible to faking, especially when compared to personality self-reports (Kasten et al., 2018).
Further literature
Gatzka, T., & Volmer, J. (2017). Situational Judgment Test für Teamarbeit (SJT-TA). In Zusammenstellung sozialwissenschaftlicher Items und Skalen. https://doi.org/10.6102/zis249
Jan-Philipp Freudenstein, Freie Universität Berlin, E-Mail: jpfreudenstein@gmail.com