RoBuster—Corpus Annotated With Risk of Bias Text Spans in Randomized Controlled Trials in Physiotherapy and Rehabilitation: Corpus Development and Annotation Study

doi:10.2196/55127

¹Informatics Institute, HES-SO Valais-Wallis, Rue du Technopole 3, Sierre, Switzerland

²Institute of Health Sciences, HES-SO Valais-Wallis, Thermenstrasse 41, Leukerbad, Valais, Switzerland

³Physiotherapy Tschopp & Hilfiker, Brig, Valais, Switzerland

⁴Centre national de la recherche scientifique, Laboratoire Interdisciplinaire des Sciences du Numérique, Université Paris-Saclay, Orsay, France

⁵Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom

⁶Centre d'Informatique Universitaire, University of Geneva, Geneva, Switzerland

⁷Medical Faculty, University of Geneva, Geneva, Switzerland

⁸The Sense Innovation and Research Center, Sion, Switzerland

*these authors contributed equally

Corresponding Author:

Anjani Dhrangadhariya, PhD

Background: Risk of bias (RoB) assessment of randomized controlled trials (RCTs) is vital to answering systematic review questions accurately. Manual RoB assessment for hundreds of RCTs is a cognitively demanding and lengthy process. Automation has the potential to assist reviewers in rapidly identifying text descriptions in RCTs that indicate potential risks of bias. However, no corpus annotated with RoB text spans exists that could be used to fine-tune or evaluate large language models (LLMs), and there are no established guidelines for annotating RoB spans in RCTs.

Objective: The revised Cochrane RoB tool (RoB 2) provides comprehensive guidelines for RoB assessment; however, due to the inherent subjectivity of this tool, it cannot be used directly as an RoB annotation guideline. This study aimed to develop precise RoB text span annotation instructions that address this subjectivity and thus aid corpus annotation.

Methods: We leveraged RoB 2 guidelines to develop visual instructional placards that serve as annotation guidelines for RoB spans and risk judgments. Expert annotators used these visual placards to annotate a dataset named RoBuster, consisting of 41 full-text RCTs from the domains of physiotherapy and rehabilitation. We report interannotator agreement (IAA) between 2 annotators for text span annotations before and after applying visual instructions on a subset (n=9) of RoBuster. We also provide IAA on bias risk judgments using Cohen κ. Moreover, we used a portion of RoBuster (n=10) to evaluate an LLM (here, GPT-3.5) on the challenging task of RoB span extraction using a straightforward evaluation framework, which also demonstrates the utility of the corpus.

Results: We present a corpus of 41 RCTs with fine-grained text span annotations comprising more than 28,427 tokens belonging to 22 RoB classes. The IAA at the text span level, calculated using the F1 measure, varies from 0% to 90%, while Cohen κ for risk judgments ranges between –0.235 and 1.0. Using visual instructions for annotation increases the IAA by more than 17 percentage points. The LLM (GPT-3.5) shows promising but varied observed agreement with the expert annotations across the different bias questions.

Conclusions: Despite having comprehensive bias assessment guidelines and visual instructional placards, RoB annotation remains a complex task. Using visual placards for bias assessment and annotation enhances IAA compared to cases where visual placards are absent; however, text annotation remains challenging for the subjective questions and the questions for which annotation data are unavailable in RCTs. Similarly, while GPT-3.5 demonstrates effectiveness, its accuracy diminishes with more subjective RoB questions and low information availability.

JMIR Form Res 2026;10:e55127


Keywords

risk of bias; corpus annotation; natural language processing; large language models; LLM; information extraction; RoBuster; corpus; randomized controlled trials; RCT; reviewer; tools; physiotherapy; rehabilitation; effectiveness

Systematic reviews (SRs) synthesized using randomized controlled trials (RCTs) are the highest quality of evidence in the evidence pyramid. SRs help medical professionals make informed health decisions and guide health policies [1,2]. An RCT tests an intervention’s effectiveness by randomly assigning patients to intervention groups and comparing the effect of the intervention under investigation with that of other interventions in a controlled setting [3]. In theory, the randomized design keeps RCTs low in bias, but biases can still infiltrate the design, execution, or reporting phases. Such biases may cause medical professionals to misjudge intervention effects, impacting health outcomes [4,5]. Therefore, bias assessment, known as risk-of-bias (RoB) assessment, is vital for RCTs used for writing SRs.

There are several tools to assess RoB, including the Cochrane Collaboration’s RoB Tool, PEDro RoB scale, revised Cochrane RoB 2 tool, AMSTAR (A Measurement Tool to Assess Systematic Reviews) or AMSTAR 2, EPOC (Effective Practice and Organization of Care) RoB Tool, and other checklists [6-11]. These tools provide structured questions to elicit bias-relevant information from RCTs. Manual RoB assessment is a time-consuming task requiring substantial expertise; it can take hours per RCT and months for a full SR, which emphasizes the need for automation [12,13]. Moreover, writing SRs is highly resource-intensive, taking from about 6 months to several years to complete [6,14,15]. Machine learning (ML) could expedite this process by pinpointing bias-relevant RCT text, aiding quicker quality assessments [16]. Marshall et al [17] attempted to automate RoB assessment using a distant supervision approach supported by manually entered, proprietary data from the Cochrane Database of Systematic Reviews (CDSR). These data are behind a paywall, and the approach is based on the Cochrane RoB 1 guidelines rather than the latest RoB 2 [7]. Although RoB 1 is the most frequently used tool for assessment, the recently revised Cochrane RoB 2 differs significantly [18]. Compared to RoB 1, RoB 2 gives RoB evaluation a more reliable and concrete structure through comprehensive guidelines that aim to increase consistency [7,11]. The use of RoB 2 increased from 0% in 2019 to 24.1% in 2022, indicating the need to switch to this updated tool for bias assessment [19].

Millard et al [20] also explored ML-based RoB assessment using proprietary data. Work based on these paywalled data led to the development of RobotReviewer, which several studies have evaluated for human-level performance [17,21-24]. A lack of public RoB-annotated data still limits community advancements. Wang et al [25] recently released RoB datasets for animal studies, but human clinical trials still lack a comprehensive RoB corpus. Manual RoB assessment is a complex, expert-led task laden with subjective judgments [26,27]. Systematically translating this manual process into an RoB-annotated corpus requires a carefully designed annotation scheme and annotation guidelines. We previously conducted a pilot study to test whether RoB 2 guidelines could be used effectively as annotation guidelines to annotate a corpus of RCTs with RoB. We concluded that RoB 2 cannot be used directly as text annotation guidelines, but we did not provide alternative guidelines [28]. Here, we aim to establish clear annotation guidelines to annotate RCTs with RoB spans corresponding to the RoB 2 tool [11].

In addition, recent large language models (LLMs) have shown potential in handling complex tasks with minimal instructions [29]. However, their capability to identify RoB spans in RCTs has yet to be assessed. Our contributions with this paper are 5-fold as follows:

Development of detailed annotation guidelines for RCT RoB spans.
Development of visual placards to simplify annotation and assist trainee RoB assessors.
Compilation of “RoBuster,” a corpus of 41 annotated RCTs with 22 RoB span types, for ML and LLM training or benchmarking.
Evaluation of an LLM in identifying answers to signaling questions using prompts.
Open sharing of annotation guidelines, dataset, and LLM prompts with the community.

Overview

This section details the annotation scheme, software, and visual placard development. With no existing RoB span annotation guidelines, we created them from scratch based on the revised Cochrane RoB 2 tool [6]. We created a draft of the visual guidelines, doubly annotated a fraction of the RCTs using it, and refined the guidelines based on the conflicts identified. Figure 1 illustrates our methodology starting from data collection.

**Figure 1.** The flowchart (right) illustrates our methodology starting from data collection until interannotator agreement (IAA) calculation. The agreement is compared between the 10 RCTs annotated in Dhrangadhariya et al [28] (left) and the 9 RCTs from the current work. LLM: large language model; RCT: randomized controlled trial.

Expert Team

RoB annotation requires specialized expertise due to the need to thoroughly review the entire RCT text to identify 22 different bias categories. Our annotation team comprised 2 experts in RoB assessment in the physiotherapy and rehabilitation domains: an epidemiology researcher (RH) and an associate professor in physiotherapy (KMS), both with extensive experience in physiotherapy, statistical methods, and SRs. Two senior PhD students (KG and RC) contributed to the development of the visual annotation guidelines. Two researchers in natural language processing, an associate professor in computational linguistics (NN) and a PhD student in computer science (AKD), assisted in creating the visual guidelines, which serve as a benchmark for RoB text span extraction. Finally, JPTH provided feedback to shape the visual annotation guidelines (Multimedia Appendix 1).

Ethical Considerations

In accordance with the Swiss Federal Act on Research involving Human Beings (Human Research Act) of September 30, 2011 (SR 810.30), formal ethical approval by a Cantonal Ethics Committee was not required as the research did not concern human diseases or the structure and function of the human body. The experts who undertook the visual placard development and the annotation process for this corpus were informed about the purpose of the annotation project and agreed to participate voluntarily in the study. They could withdraw their participation at any time without consequences of any kind. They were informed of the purpose and nature of the study via a presentation and had an opportunity to ask questions. They were also fully informed about the eventual publication of the findings. Each expert provided consent for the publication of the annotated corpus, with the understanding that any identifying information would be appropriately anonymized to protect their privacy. The expert annotators volunteered their time and received no financial compensation.

Annotation Scheme

Creating a new annotated corpus requires defining or adopting an annotation scheme. To our knowledge, the only existing RoB span annotation scheme is from our previous work [28]. Rather than developing a new one, we adapted and enhanced this scheme, addressing prior limitations. The scheme aligns with the RoB 2 assessment, which organizes bias into 5 domains reflecting different trial design aspects. Each risk domain decomposes into several signaling questions (SQs), 22 in total (Table 1). Each question prompts the assessor to look for relevant text evidence in the trial and judge the risk response for that SQ (Table 2). For detailed explanations of the SQs, see the original Cochrane RoB 2.0 guidelines and Multimedia Appendix 1 [6].

Table 1. The table lists signaling questions from each bias domain from the revised Cochrane Risk of Bias (RoB) tool.

Question number	Signaling question
RoB 1
RoB 1.1	Was the allocation sequence random?
RoB 1.2	Was the allocation sequence concealed until participants were enrolled and assigned to interventions?
RoB 1.3	Did baseline differences between intervention groups suggest a problem with the randomization process?
RoB 2
RoB 2.1	Were participants aware of their assigned intervention during the trial?
RoB 2.2	Were carers and people delivering the interventions aware of participants’ assigned intervention during the trial?
RoB 2.3	Were there deviations from the intended intervention that arose because of the trial context?
RoB 2.4	Were these deviations likely to have affected the outcome?
RoB 2.5	Were these deviations from the intended intervention balanced between groups?
RoB 2.6	Was an appropriate analysis used to estimate the effect of assignment to intervention?
RoB 2.7	Was there potential for a substantial impact (on the result) of the failure to analyze participants in the group to which they were randomized?
RoB 3
RoB 3.1	Were data for this outcome available for all, or nearly all, participants randomized?
RoB 3.2	Is there evidence that the result was not biased by missing outcome data?
RoB 3.3	Could missingness in the outcome depend on its true value?
RoB 3.4	Is it likely that missingness in the outcome depended on its true value?
RoB 4
RoB 4.1	Was the method of measurement of the outcome inappropriate?
RoB 4.2	Could measurement or ascertainment of the outcome have differed between intervention groups?
RoB 4.3	Were outcome assessors aware of the intervention received by study participants?
RoB 4.4	Could the assessment of the outcome have been influenced by knowledge of the intervention received?
RoB 4.5	Is it likely that the assessment of the outcome was influenced by knowledge of the intervention received?
RoB 5
RoB 5.1	Were the data that produced this result analyzed in accordance with a pre-specified analysis plan that was finalized before unblinded outcome data were available for analysis?
RoB 5.2	Is the numerical result being assessed likely to have been selected, on the basis of the results, from multiple eligible outcome measurements (eg, scales, definitions, time points) within the outcome domain?
RoB 5.3	Is the numerical result being assessed likely to have been selected, on the basis of the results, from multiple eligible analyses of the data?

Table 2. The table lists bias domains from the revised Cochrane Risk of Bias (RoB) tool and the number of signaling questions (SQs) per domain.

Class	Domain	SQs, n
RoB 1	Bias arising from the randomization process	3
RoB 2	Bias due to deviations from intended interventions	7
RoB 3	Bias due to missing outcome data	4
RoB 4	Bias in the measurement of the outcome	5
RoB 5	Bias in the selection of the reported result	3

The response options for the RoB judgment include “Yes,” “Probably yes,” “No,” “Probably no,” or “No information.” Reviewers assess each SQ by examining the factual evidence in the RCT. For example, the SQ “Was the allocation sequence random?” is assessed by checking the randomization method. A well-executed method results in a “Yes” response (low risk), while a poorly executed method leads to a “No” (high risk) [11].

For RoB span annotation, we mimic the assessment process by considering evidence text spans in the RCT as the main units of annotation. Each span corresponds to an answer for a specific SQ and is annotated with a label. Multiple spans (sentences or paragraphs) across the RCT can be annotated to answer a single SQ if needed. The annotation label incorporates the domain and the SQ number (for the above example, “1.1” for the first SQ of the first domain). The response option and the risk judgment are also incorporated in the label, such as “1.1 Yes Good” for a well-executed randomization procedure and “1.1 No Bad” otherwise (Figure 2). To improve interannotator agreement (IAA), we collapse “Yes” and “Probably Yes” into a single “Yes,” and similarly “No” and “Probably No” into a single “No” [28]. As shown in Figure 3, these collapsed responses do not affect the final risk judgment for a domain (low, high, or some concerns). Therefore, except for some special case SQs, we collapse these response options in this work. Our scheme includes 22 entities for the 22 SQs, each with 2 response options (“Yes” or “No”) and 2 risk judgments (“Good” or “Bad”); “Good” implies low risk and “Bad” implies high risk. The “No Information” option is removed unless there is truly no text evidence, though it remains for specific SQs like SQ 2.1. For instance, if a trial describes a “random number generator and sealed envelopes” but lacks details on envelope opacity, “No Information” is considered an appropriate label.
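The label scheme described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' tooling: the names `COLLAPSED` and `make_label` are hypothetical, and the label format used when no evidence exists (SQ number followed by "No Information", without a risk judgment) is an assumption.

```python
from typing import Optional

# RoB 2 response options mapped to the collapsed options described in the text
# ("Probably yes" -> "Yes", "Probably no" -> "No").
COLLAPSED = {
    "yes": "Yes",
    "probably yes": "Yes",
    "no": "No",
    "probably no": "No",
    "no information": "No Information",
}


def make_label(sq_number: str, response: str, judgment: Optional[str] = None) -> str:
    """Build an annotation label such as '1.1 Yes Good' (hypothetical helper)."""
    collapsed = COLLAPSED[response.strip().lower()]
    if collapsed == "No Information":
        # Assumption: no risk judgment is attached when there is no text evidence.
        return f"{sq_number} {collapsed}"
    # The judgment is passed explicitly because "Yes" does not always mean low risk
    # (for some SQs a "Yes" answer indicates high risk).
    if judgment not in ("Good", "Bad"):
        raise ValueError("judgment must be 'Good' or 'Bad'")
    return f"{sq_number} {collapsed} {judgment}"


print(make_label("1.1", "Probably yes", "Good"))
print(make_label("1.1", "No", "Bad"))
```

Passing the risk judgment separately from the response keeps the sketch faithful to the scheme, in which the response option and the risk judgment are two distinct parts of the label.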

**Figure 2.** Algorithm for suggested judgment of risk of bias (RoB) arising from the randomization process. The figure is recreated from the revised Cochrane RoB 2 tool (RoB 2) [6].

**Figure 3.** Algorithm for suggested judgment of risk of bias arising from the randomization process. The figure is recreated from the revised Cochrane Risk of Bias 2 tool. N: no; NI: no information; PN: probably no; PY: probably yes; Y: yes [6].

Data Collection

A dataset of 41 RCTs from the physiotherapy and rehabilitation domains was compiled by RH. The RCTs included in the corpus were carefully curated from a selection of high-quality journals in physiotherapy and rehabilitation, as recommended by our institute’s librarian (eg, PLoS). To ensure consistency with modern reporting practices, we included only RCTs published after 2010. To facilitate open sharing and publication of the annotated corpus, we included only articles available under the CC-BY-0 license. Additional details are provided in Multimedia Appendix 1.

To create this corpus, PDFs of full-text RCTs were extracted, and each article was paired with its trial registry entry whenever available. Each PDF was renamed with the primary outcome to be assessed using the RoB 2 tool before being uploaded to the annotation software. To ensure that various primary outcome types were represented in the corpus, we included 17, 17, and 7 RCTs addressing objective, subjective, and mortality primary outcomes, respectively, because RoB assessment results depend on the outcome being assessed [30-33]. This rationale is described further in Multimedia Appendix 1.

Visual Placards Development

Although RoB 2 guidelines are widely used for bias assessment, their reliability has been examined in several studies. Minozzi et al [26,34] addressed this issue by creating an implementation document (ID) that reduces subjectivity in the RoB 2 guidelines, providing clearer instructions for assessment. Before implementing the ID, the agreement among 4 expert RoB assessors in the study by Minozzi et al [26,34] was zero, but it improved after adopting the ID. Several other papers explored the subjectivity and reliability of the Cochrane RoB 1 and 2 tools [35,36]. To enhance consistency and reliability, we developed precise annotation instructions using the RoB 2 tool and collaborated with experts to format these into visual placards. Each placard, structured as a flowchart, guides annotators in answering SQs and labeling text with risk judgments. While RoB 2 SQs are mainly factual, they allow for subjective judgments, which the placards help standardize.

Annotation

The annotation process for the 41 RCTs (see “Data Collection” section) began after developing visual placards. Annotators used the complete RoB 2 guidance alongside these placards, following instructions closely for each SQ. For each SQ, the placards guided annotators to relevant sections within the RCTs, to identify and highlight pertinent text to answer each question, selecting labels as defined in the annotation scheme. Domain 2 of RoB 2 was assessed with respect to the effect of assignment to the intervention (intention-to-treat estimand). Signaling questions related to the “effect of adhering to the intervention” were not annotated.

The annotation was done in Tagtog (tagtog Sp. z o.o.), a commercial PDF annotation tool [37]. Of the 41 RCTs, 9 were doubly annotated by RH and KMS to calculate IAA, with the remaining (n=32) singly annotated by RH. Conflict resolution on the doubly annotated RCTs helped refine the visual placards before annotating the rest. After annotating the 9 RCTs, we transitioned to the PAWLS (PDF Annotation With Labels and Structure) annotation tool (Allen Institute for Artificial Intelligence; Figure 4), a free PDF annotation platform [38]. Annotating PDFs directly preserves the structure of sections, tables, and figures, which improved annotation speed and quality and made the task easier for our volunteer experts. Their feedback is detailed in Multimedia Appendix 2.

**Figure 4.** A screenshot of PAWLS (PDF Annotation With Labels and Structure; Allen Institute for Artificial Intelligence) interface with an example PDF and risk of bias (RoB) annotations.

IAA Measure

We report IAA on the doubly annotated RCTs, calculated at 2 levels. At the first level, we assessed annotator agreement on text spans for SQs using pairwise F1, which excludes unannotated tokens and is well-suited for token-level annotation tasks. Pairwise F1 is measured as shown below for each pair of annotators by treating one annotator’s labels as “true” and the other’s as “predicted” [39,40]. In this study, IAA was measured after incorporating visual placards into the annotation process. To evaluate the impact of these placards on annotation quality, we also compare the F1 IAA with results from our previous work, where n=10 RCTs were annotated without the use of placards (Figure 1).

$F1\text{-measure} = \frac{2 \times \text{True Positive}}{2 \times \text{True Positive} + \text{False Positive} + \text{False Negative}}$
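As a minimal sketch of the formula above, pairwise F1 can be computed over sets of annotated tokens. The representation of an annotated token as a `(token position, SQ label)` pair and the function name `pairwise_f1` are assumptions made for illustration; returning 0.0 when neither annotator marked anything is also an assumption, since the formula is undefined in that case.

```python
from typing import Set, Tuple

# Hypothetical representation: (token position in the document, SQ label).
Token = Tuple[int, str]


def pairwise_f1(true_tokens: Set[Token], predicted_tokens: Set[Token]) -> float:
    """Pairwise F1 between two annotators; unannotated tokens are never counted."""
    tp = len(true_tokens & predicted_tokens)   # tokens both annotators labeled identically
    fp = len(predicted_tokens - true_tokens)   # labeled only by the "predicted" annotator
    fn = len(true_tokens - predicted_tokens)   # labeled only by the "true" annotator
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # assumption: no annotations from either annotator
    return 2 * tp / denominator


annotator_a = {(10, "1.1"), (11, "1.1"), (12, "1.1")}
annotator_b = {(11, "1.1"), (12, "1.1"), (13, "1.1")}
print(round(pairwise_f1(annotator_a, annotator_b), 3))
```

Because the numerator and denominator are symmetric in false positives and false negatives, the score does not depend on which annotator is treated as "true."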

At the second level, we assessed annotator agreement on risk judgments for each SQ using prevalence and bias-adjusted κ (PABAK) and observed percent agreement. PABAK (κ_PABAK), an extension of Cohen κ that accounts for prevalence and bias, is commonly used for classification tasks and is well suited to evaluating reliability at the risk judgment level. Interpretation guidelines for both IAA measures are shown in Table 3 [41-44].

Table 3. Interpretation of pairwise F1-measure, κ_PABAK, and observed agreement.

Metric and interpretation	Value
Pairwise F1 (%)
Poor	0‐0.99
Slight	1‐20.99
Fair	21‐40.99
Good	41‐60.99
Substantial	61‐80.99
Almost perfect	81‐99.99
Perfect	100
κ_PABAK
No agreement	≤0
None to slight	0.01-0.20
Minimal	0.21-0.39
Weak	0.40-0.59
Moderate	0.60-0.79
Strong	0.80-0.90
Almost perfect	>0.90
Perfect	1.00
Observed agreement
None	0%
Very low	1%‐10%
Low	11%‐30%
Moderate	31%‐50%
High	51%‐70%
Very high	71%‐90%
Perfect	>90%
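To make the risk judgment agreement measures concrete, the sketch below computes observed agreement and PABAK for two annotators. It assumes the standard definition of PABAK for k categories, (k × P_O − 1)/(k − 1), which reduces to 2 × P_O − 1 for 2 categories; the function names and the example judgments are hypothetical and are not taken from the corpus.

```python
from typing import Sequence


def observed_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which both annotators gave the same judgment."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must judge the same non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pabak(a: Sequence[str], b: Sequence[str], n_categories: int = 2) -> float:
    """Prevalence and bias-adjusted kappa, assuming n_categories possible judgments."""
    p_o = observed_agreement(a, b)
    return (n_categories * p_o - 1) / (n_categories - 1)


# Hypothetical judgments for one SQ across 4 doubly annotated RCTs.
annotator_1 = ["Good", "Good", "Bad", "Good"]
annotator_2 = ["Good", "Bad", "Bad", "Good"]
print(observed_agreement(annotator_1, annotator_2))
print(pabak(annotator_1, annotator_2))
```

With 2 categories, PABAK depends only on observed agreement, which is why it stays stable when one judgment (for example, "Good") dominates the data.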

LLM Evaluation

Our annotation guidelines were initially adapted for benchmarking traditional ML approaches rather than LLMs. This meant we restricted certain annotations, assuming the PDFs would be converted into text via optical character recognition, thus losing table and figure structures that classical ML models cannot interpret without significant adjustments [45,46]. Recent advances in LLMs offer a better alternative and led us to rethink the evaluation. The bar for clinical applications is high, and it is necessary to evaluate LLMs on challenging clinical tasks such as RoB span extraction [47]. ChatPDF allows direct interaction between LLMs and PDFs, avoiding the clumsy PDF-to-text conversion [48]. Therefore, we consider it essential to evaluate LLMs instead of forcing the evaluation into a classical ML problem.

We formulated the task as a zero-shot RoB text span extraction task with an aim to gauge whether an LLM encodes knowledge related to assessing trial biases. We used simple prompt constructs of the structure “Answer the {SQ} + Action item to extract sentence supporting the answer” (Textbox 1). ChatPDF used these prompts to identify relevant paragraphs and generate answers to the SQs using GPT-3.5, mirroring the annotators’ task.

Textbox 1. Example prompt used for large language model evaluation.

Question 4.3: Were outcome assessors aware of the intervention received by study participants? Provide an answer and extract the supporting sentences that you write your answer based on. Extract the sentences in JSON format.
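The prompt construct ("Answer the {SQ} + action item") can be sketched as a simple template. The action item text is copied from Textbox 1; the names `ACTION_ITEM` and `build_prompt` are hypothetical, and the sketch only builds the prompt string without calling ChatPDF or any model.

```python
# Action item copied from Textbox 1.
ACTION_ITEM = (
    "Provide an answer and extract the supporting sentences that you write "
    "your answer based on. Extract the sentences in JSON format."
)


def build_prompt(sq_number: str, sq_text: str) -> str:
    """Combine a signaling question with the extraction action item."""
    return f"Question {sq_number}: {sq_text} {ACTION_ITEM}"


prompt = build_prompt(
    "4.3",
    "Were outcome assessors aware of the intervention received by study participants?",
)
print(prompt)
```

Keeping the action item fixed and varying only the SQ means the same instruction is applied to all 22 questions, so differences in output are easier to attribute to the question itself.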

LLM performance was measured by agreement on response options and on the extracted text span (evidence). If the LLM’s response matched the expert annotator’s selection, it was considered correct. Likewise, if the text extracted by the LLM as evidence for answering the SQ fuzzy-matched the text selected by the expert annotator, it was considered correct. Both skills were evaluated using observed agreement metrics, $P_{O}(\text{Extraction})$ for agreement over extraction and $P_{O}(\text{Response})$ for agreement over response judgments, and interpreted as per Table 3. Observed agreement is the number of documents for an RoB SQ where LLM responses align with those of the human expert, divided by the total number of documents assessed [49]. For cases where annotators found no information in the RCT, ChatPDF’s ability to recognize this absence was also evaluated. We set the temperature to 0 to ensure a deterministic setting for span extraction and response generation, so that exact text spans are extracted from the input RCT. This evaluation was done manually for 10 of the 41 annotated RCTs. Details of the RCTs used for LLM evaluation are in Multimedia Appendix 1.
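The evaluation in this study was performed manually. The sketch below is only an automated approximation of the same idea, under stated assumptions: fuzzy matching is implemented with `difflib.SequenceMatcher` from the Python standard library, the 0.8 similarity threshold is arbitrary, `None` stands for "no information found," and all function names and example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def fuzzy_match(extracted: Optional[str], reference: Optional[str],
                threshold: float = 0.8) -> bool:
    """True if the LLM-extracted text is close enough to the expert-selected text."""
    if extracted is None or reference is None:
        # Agreement on absence: both sides must report that no evidence exists.
        return extracted is None and reference is None
    ratio = SequenceMatcher(
        None, extracted.strip().lower(), reference.strip().lower()
    ).ratio()
    return ratio >= threshold


def observed_agreement(matches: List[bool]) -> float:
    """Documents where the LLM agrees with the expert, divided by documents assessed."""
    if not matches:
        raise ValueError("at least one document is required")
    return sum(matches) / len(matches)


# Hypothetical (LLM extraction, expert span) pairs for one SQ across 4 RCTs.
pairs = [
    ("Outcome assessors were blinded to group allocation.",
     "outcome assessors were blinded to group allocation"),
    (None, None),
    ("Participants were recruited from two clinics.",
     "Assessors were unaware of the assigned intervention."),
    ("The physiotherapist assessing outcomes was masked.",
     "The physiotherapist assessing outcomes was masked."),
]
matches = [fuzzy_match(llm, expert) for llm, expert in pairs]
print(observed_agreement(matches))
```

The same `observed_agreement` function applies to response options by replacing the fuzzy match with an exact comparison of the collapsed responses.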

Visual Placards

A total of 27 placards were developed to address the 22 SQs. Details of the annotation guidelines and visual placards are available in Multimedia Appendices 1 and 3. Figure 5 shows an example placard for annotating SQ 3.1 (“Were data for this outcome available for all, or nearly all, participants randomized?”), which assesses the completeness of outcome data in an RCT. Missing outcome data can compromise statistical power and treatment effect estimates. The first diamond on the placard instructs annotators to check the “Results” section (priority indicated by a green arrow) and the flowchart and table within the “Results” section (second priority) to identify outcome data at the specified time point. If outcome data were available for at least 95% of participants, annotators mark the relevant text descriptions as “3.1 Yes Good,” indicating a low risk of bias. If data were available for fewer than 95%, they mark them as “3.1 No Bad,” indicating a high risk of bias. The placard includes visual cues to guide the annotation process efficiently.

**Figure 5.** Sample annotation instruction placard for the signaling question (SQ) 3.1 designed and adapted using the Cochrane Risk of Bias (RoB) 2 tool [6].
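The decision rule on the SQ 3.1 placard can be expressed as a short function. This sketch covers only the 95% threshold described in the text and not the placard's guidance on where to look in the RCT; the function name `label_sq_3_1` and its parameters are hypothetical.

```python
def label_sq_3_1(n_randomized: int, n_with_outcome_data: int) -> str:
    """Apply the 95% completeness threshold described for SQ 3.1."""
    if n_randomized <= 0 or not 0 <= n_with_outcome_data <= n_randomized:
        raise ValueError("counts must satisfy 0 <= n_with_outcome_data <= n_randomized")
    # Integer comparison avoids floating-point rounding at exactly 95%.
    if n_with_outcome_data * 100 >= 95 * n_randomized:
        return "3.1 Yes Good"  # data available for at least 95% of participants
    return "3.1 No Bad"        # data available for fewer than 95% of participants


print(label_sq_3_1(n_randomized=120, n_with_outcome_data=116))
print(label_sq_3_1(n_randomized=120, n_with_outcome_data=100))
```

In practice the annotator still has to locate the participant counts in the "Results" section or the flowchart, which is the part of the task the placard's visual cues are designed to support.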

The Corpus: RoBuster

Figure 6 shows that SQ 1.3 had a much higher number of annotated tokens (n=16,446) than the other SQs, each of which had fewer than 2600 tokens. This is because SQ 1.3 required annotating the entire baseline patient characteristics table, as instructed by the visual placards. For other questions, the number of annotated tokens depended on the amount of detail provided on study design, methods, and results, which affected both annotation count and assessment subjectivity. Most other SQs had fewer than 2000 annotated tokens, with SQ 3.1 slightly exceeding this threshold. SQ 2.4 had the fewest, with only 25 tokens.

**Figure 6.** Total number of token annotations in RoBuster for each risk of bias (RoB) signaling question (SQ).

Figure 7, which shows the distribution of risk judgments across RoBuster, highlights that, for most SQs, no information was available (yellow bars) to answer them. Exceptions included SQs 1.1, 1.2, 1.3, 2.2, 2.6, 3.1, 4.3, 4.4, and 5.2, where over 50% of documents had relevant information. Where at least some information was available, bias tended to be low (green bars), with the exception of SQs 2.1, 2.2, 3.1, 4.3, and 4.4, where bias was high (red bars). Studies with comprehensive information made evaluation easier, whereas those lacking key details made it challenging. Annotator feedback indicated that questions with fewer than 100 annotated tokens were consistently rated as “(very) low” information availability, whereas the top 5 SQs shown in Figure 6 were rated as “high” or “normal” availability. All study references are listed in Multimedia Appendix 1.

**Figure 7.** Distribution of bias judgment across risk of bias (RoB) signaling questions (SQs) in RoBuster.