The death knell for the home exam?
AI-generated answers go undetected and get higher grades. That is the conclusion of a new study conducted at the University of Reading.
The study in question was recently published in the journal PLOS ONE. Researchers at the University of Reading were granted permission to create 33 fictitious student accounts on the BSc psychology programme. These ‘students’ took part in online exams; however, their answers were generated by GPT-4. There were two types of exam: Short Answer Questions, which consisted of four 200-word answers submitted within a 2.5-hour window, and Essay Based Questions, where the submission was a single 1500-word essay with an 8-hour time limit.
The researchers used two standard prompts to GPT-4, one for each type of exam, to produce the answers. For the short answers the prompt was:
Including references to academic literature but not a separate reference section, answer the following question in 160 words
For essay-based answers the prompt was:
Including references to academic literature but not a separate reference section, write a 2000 word essay answering the following question
The prompts were followed by the actual exam question.
The reader may have noticed that whilst the questions asked for 200 and 1500 words, the prompts asked for 160 and 2000 words. This is because ChatGPT tended to write too many words for the short answers and too few for the essay answers. In fact, the authors had some difficulty getting essays that were long enough. This was solved by asking ChatGPT to “please continue your answer”. The resulting text was simply appended to the previous text without any attempt to blend the pieces together.
The proportion of AI ‘students’ per exam was limited to approximately 5%, and since this meant extra work, funding was provided for additional markers. Before the modules were marked, a calibration meeting was held for the whole marking team, following a method similar to that used in many departments here at SU: a small number of essays is graded by the full marking team and the grades are then compared and discussed. Exams were run on two separate platforms, with plagiarism detection software available. Markers were asked to flag any strange or suspicious answers.
To avoid detection by the plagiarism tools, the authors could not submit the same answer twice; instead, ChatGPT was simply asked to regenerate the previous answer. In all cases, the submitted text was taken directly from ChatGPT, the only change being the deletion of OpenAI’s disclaimer.
The researchers found that hardly any of their AI-generated submissions were flagged as suspicious, with 94% passing through the marking stage unnoticed. On average, grades achieved by AI submissions were just over half a grade higher than those achieved by real students, though this varied across modules.
The authors conclude that it is not only feasible for students to submit AI-generated answers; by doing so, they also increase their chances of a higher grade. This is particularly worrying since only ‘raw’ AI output was submitted in the study. A more likely scenario would be for students to modify AI-generated text before submission.
Comment: A year and a half ago I wrote an article for this newsletter entitled “Who’s afraid of ChatGPT?”. In that article I suggested that the main issue for assessment was that lecturers were reluctant to envisage types of examination other than written essays, such as oral exams, or video or audio submissions in which students discuss their essays or reflect metacognitively on their writing process. Going forward, it is much more interesting to speculate on how AI can be integrated into our courses, transforming our focus. In my previous article, I pointed out that:
In many undergraduate courses we have traditionally spent a lot of time and energy on helping students with the mechanics of producing texts. We discuss the layout of lab reports, the structure of introductions in essays, the importance of presenting counter-arguments in argumentative texts, etcetera. All of this is what language models such as ChatGPT do extremely well.
Many have suggested that the answer to ChatGPT’s superiority is a return to in-person, formal examinations with invigilators. However, I believe that is a knee-jerk reaction to the problem. The main issue, as I see it, is that in formal written examinations we create an artificial world in which ChatGPT does not exist. But in the real world ChatGPT does exist, and we would expect our students to be able to use it. One answer could be to ask students to keep records of how they have used ChatGPT in an iterative process to create their final essay. But there are also other forms of assessment that we could be using instead of written exams.
I asked ChatGPT for a list of alternative assessment methods, and it came up with sixteen different forms that could be used instead of a written, hand-in task (repeated verbatim below):
1. Oral Presentations
- Rationale: Students must present their ideas or research verbally, demonstrating understanding, communication skills, and the ability to think on their feet.
- Variation: Use live Q&A sessions to ensure students understand the content rather than memorizing a script.
2. In-Class Debates or Discussions
- Rationale: Encourage students to defend their positions in real-time, requiring them to think critically, collaborate, and apply knowledge.
- Variation: Assign roles (e.g., for and against) to ensure a balanced discussion and deeper engagement with the material.
3. Project-Based Learning
- Rationale: Students work on long-term projects, often with multiple steps that must be completed and presented at different stages.
- Examples: Research projects, building prototypes, designing experiments, or solving real-world problems. The process is just as important as the final product.
4. Portfolios
- Rationale: A portfolio allows students to showcase a range of skills and projects they’ve worked on over a period, emphasizing growth, reflection, and creativity.
- Structure: Include reflection essays, drafts, and a log of progress to ensure the student’s active involvement in creating the final product.
5. Group Work and Collaborative Projects
- Rationale: Working in teams makes it harder to rely solely on AI-generated content since students must communicate, divide tasks, and contribute uniquely to the project.
- Assessment: Evaluate both the final product and the group dynamic, peer feedback, and individual contributions.
6. Reflective Journals or Logs
- Rationale: These require students to document their thought process and learning experiences over time, focusing on personal growth rather than just producing an output.
- Variation: Use guided prompts that ask for analysis of class material, personal responses, or how students approach challenges.
7. Case Studies and Problem-Based Assessments
- Rationale: Present students with real-world problems or scenarios, requiring them to apply what they’ve learned. This emphasizes problem-solving and application rather than just theoretical knowledge.
- Variation: Include multiple stages, such as research, solution proposal, and justification.
8. Poster Sessions or Infographics
- Rationale: Students synthesize information into a visual format, demonstrating understanding in a creative way. Presenting the poster or infographic orally can reinforce learning.
- Variation: Add a peer review component to foster collaborative feedback.
9. Practical or Hands-On Assessments
- Rationale: In fields like sciences, arts, and technical subjects, practical assessments (labs, performances, or demonstrations) can directly measure a student’s applied skills.
- Examples: Conducting an experiment, creating an artwork, or coding a project.
10. Viva Voce (Oral Exams)
- Rationale: Oral exams require students to explain concepts or answer questions in real time, demonstrating their depth of knowledge and ability to articulate ideas under pressure.
- Variation: Use structured, semi-structured, or unstructured formats, depending on the subject.
11. Video Presentations or Recorded Demonstrations
- Rationale: Students create video presentations of their work, which adds a creative and personal dimension while allowing the instructor to assess their ability to present and communicate ideas clearly.
- Variation: For certain subjects, students could record themselves teaching a concept to peers or demonstrating a skill.
12. Learning Diaries
- Rationale: Students maintain a regular log of what they’ve learned, how they’ve applied it, and reflections on their development. This ongoing assessment fosters continuous engagement with course material.
- Assessment: Focus on the student's ability to self-assess and articulate their learning journey.
13. Competency-Based Tasks or Simulations
- Rationale: Set up practical, real-world scenarios or role-plays (e.g., mock trials, business pitches, or medical simulations) to assess students’ ability to apply knowledge in action.
- Variation: Could include time-sensitive tasks that require quick thinking, preparation, and execution.
14. Peer Assessment and Peer Teaching
- Rationale: Students assess each other’s work or even teach sections of the course to their peers, fostering engagement, understanding, and accountability.
- Variation: Scaffold peer assessments with rubrics to guide the evaluation process.
15. Open-Book or Take-Home Exams with Complex Questions
- Rationale: Use higher-order, application-based questions that require more than a simple AI-generated response, such as asking for reflections, comparisons, or detailed solutions to multi-step problems.
- Assessment: Focus on the process and reasoning behind the answers.
16. Role-Playing or Scenario-Based Tasks
- Rationale: Students take on specific roles or personas in given scenarios (e.g., acting as a historical figure, CEO, or policymaker) to explore perspectives and solve problems from that angle.
- Variation: Combine this with a reflective analysis or presentation.
At the moment, SU policy on the use of AI is somewhat unclear. The Department of Computer and Systems Sciences has, however, produced its own guidelines (in Swedish).
It seems clear that AI will form a major part of our future work in universities, similar to the way calculators and computers became integral to mathematics and science. But for now, many will regret the end of the home exam.
Text: John Airey, Department of Teaching and Learning
Keywords: Exams, ChatGPT, AI, Turing test