This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Recommendations for the integration of generative artificial intelligence in support of engineering education research workflows
Citations: 0
Authors: 3
Year: 2025
Abstract
Generative artificial intelligence (AI) is a disruptive technology that is altering the way we live and work, and academia is no exception. Within academia specifically, generative AI is making its way into research, the classroom, and administrative workflows. For instance, recent work has documented how generative AI can support the creation of systematic literature reviews (Gwon et al., 2024; Hossain, 2024; Mozelius & Humble, 2024; among others) and even automate scientific workflows (e.g., Ghafarollahi & Buehler, 2025). As generative AI becomes increasingly integrated into research processes, it is essential to demonstrate its efficacy and accuracy—while also safeguarding human agency (Watkins & Barak-Medina, 2024)—in mimicking human activities (Miao & Holmes, 2023). In tandem, it is equally important to identify which uses of generative AI in support of research workflows are acceptable, and which are controversial, within discipline-specific research communities (Andersen et al., 2025; Watkins, 2023). For instance, one study found that Danish researchers from varied disciplines hold both positive and negative views on the use of generative AI in support of research workflows: although uses in data analysis may be viewed positively, uses in experimental design may be viewed as controversial (Andersen et al., 2025). Furthermore, in the case of qualitative analysis, which is common in engineering education research, as you will read in this special issue, researchers may be exposed to a variety of ethical dilemmas, such as data ownership and rights, data privacy and transparency, interpretive sufficiency, and potential biases, among others (Davison et al., 2024). As a response to this technological disruption, this special issue contributes new knowledge about how generative AI can be used, and is being used, to support engineering education research workflows.
In May of 2024, we invited the engineering education research community and other STEM education disciplines to contribute manuscripts that showcase ways in which generative AI can be used to support engineering education research workflows. In response, we received 43 extended abstracts with proposals. After an initial review by all three editors, 19 proposals were invited for manuscript submission, and a total of 13 full manuscripts were received and sent out for peer review. The final collection of seven published manuscripts showcases how generative AI can be used to support data collection, data scoring, and data analysis processes, as shown in Figure 1. In each of the contributions, authors were asked to consider the validity and reliability of the AI-generated results, aspects of reproducibility and transparency, and the overall benefits and limitations of the use of generative AI. Authors were also highly encouraged to explicitly address issues of ethics, bias, privacy, and security, as applicable. Furthermore, each of the studies delivers not only important methodological contributions for the use of generative AI in research workflows but also theoretical contributions to engineering education research, as is the standard for the Journal of Engineering Education (JEE). In the following sections, we elaborate on the contributions of each study as indicated in Figure 1. To study the potential benefits and risks of using large language models (LLMs) to generate qualitative research data, Sanders et al. (2025) conducted a structured two-phase investigation that compared AI-generated conversational text with interview data previously collected from 24 engineering faculty members at 17 institutions and 14 undergraduate students at a single institution.
Drawing on prompt-engineering strategies and the development of student and faculty personas, the authors generated responses from ChatGPT-4 to the same kinds of questions posed in the human interviews; for example, “What comes to mind if I asked you to describe what you think a ‘culture of wellness’ would look like in your department?” Their central goal was to explore the affordances and limitations of LLMs for qualitative inquiry and to examine how these generated narratives might reproduce or resist dominant cultural stories surrounding high stress in engineering programs. The analysis was guided by the “Idealized Worker” framework, which argues that organizational structures often reinforce conformity to a privileged archetype (e.g., a White, able-bodied, straight, cisgender man). This lens allowed the researchers to critically assess whether the LLM outputs perpetuated narrow, exclusionary norms of who belongs in engineering. They found that AI-generated content frequently paralleled human responses (7 of 30 questions showed high similarity and 20 showed moderate similarity), but it also tended to stereotype experiences and lacked the nuance and variability present in lived perspectives. Moreover, while the LLM responses more readily proposed structural changes to reduce stress, human participants more often emphasized personal or interpersonal approaches. These findings highlight both the potential and the limitations of LLMs in qualitative research. On one hand, the models served as useful brainstorming tools for developing interview protocols and generating novel ideas for systemic change. On the other hand, biases in the training data led to limited perspectives that reinforced the idealized-worker narrative, and the stochastic nature of the technology meant that small prompt changes could produce markedly different outputs, complicating process reliability and procedural validation.
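In outline, persona-based prompting of this kind assembles a chat-style message list in which a system message fixes the persona and the interview question forms the user turn. The sketch below is illustrative only: the persona text and helper name are hypothetical, not the study's actual prompts or code, and the resulting messages would be sent to an LLM API such as the one behind ChatGPT-4.

```python
# Minimal sketch of persona-based prompting (illustrative; not the
# authors' actual protocol or prompt wording). A system message sets
# the persona; the interview question is posed as the user turn.
def build_persona_prompt(persona_description, interview_question):
    """Assemble a chat-style message list for an LLM API call."""
    return [
        {
            "role": "system",
            "content": (
                "You are participating in a research interview. "
                "Answer in the first person, staying in character as "
                + persona_description + "."
            ),
        },
        {"role": "user", "content": interview_question},
    ]

# Hypothetical persona paired with a question quoted from the study.
messages = build_persona_prompt(
    "a mid-career engineering professor at a large public university",
    "What comes to mind if I asked you to describe what you think a "
    "'culture of wellness' would look like in your department?",
)
```

Because the same message list can be resubmitted with different personas or sampling temperatures, this structure also makes the stochastic variation the authors observed easy to probe systematically.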
The study offers methodological guidance for researchers: LLMs can support critical exploration and idea generation, but their reliance on broad, public training data requires careful scrutiny and points to the need for future work using models trained specifically on engineering education data. Mburu et al. (2025) explored the use of LLMs to generate adaptive, contextually and personally relevant survey questions, thus falling into the category of innovations in data collection approaches. The researchers presented a step-by-step method for developing a dynamic, AI-driven survey instrument and introduced the synthetic question-response analysis (SQRA) framework to evaluate AI-generated questions before involving human participants. Using activity theory as a theoretical lens, the authors examined the interactions between AI-generated content and survey respondents. Their findings indicate that while AI-generated questions effectively incorporated course-specific references and adapted to the context, several issues were identified. Specifically, the authors found that a common problem involved the presence of double-barreled questions, which could compromise the reliability and clarity of the data. Also, the AI often produced redundant phrasing, repeating similar structures and vocabulary, which risked lowering student engagement and limiting response depth. The questions were also frequently too lengthy and occasionally included implicit evaluations, such as affirmations or personal judgments. Another concern was the use of jargon lifted directly from the prompt, which could hinder participant understanding. To address the identified issues, the researchers iteratively refined the system prompt using structured guidelines from Walther et al.'s (2017) quality framework for qualitative research. These revisions focused on minimizing redundancy, eliminating double-barreled questions, and removing implicit evaluative language. 
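One issue the refinement targeted, double-barreled questions, can be screened for automatically before human review. The function below is a deliberately crude illustration of such a screen, not part of the SQRA framework: it flags any question containing a standalone "and"/"or", so it over-flags harmless phrasing and its output would still need human judgment.

```python
import re

def flags_double_barreled(question):
    """Crude screen for double-barreled survey questions.

    Flags any question containing a standalone 'and'/'or', which in the
    worst case joins two distinct asks. Over-flags harmless phrasing,
    so flagged items still require human review.
    """
    q = question.strip().rstrip("?")
    return bool(re.search(r"\b(and|or)\b", q, flags=re.IGNORECASE))

flags_double_barreled(
    "Did the project brief seem clear, and did it motivate you?")  # True
flags_double_barreled(
    "Which course topic did you find most challenging?")           # False
```

In an iterative pipeline like the one described above, such a check would sit between generation and deployment, routing flagged questions back into the prompt-refinement loop.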
As a result of this multi-round refinement process, the clarity, conciseness, and overall quality of the AI-generated questions improved significantly, aligning more closely with established best practices in survey design. Based on the findings, the authors concluded that the SQRA framework, despite its limitations in simulating human response variability, proved helpful in refining question quality through iterative feedback. The study concluded that AI-driven question generation offers promising benefits for scalable and personalized survey design, but emphasized the need for further research, consideration of ethical implications, and methodological innovation to ensure the development of trustworthy AI tools in educational research. Drinkwater et al. (2025) investigate the reliability and utility of generative AI in applying a feedback quality rubric to 295 peer feedback comments written by first-year engineering students in a project-based learning course. These comments, collected through a web-based tool (CATME) three times during the semester, were analyzed using a four-criterion rubric (i.e., Task, Behavior, Gap, Action) adapted for engineering contexts. The study's dual goals were to evaluate the reliability of the LLM's rubric application and to explore what the results reveal about students' feedback literacy. Methodologically, the study followed a matched mixed-methods design, using both human raters and an LLM to code the same dataset. Researchers first refined the rubric through iterative rounds of coding to improve inter-rater reliability (IRR), and then compared the LLM's ratings with the human ratings using Cohen's quadratic weighted kappa. To ensure transparency and mitigate bias, the authors employed prompt-engineering “best practices” (e.g., persona setting, chain-of-thought reasoning, XML tagging) and conducted automated de-identification of peer comments to remove gendered language and names.
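Cohen's quadratic weighted kappa, the agreement statistic used here, penalizes disagreements between two raters by the squared distance between their ordinal rubric levels. A self-contained sketch, assuming integer ratings in 0..k-1 and at least some variation in the ratings, is:

```python
def quadratic_weighted_kappa(a, b, n_categories):
    """Cohen's kappa with quadratic weights for two raters' ordinal codes.

    a, b: equal-length sequences of integer ratings in 0..n_categories-1.
    Assumes the ratings vary (otherwise expected disagreement is zero).
    """
    n, k = len(a), n_categories
    # Observed joint distribution of the two raters' codes.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    # Marginal distributions (used for expected agreement by chance).
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    d_obs = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp
```

Perfect agreement yields 1.0, chance-level agreement yields 0, and systematic disagreement goes negative, which is why this statistic is well suited to comparing an LLM rater against trained human raters on an ordinal rubric.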
Their model selection process balanced technical performance, logical coherence, and accessibility, ultimately choosing the qwen-2.5-32b model. The authors addressed common concerns around AI-assisted qualitative analysis, including reliability, replicability, demographic bias, and the environmental cost of model training. They emphasized that while LLMs can accelerate coding at scale, they often mimic novice-level raters and require extensive prompt testing and quality checks. Notably, both human raters and the model struggled most with subjective criteria, underscoring the complexity of interpreting vague or brief peer comments. The analysis of student comments revealed that most feedback was vague, overly positive, and lacked actionable suggestions, highlighting limited student feedback literacy. A typology of five comment types was developed to illustrate these trends, ranging from low-quality generic praise to rich constructive feedback. Fuchs et al. (2025) investigate the value of synthetic data for applying LLMs to evaluate feedback provided to engineering students. The goal was to develop and validate a framework for training sentence transformers using generative AI–created synthetic data to categorize student feedback interactions in engineering studios. To achieve this, the authors de-identified and transcribed eight real-world engineering studio conversations and then used them as foundations to generate synthetic feedback transcripts with three locally hosted open-source LLMs, namely Llama 3.1, Gemma 2.0, and Mistral NeMo, adjusting parameters such as temperature. This process produced three synthetic datasets of engineering conversations that included feedback interactions, which were assessed through a methodological approach based on the SynEval Framework, combining human evaluation and computational data-variance checks. 
These synthetic interactions were subsequently used to train a sentence-transformer model (SetFit) alongside the real data, allowing the researchers to compare its classification performance to human coding. Guided by Kluger and DeNisi's feedback intervention theory (FIT) and Pekrun's control value theory of achievement emotions, the team created a unified real-time codebook to characterize both cognitive and emotional responses of students to feedback. The findings show that synthetic data can substantially enhance natural language processing (NLP)-assisted qualitative analysis: training with Llama 3.1 synthetic data improved SetFit's accuracy in distinguishing directive from facilitative feedback from 68.4% to 81%. While synthetic data offers a powerful way to expand qualitative research in contexts where real NLP training data are scarce, the authors note drawbacks such as occasional extraneous details and missed instructor-dominant discourse. By using locally hosted LLMs to protect privacy and ensure secure processing, this study demonstrates the opportunities and limitations of employing synthetic data for real-time evaluation of feedback interactions and provides a replicable methodology for research at scale. Auby et al. (2025) present a human-centered AI approach for analyzing student understanding through short-answer justifications to conceptually challenging questions in engineering mechanics and thermodynamics. Recognizing the practical challenge instructors face in evaluating student reasoning at scale, the authors apply the cognitive resources framework, seen as context-activated “chunks” of knowledge, to examine the thought processes underlying student responses. The background and conceptual framework sections set the stage for a rigorous study by conducting a thorough analysis of existing LLM technologies for this type of analysis and making the case for a human-centered AI approach. 
Then, through qualitative human coding of seven concept questions, the study first identifies key elements of student reasoning, aiming to characterize the different ways student writing reflects students' understandings. The authors then evaluated several models, providing important details of how they implemented and assessed each, and found that Mixtral and Llama-3 performed best within the original dataset, while GPT-4 and GPT-4o-mini demonstrated stronger generalization to unseen data. This work provides evidence-based suggestions on how to use specific generative models in a given context. The study also makes important contributions by using the resources framework to understand students' short answers and how the cognitive resources they used related to their ability to answer the concept questions correctly. Osunbunmi et al. (2025) report on research exploring how machine learning (ML) models and explainable AI (XAI) assessments can improve predictions of undergraduate engineering student retention. Using a 10-year dataset (Fall 2007–Fall 2016) from the College of Engineering at their institution, comprising more than 16,000 observations, the authors compared a variety of dimensionality-reduction methods (e.g., forward, backward, and unidirectional stepwise regression), regularization techniques (LASSO, Ridge), and predictive modeling approaches (k-nearest neighbors, logistic regression, decision trees, artificial neural networks, gradient boosting, and random forest). They also applied XAI methods such as LIME, SHAP, and mutual information to surface key factors influencing model results.
Guided by Tinto's integrated model of student departure, complemented by Astin's student involvement framework, the study examined predictors including pre-college preparation (e.g., SAT scores), academic performance in early core courses, demographic characteristics, and engagement in co-curricular activities. The authors found that academic variables, especially GPA in the first 2 years and SAT math scores, were the strongest predictors of persistence, with demographic factors and co-curricular engagement contributing, though less strongly. Random forest emerged as the best-performing model, followed by gradient boosting and artificial neural networks. By combining numerous ML techniques with the interpretability of XAI, the research demonstrates computational approaches that not only outperform traditional statistical methods in handling large, complex datasets but also provide transparent, actionable insights for institutions seeking to support engineering student persistence and graduation. Ross and Katz (2025) use generative AI to investigate the persistent issue of attrition in computer science (CS) and the factors influencing career changes across various stages and contexts, guided by two key questions: (1) What are the reasons for leaving? and (2) What external factors influence these decisions? The authors conducted a large-scale qualitative study by collecting over 10,000 Reddit posts through keyword-based scraping and refining them to 263 relevant posts using generative AI for efficient data filtering. They then applied the generative AI–enabled theme organization and structuring (GATOS) workflow for AI-assisted thematic analysis, which mimics the steps of human qualitative analysis: reading raw data, summarizing it into distinct ideas with the Qwen2.5-32b model, identifying semantically similar ideas, creating the codebook iteratively (including k-nearest neighbors and cosine similarity matching), and simplifying the codebook into broader themes.
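The codebook-matching step of a GATOS-style workflow, pairing each summarized idea with its most similar existing codes via cosine similarity over embeddings, can be sketched as follows. The toy two-dimensional vectors and function names below are illustrative stand-ins for real sentence embeddings, not the published workflow's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_codes(idea_vec, codebook, k=3):
    """Return the k codebook labels whose embeddings are most similar to a
    summarized idea's embedding (a k-nearest-neighbors lookup)."""
    ranked = sorted(codebook.items(),
                    key=lambda item: cosine_similarity(idea_vec, item[1]),
                    reverse=True)
    return [label for label, _ in ranked[:k]]

# Toy 2-D embeddings; a real workflow would use sentence-embedding models.
codebook = {"burnout": [1.0, 0.0], "pay": [0.0, 1.0], "health": [0.9, 0.1]}
```

Ideas whose nearest codes fall below a similarity threshold would seed new codebook entries, which is how the iterative codebook construction described above can grow without pre-specifying the themes.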
This process was extended with a rigorous “human-in-the-loop” approach to ensure accuracy, depth, and contextualization of the AI-generated themes. Social cognitive career theory (SCCT) provided the theoretical lens for interpreting and mapping the AI-generated factors to SCCT constructs (interest, choice, performance, and satisfaction). The analysis revealed six core reasons for departure, namely job dissatisfaction, academic struggles, psychological and emotional factors, interests in other fields, health concerns, and broader industry issues. The analysis also identified influential external factors, including background and preparation, transition requirements, the nature of alternative fields, and personal circumstances. Notably, these reasons and factors appeared across all stages of departure, although their impact varied. The authors emphasize both the advantages of generative AI—such as unprecedented scalability and the ability to generate more generalizable insights—and the ethical considerations of privacy, bias, and AI accuracy. They addressed these concerns by using only publicly available Reddit data and maintaining human oversight for continuous validation. A significant contribution of this study is the demonstration of a replicable, scalable methodology for large-scale qualitative research using generative AI and NLP-assisted methods, offering a model for future studies that face the challenge of limited datasets for NLP training. The collective body of work from this special issue demonstrates that generative AI and LLMs have the potential to profoundly assist engineering education research methodologies, primarily by enabling scalable, data-driven qualitative analysis. 
These advanced computational techniques offer a viable means to address the technical challenge of limited datasets and to accelerate traditionally labor-intensive processes such as rubric-based data scoring and thematic analysis, among others.