Dec 1, 2023
https://hai.stanford.edu/news/researchers-use-gpt-4-generate-feedback-scientific-manuscripts
https://arxiv.org/abs/2310.01783
Two episodes ago I shared the news that for some major scientific publications, it's okay to write papers with ChatGPT, but not to review them. But…
Combining a large language model and open-source peer-reviewed scientific papers, researchers at Stanford built a tool they hope can help other researchers polish and strengthen their drafts.
Scientific research has a peer problem. There simply aren’t enough qualified peer reviewers to review all the studies. This is a particular challenge for young researchers and those at less well-known institutions who often lack access to experienced mentors who can provide timely feedback. Moreover, many scientific studies get “desk rejected” — summarily denied without peer review.
James Zou and his research colleagues tested GPT-4-generated reviews against human reviews of 4,800 real Nature-family and ICLR papers. They found the AI reviewer's feedback overlaps with human reviewers' about as much as human reviewers overlap with each other. On top of that, 57% of surveyed authors found the AI feedback helpful, and 83% said it beat the feedback from at least one of their real human reviewers.
https://dl.acm.org/doi/pdf/10.1145/3616961.3616992
Oz Buruk, from Tampere University in Finland, published a paper giving some really solid advice (and sharing his prompts) for getting ChatGPT to help with academic writing. He uncovered six roles: chunk stylist, bullet-to-paragraph stylist, talk textualizer, research buddy, polisher, and rephraser.
He includes examples of the results and the prompts he used for each role. Handy for people who want ChatGPT to help them with their writing without having to resort to trickery.
https://www.sciencedirect.com/journal/machine-learning-with-applications/articles-in-press
This is a journal pre-proof from the Elsevier journal "Machine Learning with Applications" that looks at how ChatGPT might impact assessment in higher education. Unfortunately it's an example of how academic publishing can't keep up with the rate of technology change: the four academics from the University of Prince Mugrin who wrote it submitted it on 31 May, and it was only accepted into the journal in November. Guess what? Almost everything in the paper has since changed. They spent 13 of its 24 pages detailing exactly which assessment questions ChatGPT got right or wrong, but when I re-tested it on some sample questions, it got nearly all of them correct. They then tested AI detectors, and we both know that's since changed again, with the current advice being that none of them work. Finally, they checked whether 15 top universities had AI policies.
It's interesting research, but to be honest it would have been much, much more useful in May than it is now.
And that's a warning about some of the research we're seeing. You really need to check whether the conclusions are still valid; for example, if a paper doesn't tell you which version of OpenAI's models was tested, the conclusions may not be worth much.
It's a bit like the logic we apply to students: "They've not mastered it… yet".
https://www.jmir.org/2023/1/e49368/
This paper looked at 160 papers published on PubMed in the first three months of ChatGPT, up to the end of March 2023. It was written in May 2023 and has only just been published in the Journal of Medical Internet Research. I'm pretty sure many of the results are out of date; for example, it specifically lists unsuitable uses for ChatGPT, including "writing scientific papers with references, composing resumes, or writing speeches", and that's definitely no longer the case.
https://ajue.uitm.edu.my/wp-content/uploads/2023/11/12-Maria.pdf
This paper, from a group of researchers in the Philippines, was written in August. It references 37 papers and looks at the AI policies of the top 20 universities in the QS Rankings, especially around academic integrity and AI. All of this helped the researchers create a 3E Model: Enforcing academic integrity, Educating faculty and students about the responsible use of AI, and Encouraging the exploration of AI's potential in academia.
https://arxiv.org/ftp/arxiv/papers/2311/2311.02499.pdf
If you're keeping track of the exams that ChatGPT can pass, add linguistics exams to the list. These researchers from universities in Zurich and Dortmund concluded that, yes, ChatGPT can pass them: "Overall, ChatGPT reaches human-level competence and performance without any specific training for the task and has performed similarly to the student cohort of that year on a first-year linguistics exam". (Bonus points for testing its understanding of a text about Luke Skywalker and unmapped galaxies.)
And, I've left the most important research paper to last:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4641653
Researchers at the University of Toronto and Microsoft Research have published a paper describing the first large-scale, pre-registered controlled experiment using GPT-4, looking at maths education. It studied the use of large language models as personal tutors.
In the experiment's learning phase, they gave participants practice problems and manipulated two key factors in a between-participants design: first, whether they were required to attempt a problem before or after seeing the correct answer, and second, whether participants were shown only the answer or were also exposed to an LLM-generated explanation of the answer.
Then they tested participants on new questions to assess how well they had learned the underlying concepts.
Overall, they found that LLM-based explanations improved learning relative to seeing only correct answers. The benefits were largest for those who attempted problems on their own before consulting LLM explanations, but surprisingly the trend held even for participants who saw LLM explanations before attempting the practice problems themselves. Participants also said they learned more when given explanations, and found the subsequent test easier.
Using standard GPT-4 they got a 1-3 standard deviation improvement; using a customised GPT they got a 1.5-4 standard deviation improvement. In the tests, that was roughly the difference between a 50% score and a 75% score.
And the really nice bonus in the paper is that they shared the prompts they used to customise the LLM.
This is the one paper out of everything I've read in the last two months that I'd recommend everybody listening to read.
https://policycommons.net/artifacts/8245911/about-1-in-5-us/9162789/
Some research from the Pew Research Center in America says 13% of all US teens have used ChatGPT in their schoolwork: a quarter of 11th and 12th graders, dropping to 12% of 7th and 8th graders.
This is American data, but it's likely the picture is similar elsewhere.
The UK government's Generative AI call for evidence had over 560 responses from across the education system and is informing future UK policy design. https://www.gov.uk/government/calls-for-evidence/generative-artificial-intelligence-in-education-call-for-evidence
One data point right at the end of the report was that 78% of respondents said they, or their institution, used generative AI in an educational setting.
For many teachers, the goal is to free up more time for high-impact instruction.
Respondents reported six broad challenges that they had experienced in adopting GenAI:
• User knowledge and skills - this was the major thing - people feeling the need for more help to use GenAI effectively
• Performance of tools - including making stuff up
• Workplace awareness and attitudes
• Data protection adherence
• Managing student use
• Access
However, the report also highlights common worries, mainly around AI's tendency to generate false or unreliable information. For history, English and language teachers especially, this could be problematic when AI is used for assessment and grading.
There are three case studies at the end of the report - a college using it for online formative assessment with real-time feedback; a high school using it for creating differentiated lesson resources; and a group of 57 schools using it in their learning management system.
The Technology in Schools survey
The UK government also ran The Technology in Schools survey, which gives them information about how schools in England specifically are set up for using technology. It will help them make policy to level the playing field on tech use in education, which also raises equity questions for new technologies like GenAI.
https://www.gov.uk/government/publications/technology-in-schools-survey-report-2022-to-2023
This is mostly very technical detail about computer infrastructure, but the interesting table I saw was Figure 2.7, which asked teachers which sources they most valued when choosing which technology to use. The list, in order of preference, was: other teachers, other schools, research bodies, and leading practitioners.
My take is that the thing that really matters is what other teachers think, but they don't find it out from social media, magazines or websites.
And only one in five schools have an evaluation plan for monitoring the effectiveness of technology.
And in Australia, two researchers, Jemma Skeat from Deakin University and Natasha Ziebell from Melbourne University, published feedback from surveys of university students and academics. They found that in the period June-November this year, 82% of students were using generative AI, with 25% using it in the context of university learning and 28% using it for assessments.
One third of students in first semester agreed generative AI would help them learn; by second semester, that had jumped to two thirds.
There's a real divide between students and academics.
In first semester 2023, 63% of students said they understood its limitations, like hallucinations, rising to 88% by semester two. Among academics, it was just 14% in semester one and barely more, 16%, in semester two.
Only 22% of students now consider using GenAI in assessment to be cheating, compared to 72% in the first semester of this year! But both academics and students wanted clarity on the rules; this is a theme I've seen across lots of research, and heard from students.
The Semester one report is published here: https://education.unimelb.edu.au/__data/assets/pdf_file/0010/4677040/Generative-AI-research-report-Ziebell-Skeat.pdf
Published 20 minutes before we recorded the podcast, so more to come in a future episode:
The Framework supports all people connected with school education including school leaders, teachers, support staff, service providers, parents, guardians, students and policy makers.
The Framework is based on 6 guiding principles: teaching and learning, human and social wellbeing, transparency, fairness, accountability, and privacy.
The Framework will be implemented from Term 1 2024. Trials consistent with these 6 guiding principles are already underway across jurisdictions.
A key concern for Education Ministers is ensuring the protection of student privacy. As part of implementing the Framework, Ministers have committed $1 million for Education Services Australia to update existing privacy and security principles to ensure students and others using generative AI technology in schools have their privacy and data protected.
The Framework was developed by the National AI in Schools Taskforce, with representatives from the Commonwealth, all jurisdictions, school sectors, and all national education agencies - Educational Services Australia (ESA), Australian Curriculum, Assessment and Reporting Authority (ACARA), Australian Institute for Teaching and School Leadership (AITSL), and Australian Education Research Organisation (AERO).
________________________________________
TRANSCRIPT For this episode of The AI in Education Podcast
Series: 7
Episode: 5
This transcript was auto-generated. If you spot any important errors, do feel free to email the podcast hosts for corrections.
Hi, welcome to the AI in Education podcast. How are you, Ray?
I am great, Dan. Do you know what? Another amazing two weeks of news. I can't keep up. Can you?
You know, there's so much research and news happening. Even this morning, we've got the release of the AI framework, which we touch on a little bit later.
I know, another one.
Oh my word. Okay. Well, look, compared to the world of news, the world of academic research has been moving a bit slower, thank goodness. There have been, again, another 200 papers produced in the last few weeks. So, hey Dan, can I do the usual and run you through my top 20 of the interesting research that I've read?
So, a really interesting one about generating feedback on scientific manuscripts. You remember, Dan, I said that publications were allowing researchers to write papers now with ChatGPT, but they weren't allowed to review them.
Yes.
Another bunch of researchers did the research, and what they did was build a tool that reviews papers, to help researchers polish and finalize their drafts. And the answer was: it was really useful, especially for young researchers who can't get professional reviewers to look at their manuscripts; it's really useful for them to be able to get feedback. The interesting thing is they asked the researchers, did this AI help you to produce a better paper? 57% said they found the feedback helpful, and 83%, so that's four in five, Dan, we'll come back to four in five, said that it beat at least one of the real human reviewers that looked at their papers. So that's really interesting. The second bit of research was about using ChatGPT to help with academic writing. That isn't help in the sense of getting it to write things for you from scratch, the kind of help where you've got a blank page, but in terms of finding roles for ChatGPT that make your writing more effective. Really interesting research, because he talked about six key roles. A chunk stylist: you know, help me rewrite this bit. A bullet-to-paragraph stylist: here are the five bullet points, now turn them into text. A talk textualizer, a research buddy, a polisher, and a rephraser. All really useful. He includes the examples and he includes the prompts. So, if you want it to do that kind of stuff, that paper is really good for that. There's one with a really long-winded title about considerations for adopting large language models in higher education courses, a critical review of the impact of ChatGPT. Now, this has come out as a journal pre-proof. It's a really good example, unfortunately, of how academic publishing cannot keep up with the rate of technology change, because these four academics from the University of Prince Mugrin wrote it and submitted it on the 31st of May, and it has only just been accepted into the journal Machine Learning with Applications
later, right?
Yeah. So, they spent 13 of the 24 pages detailing assessment questions and which ones ChatGPT got right or wrong. Now, I retested it on some of those sample questions and it got nearly all of them correct. Now, the other thing they did: they tested AI detectors.
What do we both know about AI detectors, Dan?
No, they don't work.
Yeah. But one thing that I thought was useful is they looked across the top 15 universities to see if they had AI policies. I'd actually say that research is a warning about some of the research we're seeing. You really need to check carefully if the conclusions it made are still valid. Like, did they test it with the current OpenAI model, or did they use a previous one? The way I think about it, when people evaluate whether AI can do something, it's not "yes it can" or "no it can't". I actually think it's like we do with students: not mastered it yet.
That's kind of my feeling. Some other papers had a similar challenge with delayed publishing: a strengths and weaknesses analysis of ChatGPT in the medical literature. Probably useful, because they looked at 160 papers. So if you want to know about ChatGPT and medicine, there are 160 papers linked in it, but a lot of the results are out of date. There was some work done around academic integrity, talking about academic integrity in the age of ChatGPT and generative AI. The paper was written in August, so not completely out of date. It references 37 papers, but probably the interesting thing is they looked at the top 20 QS ranking universities for their policies around academic integrity and AI, and they created a nice simple model they called the three E model. They said that when it comes to generative AI and academic integrity, think about three Es: enforcing integrity, educating faculty and students about responsible use, and encouraging the exploration. I think that's really good.
Enforcing? Yes. Educating? Yes. Encouraging the exploration? Yeah. Absolutely.
Yes.
So, are you keeping track, Dan, of the exam papers that ChatGPT has passed?
I was talking to a customer yesterday, actually.
Yes. It can be a financial analyst. It can be a stockbroker. It can pass the medical tests.
Everybody talks about the bar exam.
Yeah. So, the latest thing, according to the research, is it's now a linguist. An expert linguist. These researchers from universities in Zurich and Dortmund came to the conclusion that, yes, ChatGPT can pass exams in linguistics, and their conclusion is: overall, ChatGPT reaches human-level competence and performance without any specific training for the task, and has performed similarly to the student cohort of that year on a first-year linguistics exam.
Correct.
And I'm going to give the researchers bonus points. A lot of the research is very dry and inaccessible. But this one, they were testing its understanding of a text about Luke Skywalker and unmapped galaxies. Fun for you, Dan.
Okay. So, I left the most important research paper to last. The paper is called Math Education with Large Language Models: Peril or Promise. So, from that, Dan, you know it comes from which country?
Math.
Math. America. The US, of course.
Exactly. So, actually it comes from some...
Canada.
Yeah. You're looking ahead at the notes. The research is from the University of Toronto and Microsoft Research. Microsoft Research, I remember when I used to work with them: it's a bunch of people who are academic researchers, they just happen to work for Microsoft, but they do some amazing blue-sky research. And this is the largest bit of research I think so far, a large-scale, pre-registered controlled experiment using GPT-4, looking at it in the context of maths education. So basically
yeah
they were looking at: can a large language model be a personal tutor? And they did some proper A/B testing, to understand, if we dealt with students this way and this way and this way, what are the differences between them. So some students were not given any help from a tutor. Some students were given help from an AI tutor after they had tackled some of the challenges, and some were given help from an AI tutor before they tackled the challenges. And then what they did was give all of these students another test. And the really interesting thing is they got a one to three standard deviation improvement in the test results using just standard GPT-4, and then they tried a customized GPT-4 and got a one and a half to four standard deviation improvement. So in test results, basically, the students were getting 50% before they got help from the AI tutor and 75% after. There was a message in there for me. A lot of people talk about, oh, we've got to fine-tune the large language models and we've got to have a special flavor of it, but we can actually get huge leaps using the everyday one that is on your phone and mine and on our laptops. It's a really good paper. I think of all the papers we've talked about over the last few months, it's the one that I would recommend people read. So it's called Math Education with Large Language Models: Peril or Promise, and at the end of it they share all of the prompts that they used. So if you want to do math tutoring, or maths tutoring, then this is the paper to read. Go and steal the prompts. It's a really excellent paper.
That's it for the research, Dan.
Yes, that's phenomenal. There's some really good stuff there. I'm really passionate about the math side of this and I need to unpick that one, because obviously the prompts and the ChatGPT kind of tools are now handling images and some of the maths equations. As an ex-maths teacher at high school, I'm really interested to see how some of the image recognition is working as well, and where the kind of blurred line is between maths formulas and prompting for other differentiated purposes and stuff. So that's going to be an exciting one to read. But there are great things there, and I like the way that a lot of these researchers are giving advice at the end of the research, like those three Es, and the roles that you mentioned, the chunk stylist or whatever else. They really sound as if they come to life, adding a bit more that we can do, actions at the end of the research, rather than, you know, "76% of people improved by using ChatGPT". So what else has been in the news of generative AI in education? Have you seen anything else?
So I saw some research from the US that said one in five US teens have used ChatGPT for schoolwork, and it went up by the time you got to 11th and 12th grade; it was a higher proportion than lower down. But it's American data, and it's slightly old. One in five have used it. I think that might actually be a lot higher now. But let me tell you one thing.
Yes.
Can I suggest it's another nail in the coffin for AI detectors? Because if you're not detecting the one in five bits of work generated by ChatGPT, then it's another example of why the AI detectors don't work.
Yeah, absolutely. And that's a great data point. But I suppose the things that I looked at recently, last week or this week (this is how quickly these podcast episodes are being published now): this week, the UK government published two research reports, one about generative AI and one about technology in schools. And I know you've got an interesting point on the technology in schools one, but the generative AI one was the call for evidence that happened. They had about 560 responses from all around the education system in the UK, and it's informing the future policy design there. We put the link in the show notes, but there were a couple of interesting data points in there. One data point right at the end of the report was that about 78% of people said they, or their institution, used generative AI in an education setting. So the usage is high. There were some really good qualitative points picked up as well, talking about lesson planning and the fact that it was making lesson planning really quick; one of the directors of teaching and learning mentioned that, and they were thinking about idea generation for teaching and learning and rejigging lessons. One high school principal said there was a massive impact already in his school: it marked coursework that would typically take 8 to 13 hours in 30 minutes, and gave feedback to students. So there were a lot of use cases appearing in the report. They were talking about automated marking, providing feedback, supporting students with special educational needs and EAL. So there were some really good things brought out qualitatively, even though it was lacking a little bit of quantitative detail. There were some really good responses that talked through some of the benefits, but it also picked up some broad challenges: around skills and user understanding, which was one of the major things, with people feeling they needed to know more about using AI and prompt engineering. The performance of the tools was picked up too; obviously teachers really worry about hallucinations inside the generative AI world. There was a big discussion around attitudes within the workplace, which is quite interesting because it starts to push on the administrative use of these tools and technologies in schools, rather than just in the classroom. And then obviously the classic things around access to the tools, managing student use of them, and data protection. So a really, really interesting report.
Yeah, I thought it was interesting that one concern they were picking up already was the risk to students and teachers: if you become overly reliant on generative AI, do your fundamental skills go down? And I think about that a lot in the context of: we could and should be using generative AI to save time, especially for teachers
that are under this incredible time pressure. But we need to make
sure that we're not time-saving on the things that make a
difference. So, if I think about the planning process, there's
steps in the planning process that are really important because it
forces you to think about things and then there are steps that are
really dull and that don't have value like writing up the notes
afterwards.
Yeah, very true. The second report, which was released at the same time (and I don't know if they did it purposely), was about, and we've been here before, the UK government doing a technology in schools survey. So it gives updated information about how schools in England specifically are set up for using technology. I suppose it is useful context for when people are using generative AI and other technologies, but there was another report that landed. Have you got any thoughts on that?
Dan, I remember 20 years ago when I was at RM, I would have been all over a report like this, because it was, you know, let's count how many network switches are in schools and whether they're managed, and let's count the laptops. There is some stuff in there that is useful, but there's too much of it which is about counting bits and bytes, and even the strategy document question was about schools having a strategy document.
The strategy wasn't about teaching and learning. The strategy was about how you manage your wires and cables and stuff like that.
You're so true. So, to summarize those two things from the UK: generative AI is being used by early adopters who say it's saving them time, and they came up with really interesting uses, but there are risks to manage; there's huge optimism from educators generally. And then, on schools and their technology, they're increasingly getting strategic about tech use, but there's still a way to go before there's a kind of minimum tech standard in the UK.
But now I'm going to take you to one thing that I spotted linked across both reports. In the generative AI report, teachers are worried about big tech. Now, you won't want to talk about this, Dan, because you're part of the big tech world, so I will. They're worried that big tech might exercise undue power, things like misaligned incentives: it's all about the money, and can we sweep up all of this data, versus the incentives for people, students and teachers to be able to learn more. That was interesting, because having been in the world of big tech, I don't see the incentive as being all about money and power and sweeping up student data. Often the discussions about more data are about how we can get more data in order to provide more value back to learners. But let me jump across to the technology report. One of the things they asked teachers was: where do you get input that you value for choosing technologies? And the top answers were other teachers, other schools, research bodies, and leading practitioners, which I think means the edu influencers on Twitter and LinkedIn.
And big tech was not in the list. I imagine if they'd asked, teachers might have put big tech down at the bottom of the list, but the other thing they did put at the bottom was their own leadership. So they didn't tend to look to leadership within the education system; they tended to look to their peers for good advice.
Wow, that's a really interesting insight. I saw a bit of research about Australian university students and how they're using ChatGPT. A really nice bit of research done down in Melbourne by Jemma Skeat from Deakin and Natasha Ziebell from Melbourne Uni. They had done a survey of university students in semester 1 and semester 2. By the end of semester 2, 82%... hey Dan, that's four in five again!
Another one.
...were using generative AI, and 25% were using it in the context of university learning, with 28% using it for assessment.
Wow, that's great.
Okay, now if that's the number of students using it, what about academics? What they found was that just 14% of academics were using it in semester 1, and 16% in semester 2. So something like one quarter as many academics are using it compared to the students.
Can I ask a question about that? Just generally, what's the split of casual academics versus full-time academics in unis? I know that's a sort of wide question, but are there statistics on that? I'm just wondering, because they've got a lot of academics in the system; it's unlike a school. I'm just wondering whether a lot of casual academics might not get as much professional development. They might be subject-knowledge experts, like accountants teaching on accounting courses. I'm just wondering what PD they get.
Yeah, great theory, Dan, and my theory goes in the other direction. So about two thirds, I think, from the research I've seen over the years, of academics are casual. Now here's my supposition: the casual academics use AI more, and the reason I'm saying that is because they are out in the commercial world two to three days a week and then teaching at university a couple of days a week. I remember that's what the pattern was for my daughter's courses. And so I'd be willing to bet that they're more likely to be using it in their professional practice and then bringing it into their academic practice, rather than missing out on some academic training.
We'll have to find somebody who might have some actual data rather than theories on that.
It's interesting. And then finally, I suppose, something that landed on our desks this morning, the 1st of December, and I think this is worth unpacking again (I know we've looked at some of the drafts already): the AI framework for schools in Australia was released today, with some really interesting guiding principles in there, from teaching and learning to human and social wellbeing, transparency, fairness, accountability and privacy. And this is where it was really interesting: when we were looking at AI detectors in the past, and I was looking through some of the policies, and looking at this, the transparency and accountability elements were quite evident in here, which is really nice to see. It also puts a lot of responsibility on schools. If you are going to be using AI detectors and picking up Ray's geography homework for possible uses of generative AI, then you have a responsibility to let people challenge that, and for the student to say, well, hey, I used generative AI in this particular way. So I think it's a very mature document. I need to look at the devil in the detail around this now in the next couple of hours today, but I'm glad it's been released.
Okay. I've not seen the final version, so I'll go and have a read of it. We'll put the link in the show notes, but in a couple of weeks' time, let's find 10 or 15 minutes to talk through it, and let's see if we can find somebody smarter than both you and I put together to talk about it. Let's see if we can find somebody involved in the drafting of it. And yeah, let's dig down deeper into it. But, as you say, there's some really specific advice and guidance for schools about what they need to do about responsible AI usage.
Yeah, can't wait. Well, what a week!
Dan, can we have the news just slow down? Because we aim to do this in 20 minutes every week, and here we are; I think we've overrun again. So, next week, Dan, the podcast is another interview from the AI in Education Conference: the longer interview with Matt Esterman.
Back in two weeks' time with more news. See you soon. Bye, Dan.
Brilliant. That research was phenomenal. Holy smokes.