Feb 2, 2024
This week's episode is an absolute bumper edition. We paused our Rapid Rundown of the news and research in AI for the Australian summer holidays - and to bring you more of the recent interviews. So this episode we've got two months to catch up with!
We also started mentioning Ray's AI Workshop in Sydney on 20th February. Three hours of exploring AI through the lens of organisational leaders, and a Design Thinking exercise to cap it off, to help you apply your new knowledge alongside a small group.
Details & tickets here: https://www.innovategpt.com.au/event
And now, all the links to every news article and research we discussed:
https://www.newyorker.com/magazine/2023/12/11/the-inside-story-of-microsofts-partnership-with-openai
All about the drama that unfolded at OpenAI, and Microsoft, from
17th November, when the OpenAI CEO, Sam Altman, suddenly got fired.
And because it's 10,000 words, I got ChatGPT to write me the
one-paragraph summary:
This article offers a gripping look at the unexpected drama that
unfolded inside Microsoft, a real tech-world thriller that's as
educational as it is enthralling. It's a tale of high-stakes
decisions and the unexpected firing of a key figure that nearly
upended a crucial partnership in the tech industry. It's an
excellent read to understand how big tech companies handle crises
and the complexities of partnerships in the fast-paced world of
AI.
https://www.itnews.com.au/news/minterellison-sets-up-own-ai-copilot-603200
This is interesting because it's a firm of highly skilled white collar professionals, and the Chief Digital Officer gave some statistics of the productivity changes they'd seen since starting to use Microsoft's co-pilots:
Although this is US news, let's celebrate that the New York Times reports that Stanford education researchers have found that AI chatbots have not boosted overall cheating rates in schools. Hurrah!
Maybe the punchline is that, in their survey, the cheating rate has stayed about the same - at 60-70%.
Also interesting in the story is the datapoint that 32% of US teens hadn't heard of ChatGPT. And fewer than a quarter had heard a lot about it.
Ferris State University is enrolling two 'AI students' into classes
(Ann and Fry). They will sit (virtually) alongside the students to
attend lectures, take part in discussions and write assignments, as
more students take the non-traditional route into and through
university.
"The goal of the AI student experiment is for Ferris State staff to learn what the student experience is like today"
"Researchers will set up computer systems and microphones in Ann and Fry’s classrooms so they can listen to their professor’s lectures and any classroom discussions, Thompson said. At first, Ann and Fry will only be able to observe the class, but the goal is for the AI students to soon be able to speak during classroom discussions and have two-way conversations with their classmates, Thompson said. The AI students won’t have a physical, robotic form that will be walking the hallways of Ferris State – for now, at least. Ferris State does have roving bots, but right now researchers want to focus on the classroom experience before they think about adding any mobility to Ann and Fry, Thompson said."
"Researchers plan to monitor Ann and Fry’s experience daily to learn what it’s like being a student today, from the admissions and registration process, to how it feels being a freshman in a new school. Faculty and staff will then use what they’ve learned to find ways to make higher education more accessible."
https://arxiv.org/pdf/2312.00164.pdf
There has been a lot of past work trying to use AI to help with medical decision-making, but it often used other forms of AI, not LLMs. Now Google has trained an LLM specifically for diagnoses, and in a randomized trial with 20 clinicians and 302 real-world medical cases, the AI correctly diagnosed 59% of hard cases. Doctors only got 33% right, even when they had access to Search and medical references. (Interestingly, doctors & AI working together did well, but not as well as the AI did alone.)
The LLM’s assistance was especially beneficial in challenging cases, hinting at its potential for specialist-level support.
https://arxiv.org/ftp/arxiv/papers/2311/2311.17696.pdf
The researcher, from the Education University of Hong Kong, used OpenAI's GPT-4 in November to create a chatbot tutor that was fed with course guides and materials, so that it could tutor a student in a natural conversation. He describes the strengths as the natural conversation and human-like responses, and the ability to cover any topic as long as domain knowledge documents were available. The downsides highlighted are the accuracy risks, and that the performance depends on the quality and clarity of the student's question and the quality of the course materials. In fact, on accuracy he concludes "Therefore, the AI tutor’s answers should be verified and validated by the instructor or other reliable sources before being accepted as correct", which isn't really that helpful.
TBH, this is more of a project description than a research paper, but it's a good read nonetheless: it gives confidence in AI tutors, and provides design outlines that others might find useful.
https://arxiv.org/abs/2311.13984
Researchers at German universities created an open-access platform called LEAP to provide formative feedback to students, to support self-regulated learning in Physics. They found it stimulated students' thinking and promoted deeper learning. It's also interesting that, between development and publication, the release of new features in ChatGPT means you can now create a tutor yourself with some of the capabilities of LEAP. The paper includes examples of the prompts that they use, which means you can replicate this work yourself - or ask them if you can use their platform.
https://arxiv.org/abs/2312.02422
These Colombian researchers let half of the students on a course loose with the help of ChatGPT, while the other half didn't have access. Both groups got the lecture, blackboard video and simulation teaching. The result? Lower performance for the ones who had ChatGPT, and a concern over reduced critical thinking and independent learning.
If you don't want to do anything with generative AI in your classroom, or a colleague doesn't, then this is the research they might quote!
The one thing that made me sit up and take notice was that they included a histogram of the grades for students in the two groups. Whilst the students in the control group had a pretty normal distribution and a spread across the grades, almost every single student in the ChatGPT group got exactly the same grade. Which makes me think that they all used ChatGPT for the assessment as well, which would explain why they were all just above average. So perhaps the experiment led them to switch off learning AND switch off doing the assessment, which makes the result less surprising. And perhaps, if they'd used the paid GPT-4 instead of the free version, they might all have aced the exam too!
There's been a rush of papers in early December, produced by university researchers right across Asia, about the use of AI in Nursing Education, Teacher Professional Development, setting Maths questions, setting questions after reading textbooks, and in Higher Education. They come from the Tamansiswa International Journal in Education and Science, the International Conference on Design and Digital Communication, Qatar University and Universitas Negeri Malang in Indonesia. One group of Brazilian researchers tested it in elementary schools.
And a group of 7 researchers from University of Michigan Medical School and 4 Japanese universities discovered that GPT-4 beat 2nd year medical residents significantly in Japan's General Medicine In-Training Examination (in Japanese!), with the humans scoring 56% and GPT-4 scoring 70%. Also fascinating in this research is that they classified all the questions as easy, normal or difficult. GPT-4 did worse than humans in the easy problems (17% worse!), but 25% better in the normal and difficult problems.
All these papers come to similar conclusions - things are changing, and there are upsides, and potential downsides to be managed. Imagine the downside of AI being better than humans at passing exams the harder they get!
https://arxiv.org/abs/2312.00047
There was also an interesting paper from a Saudi Arabian researcher, who worked with generative AI to create questions and assessments based on their compliance frameworks, using Bloom's Taxonomy to make them academically sound. The headline is that it went well - with 85% of faculty approving it to generate questions, and 98% for editing and improving existing assessment questions!
https://arxiv.org/abs/2311.16292
Researchers at the University of British Columbia tested the ability of ChatGPT to take their Comp Sci course assessments, and found it could pass almost all introductory assessments perfectly, and without detection. Their conclusion - our assessments have to change!
https://arxiv.org/abs/2312.05241
Another paper looking at AI detectors (that don't work) - and one which actually draws a stronger conclusion: that relying on AI detection could undermine academic integrity rather than protect it. It also raises the impact on student mental health: "Unjust accusations based on AI detection can cause anxiety and distress among students". Instead, they propose a shift towards robust assessment methods that embrace generative AI's potential while maintaining academic authenticity. They advocate for integrating AI ethically into educational settings and developing new strategies that recognize its role in modern learning environments. The paper highlights the need for a strategic approach towards AI in education, focusing on its constructive use rather than just detection and restriction. It's a bit like playing a game of cat and mouse, but no matter how fast the cat runs, the mouse will always be one step ahead.
Industry research had shown that, when users did things like tell an A.I. model to “take a deep breath and work on this problem step-by-step,” its answers could mysteriously become a hundred and thirty per cent more accurate. Other benefits came from making emotional pleas: “This is very important for my career”; “I greatly value your thorough analysis.” Prompting an A.I. model to “act as a friend and console me” made its responses more empathetic in tone.
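If you want to try this kind of comparison yourself, here's a minimal sketch in Python that builds a set of prompt variants from the phrases quoted above. The function name and the sample task are our own placeholders (assumptions, not from the research), and the sketch only builds the prompts. You'd still need to paste each one into your chatbot of choice and compare the answers yourself.

```python
# Suffixes taken from the phrases quoted in the paragraph above.
# "baseline" is the plain task with nothing added, for comparison.
SUFFIXES = {
    "baseline": "",
    "deep_breath": "Take a deep breath and work on this problem step-by-step.",
    "career": "This is very important for my career.",
    "value": "I greatly value your thorough analysis.",
}


def build_prompt_variants(task: str) -> dict[str, str]:
    """Return one prompt per suffix so the answers can be compared side by side.

    Hypothetical helper for illustration only; it is not from any paper or library.
    """
    variants = {}
    for name, suffix in SUFFIXES.items():
        # strip() removes the trailing space left by the empty baseline suffix
        variants[name] = f"{task} {suffix}".strip()
    return variants


if __name__ == "__main__":
    # The task below is a made-up example; swap in your own.
    sample_task = "Explain photosynthesis to a Year 7 class."
    for name, prompt in build_prompt_variants(sample_task).items():
        print(f"[{name}] {prompt}")
```

Keeping a baseline in the set matters, because otherwise you can't tell whether a phrase helped or the model just gave a different answer by chance.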
Now, it turns out that if you offer it a tip, it will do better too.
https://twitter.com/voooooogel/status/1730726744314069190
Using a prompt about creating some software code, thebes (@voooooogel on Twitter) found that telling ChatGPT you are going to tip it makes a difference to the quality of the answer. He tested 4 scenarios:
Even better, when you thank ChatGPT and ask it how you can send the tip, it tells you that it's not able to accept tips or payment of any kind.
New research from researchers at the Universities of Melbourne and Western Australia, published in the journal Frontiers in Psychology, investigated whether ChatGPT’s responses are perceived as better than human responses in a task where humans were required to be empathetic. About three-quarters of the participants perceived ChatGPT’s advice as being more balanced, complete, empathetic, helpful and better overall compared to the advice by the professional. The findings suggest later versions of ChatGPT give better personal advice than professional columnists.
An earlier version of ChatGPT (the GPT 3.5 Turbo model) performed poorly when giving social advice. The problem wasn’t that it didn’t understand what the user needed to do. In fact, it often displayed a better understanding of the situation than the user themselves.
The problem was it didn’t adequately address the user’s emotional needs. As such, users rated it poorly.
The latest version of ChatGPT, using GPT-4, allows users to request multiple responses to the same question, after which they can indicate which one they prefer. This feedback teaches the model how to produce more socially appropriate responses – and has helped it appear more empathetic.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4635674
This paper, from researchers at George Mason University, explores whether people trust the accuracy of statements made by Large Language Models, compared to humans. The participants rated the accuracy of various statements without always knowing who authored them. And the conclusion - if you don't tell people whether the answer is from ChatGPT or a human, then they prefer the ones they think are human-written. But if you tell them who wrote it, they are equally sceptical of both - and it also leads them to spend more time fact checking. As the research says, "informed individuals are not inherently biased against the accuracy of AI outputs".
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4665577
For emerging professions, such as jobs in the field of AI or sustainability/green tech, labour supply does not meet industry demand. The researchers, from University of Oxford and Multiverse, have looked at 1 million job vacancy adverts since 2019 and found that for AI job ads, the number requiring degrees fell by a quarter, whilst they asked for 5x as many skills as other job ads. It's not the same for sustainability jobs, which still used a degree as an entry ticket.
The other interesting thing is that the pay premium for AI jobs was 16%, which is almost identical to the 17% premium that people with PhDs normally earn.
https://arxiv.org/abs/2312.07343
A group of researchers from IIT Delhi, which is a leading Indian technical university (graduates include the cofounders of Sun Microsystems and Flipkart), looked at the value of using ChatGPT as a Teaching Assistant in a university introductory programming course. It's useful research, because they share the inner workings of how they used it, and the conclusions were that it could generate better code than the average student, but wasn't great at grading or feedback. The paper explains why, which is useful if you're thinking about using an LLM to do similar tasks - and I expect that the grading and feedback performance will increase over time anyway. So perhaps it would be better to say "It's not great at grading and feedback... yet."
I contacted the researchers, because the paper didn't say which version of GPT they used, and it was 3.5. So I'd expect that if you repeated the test with today's GPT-4 version, it might well be able to do grading and feedback!
https://arxiv.org/abs/2312.05235
The researchers, from the Universities of Arizona and Georgia, looked at the AI policies of the top 50 universities in the US, to understand what their policies were and what support guidelines and resources are available for their academics. 9 out of 10 have resources and guidelines explicitly designed for faculty, and only 1 in 4 had resources for students. And 7 out of 10 offered syllabus templates and examples, with half offering 1:1 consultations on AI for their staff and students.
One noteworthy finding is that none of the top 50 universities in the US view the use of AI detectors as a reliable strategy, and none of them supported instructors in using such tools. It's a handy doc if you want to quickly understand what others are doing.
https://publications.ascilite.org/index.php/APUB/article/view/717/632
This is a whitepaper from ACODE - the Australasian Council of Open Digital Education - which means it's got local data. They've looked at local university use and policies on governance of AI, and there are some interesting stats:
https://chats-lab.github.io/persuasive_jailbreaker/
This is a really important paper - and also really easy to read, and made very, very accessible by the way it's published. The authors are from Virginia Tech, UC Davis, Stanford & Renmin University - so this isn’t lightweight research. The key message is that it's still really very easy to hack Large Language Models and chatbots, and this paper shows lots of ways to do it. They achieved a 92% attack success rate. They list 40 persuasion techniques, and they tested them against the OpenAI policy list of 13 forbidden uses, like Illegal Activity, Malware, Fraud etc. Their persuasion techniques include things like "Time Pressure", "Storytelling", "Negotiation" and simple "Compensation".
Here's an example:
Here are our takeaways:
https://digitalrepository.unm.edu/ulls_fsp/203/
This survey investigates artificial intelligence (AI) literacy among academic library employees, predominantly in the United States, with a total of 760 respondents. The findings reveal a moderate self-rated understanding of AI concepts, limited hands-on experience with AI tools, and notable gaps in discussing ethical implications and collaborating on AI projects. Despite recognizing the benefits, readiness for implementation appears low among participants - two thirds had never used AI tools, or used them less than once a month. Respondents emphasize the need for comprehensive training and the establishment of ethical guidelines. The study proposes a framework defining core components of AI literacy tailored for libraries.
This is another annual report on the Future of Work, and if you want to get an idea of the history, suffice to say in previous years they've focused on remote work practices (at the beginning of the pandemic), and then how to better support hybrid work (at the end of the pandemic), and this year's report is about how to create a new and better future of work with AI! Really important to point out that this report comes from the Microsoft Research team.
There are hundreds of stats and datapoints in this report, and they're drawn from lots of other research, but here are some highlights: