Oct 9, 2019
This week, Dan and Ray talk about Predicting the Future, and how we use the past to predict the future. We discuss things like correlation and causation, and what to be aware of when using predictive analytics and machine learning to influence outcomes. We start by discussing the link between education outcomes and backyard swimming pools, continuing through the amount of data that sits in education institutions, and how it might be used to predict the future for a student or cohort, and plan appropriately.
During the episode you'll learn:
- Why correlation, not causation, might lead to the belief that to improve education outcomes we need more backyard swimming pools
- Why, through data quality efforts on missing data, there are lots of Australian students registered as living at the North Pole
- How 8 pieces of data can predict a student dropout
TRANSCRIPT FOR The AI in Education Podcast
Series: 1
Episode: 3
This transcript and summary are auto-generated. If you spot any important errors, do feel free to email the podcast hosts for corrections.
This podcast excerpt features hosts Dan Bowen and Ray Fleming discussing the profound role of data in predicting the future, particularly within the education sector, likening data to "the new oil." A central theme explored is the critical distinction between correlation and causation, with Fleming providing a compelling example of pool ownership correlating with higher NAPLAN test scores, yet lacking any causal link. The conversation covers how data is used to predict everything from consumer behaviour (like targeted advertising) and weather patterns to complex educational outcomes, such as student retention and academic performance based on historical and real-time data. The hosts also touch upon the ethical implications of predictive analytics, citing examples like Minority Report and data quality issues, before concluding with the importance of focusing on a clearly defined business problem when applying AI and data analysis to achieve meaningful outcomes.
Hi folks, welcome to podcast episode 3 with myself Dan Bowen and
my colleague Ray Fleming. Today we're going to segue in from the
last couple of podcasts we've done around AI and focusing on AI and
education and kind of take that a little bit deeper. So, a little
bit deeper around one of the concepts around the data element to it
and how data is like the new oil. But before we start, let's just
introduce ourselves again. Ray.
Uh, I'm Ray Fleming. I'm the higher education lead for Microsoft
Australia. My background, as I've said before, is uh I'm a
technologist. I'm an education technologist, but I'm not an ex
anything. I'm not an ex teacher or an ex lecturer.
Yeah. How about you, Dan?
I'm Dan and, unlike Ray, I'm an ex-everything. I'm an ex-teacher and an ex-schools inspector, currently working with Microsoft as an account strategist looking after all of our customers in Australia in education. So, um, Ray, data is
the new oil was where we ended in the last podcast and, um, moving
on that, have you got any thoughts about how that might might
transpire for this episode?
Yeah, well, look, we know we're surrounded by data and data is everywhere, but anybody that's done statistics and used data at uni, and I suspect a lot of people listening to this will have done, knows that it's not the answer to everything. You know, there's stuff around statistics about just because something is linked to something else doesn't mean it causes it. You know, the correlation and causation thing,
and uh and I'll give you an example of that. I was doing some work
with NAPLAN data.
So, looking at uh NAPLAN performance across schools
and linking it to data that was correlated, and I found one set of data that had a really strong correlation. So, actually, there was this really strong line that you could draw through the data that said that NAPLAN results went up in suburbs where there were more backyard pools, because the Queensland government issued a register of pools. And so I plotted those two sets of data together because I thought this is interesting, and of course the answer was the more pools you had, the higher the NAPLAN results went. And so that's an interesting correlation, but it's definitely not a causation, because you're not going to go and build more swimming pools in order to improve your NAPLAN results. Go to parents' evening and tell everyone they should be putting a backyard pool in.
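The pools-and-NAPLAN effect Ray describes can be sketched in a few lines of Python. Everything here is invented for illustration: a hidden confounder (a made-up "suburb income" figure) drives both variables, so they correlate strongly even though neither causes the other.

```python
import random

# A minimal sketch of correlation without causation: a hidden
# confounder drives both pool counts and test scores. All numbers
# are invented for illustration.
random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

income = [random.gauss(100, 20) for _ in range(500)]      # confounder
pools = [0.3 * i + random.gauss(0, 3) for i in income]    # pools per suburb
scores = [4.0 * i + random.gauss(0, 40) for i in income]  # mean test score

print(round(pearson(pools, scores), 2))  # strongly positive, yet not causal
```

Neither variable appears in the other's formula, yet the correlation comes out high, which is exactly why a strong correlation alone never justifies building more pools.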
Very true. Yeah. So that's interesting, isn't it? Because that data is available, but it is about that causation and actually what we're going to do about it, because we want to do that predictive analytics and kind of predict the future. And I suppose today's episode is about predicting the future. And it sounds scientific and kind of like, uh, science fiction, but actually when you've got lots of data, you can predict what's going to happen statistically.
If people hold on to the end, we can even predict the lottery numbers.
Of course we are
because I guess it's that that how do we use the past to predict
the future and and part of the reason to talk about it is because
we're going to have more and more people involved with the data and
so we need to broaden that understanding across the whole
organization about how we use data.
So so let's think what's some examples then what are some examples
about predicting the future with data?
Oh, well, I'll tell you a really, really simple example with just one piece of data, which is: I went last week to look at flights to Melbourne on the Virgin Australia website for my daughter, and for the next 3 days all I got on other websites was adverts from Virgin Australia offering me flights to Melbourne. And so that's a really clear and simple example of using the past to predict the future, because I had been to the Virgin Australia website, their prediction was that I was going to buy a flight, and they knew that I'd looked at Melbourne flights, so their prediction was I was going to go and do that. That's one data point as past to predict the future.
Yeah. And then I suppose if you've got lots of data points, things that we're all used to, and things that are getting more and more accurate over time because of the models that we're using to run simulations, are things around weather forecasting, where it takes a lot of data inputs and becomes more and more accurate around the kind of weather patterns, tornadoes, tsunamis and things, to kind of drive that. So I suppose you've got the small amounts of data which have got a simple effect, with your Virgin online advertising, but then also you've got some of the larger data sets, for example, with a weather forecast.
And with something like weather forecasting or traffic, you've also got that historical data versus current data. So, if you think about weather, you've got climate versus weather. So, climate is historically the winter is cold and the summer is hot. And then you've got the current stuff, which is the weather stuff, which is tomorrow it's going to be 27°. And an area we see that in our world all the time now is traffic prediction.
So when you go to get directions to go somewhere, it tells you how long it's going to take. And that is based on both historical information (typically, on a Monday morning it takes a long time to go over the Harbour Bridge in Sydney) and then real-time data, which is: this morning there's a protest on the bridge and it's jammed up and it's going to take you an hour.
So we have the same situation in education as we do in other
industries, which is thinking about historical data. But also
though, what can the real time tell me?
Yeah. And when we're looking at it from an education point of view, it's an interesting point when you're looking at traffic and things like that, because ultimately it's the same kind of paradigm that you could use when you're talking about assessment and student data. Because, you know, ever since I was teaching, we always had certain data sets, say in the UK at that point, where you'd have a rough idea of where students would land based on standard assessment tests that they would have done in, say, primary school. I was a secondary teacher, so I had the data from the primary school, and I would say within a particular confidence ratio this particular student would be achieving this particular goal in English, maths and science, and then you could kind of start to say, well... And it did feel like I was predicting the future, and as a teacher it felt quite awkward for me, because I was using that data to inform the students, but I was essentially saying to kids who were young, you know, at 11 years old, saying, well, you are going to get this result when you leave school, with a pretty good confidence rating. Um, and that would make me feel uneasy in one aspect, but also empower the kids to actually make a decision with their life and what they should be achieving. But it's much more complicated than that.
Right.
Well, you've also got that real-time versus historical thing, which is, you know, I think you're going to perform at this level, and then you get to the real-time stuff, which is: last night in your homework you smashed it, or you did everything but had a real difficulty with this topic. So, it's then how do you provide that individual support. So yeah, that's a really good example of taking from historical data and current data to be able to do something in education. Um, there's also times when it's much more difficult to predict things for the future. So, uh, an example would be bullying. Um, I've had a number of education customers saying, well, can we predict bullying incidents in order that we can intervene early? And one of the interesting questions often is, well, what data do you keep on that? Because if you're going to use the data from the past to predict the future, you've got to have historical data. And so in many cases, the data hasn't been collected, and so you can't use it to build a model of future behaviors.
And that's the same, you know, when I was inspecting schools several years ago as well. You know, even though we didn't analyze that data in the way we're talking about now, we did ask those questions of school leaders. We'd say, "What data are you collecting?" Because if there is an issue in the school, for example bullying, then what data are you collecting to give you informed information about what you could do to address bullying behaviors in your school? So again, it's an age-old problem, but it's putting the new lens of AI onto it.
So this is crystal ball gazing into the future, Dan.
Fantastic.
So, uh, tell me your favorite examples, because I see that kind of vision about data predictions used in movies all the time.
Yeah. Well, for me it's got to be the Back to the Future series. And I think it's Back to the Future 2 where Biff goes back and steals an almanac from the time machine, which has got all of the results for all the baseball and horse racing events in the US. And I suppose what that did, it illustrated the power of what you could do with that data if you had the result of all the games and all the sports events coming up, like England winning the cricket. Not that that would happen. But, um, you know, you could grab that data and actually make your own gains with that data. As well as, you know, it completely changed the fate of the character in the film; the film itself and the plot changed significantly just because of that one moment. And albeit that being very science fiction, it was an interesting permutation, because really, if you look at racing and form and all of that data, or cricket or whatever, you know, you could technically predict a lot of these things, right?
It is slightly cheating though in Back to the Future because
they're taking the data from the future and using it.
Yeah, it's true. Yeah, exactly. Because they didn't have machine
learning at that time. That's right.
So something like Moneyball's an interesting example as well, because that's a true story of how a coach used performance data with baseball teams and was starting to use the data in order to form the team. So rather than going on gut instinct, it was: this player constantly outperforms. Or identifying early-career people and going, well, this person will be a superstar in a couple of years' time, so I want them on my team now. That's the kind of using the past and the current to predict the future; that's an example of using data to make your decisions. I mean, we've been on that journey inside Microsoft over the last few years. We've had a real focus on how do we make decisions with data, how do we make it less about opinions and anecdata and more about real data. So, you know, if we have a billion users, how do we use the information and the telemetry from that billion users to improve products or to make decisions about what we do?
But then I do have a bit of a problem with some of that, because you're saying there about being able to kind of then select people based on their, you know, kind of inherent abilities or whatever they may be. But then you look at a film like Minority Report, where you're kind of highlighting where the criminals are and then going back and arresting them from the past. So, you know, there's a fine line there between being able to predict those things and then also what action you're going to take.
Yeah. And we're getting onto the ethics bit as well, but I saw a report that somebody had shared on social media, and it was telling them that they would make an excellent presenter and PowerPoint user based on the genetic profile that they'd done on 23andMe, which, like, I'm not sure I believe that's real. You know, the fact that 23andMe can, you know, look at your DNA and then as a result go, you're going to be a confident presenter.
Well, you know, it's almost like a self-fulfilling prophecy with that kind of thing, because if I told you you were an awesome presenter, then you'd get up on stage and be more confident. And if I told you you were going to bomb, you'd bomb. So, you know, sometimes it's used in a really dystopian way. And the other thing is data quality. Um, making sure that we've got the data in the right place and that it's correct, you know, because the consequences can be pretty severe. Maybe not quite as severe as they are in Brazil. So, Brazil is the film,
not the country.
No, the film by Terry Gilliam. Uh, one of my favorites. It's a dystopian vision of the future, and there's an office where they are typing out the list of people to be arrested, and somebody called George Tuttle is supposed to be arrested, but a fly falls into the typewriter. Oh no.
And a B gets typed instead. So, instead, George Buttle is arrested and put into jail. Um, and that might seem really funny, but I once had a driving license in the name of Raymon Fleming instead of Raymond Fleming, because somebody typing in my name when I was renewing my driving license just typed in the wrong thing.
Imagine.
Disaster. Yeah.
The downstream consequences of that just could be huge. I mean, the consequence for me is I couldn't use my driving license for ages for my 100 points of ID, because it didn't agree.
And the data quality, you know, when I was teaching, I used to see a lot of data quality issues coming through, because often the data would come from different schools, from, you know, transitional kind of bodies and things like that. But often it would come through, and then maybe there'd be students with a foreign nationality, and they didn't know what gender they were, and things like that. And there was a lot of, you know, errors in that data that's going through, and obviously if you get errors in that data in the initial stages, then the quality of the data coming out suffers.
and it's pretty complex isn't it because when you think about the
data sources and the data repositories in schools.
Yeah,
there's some pretty big lists. I mean,
So where are they from, then? What do you...
I'm going to start with what's closest to my heart from my background, which is the student information systems. So, you know, having worked with student information system providers in the UK, they're massive, massive stores of data. You know, they've got your student demographic information. You're absolutely right: you start with the student registering, where you may not know everything, and so you're putting in codes. I mean, my favorite example of that is in Australia. Yeah.
So when you're registering people on the higher education database
when you're uploading data if you don't know the postcode of the
student you have to code it as 9999.
Right.
Right. Did you know that Australia's got a postcode 9999? Yeah.
No, it really is. It's the North Pole. So when you write to Santa, you write to Santa at 9999. Every student with an unknown address up until this year has been coded as living at the North Pole.
That's brilliant.
Um, but there's an interesting data quality issue there, if you think about using that. But, you know, so you've got your core student data, you've got attendance data, you've probably got assessment marks in there, you've got all kinds of different data sets stored in that place. And you think about, well, that's a massive, massive trove of data that is often unrelated to other systems.
and the paradigm of data being the new oil would then I suppose for
some companies then make them want to capitalize on the fact that
they've got that oil, that data.
Oh yeah. And in some scenarios, when you think about education startups (and actually this isn't just education, but startups generally), one of the acquisitions is how do we acquire data, because the data could be worth more than the organization. And we've had, you know, free learning management systems that have closed down (not in Australia, in the US) where the business closed down, but the asset that they have to sell is the data, and they go and sell that. Um, but the student information system is one rich source; the other one is probably the learning management system.
Yeah. And I've seen quite a lot of those LMSs in schools, and they range from holding data on assessments to class grouping, and, you know, they really do try to do everything with the learning elements, but often they kind of come unstuck when it comes to correlating that data to what's inside a school information system. So what we've seen is usually a linkage between a learning management system bringing data in from a school information system to try to populate the LMS with as much rich data as they can to make even better decisions, because often those LMSs give you reports and analytics about the student's performance academically. So it's more of an academic repository in an LMS rather than a...
It's also a lot of transactional data, because if you think about it, it's like how often do they log on. I saw some reporting today, actually, about Victoria University, where the number of times per week that a student logged on, the number of days per week that they accessed the LMS, had a direct correlation to their pass rates.
Really?
So the more that they used it, the higher the pass rate in the course. And that same thing is where we go from just using data and analyzing stuff into artificial intelligence, because there is so much data that you can't possibly analyze it all yourself. That's where you have to hand it over to an artificial intelligence system to say, work out what the relationship of the data is. So, for example, lots of conversations about learning management systems: well, when do they log on to download their assignment, when do they submit their assignment, have they watched the lecture, have they looked at this, have they looked at that. But it turns out that when you look at all that stuff through an AI lens, where instead of you knowing the answer you ask the artificial intelligence system to work out what's important, probably the most important thing in a learning management system (and I read this last year from the world's biggest learning management system provider) was whether a student had logged on to look at their marks. That was a bigger predictor of attrition than all of the other data that they were collecting in there, and that's where AI becomes really useful.
Yeah, because you spot those hidden patterns.
That's just two bits of data; there's others as well.
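The idea of letting the system rank the signals, rather than guessing which one matters, can be sketched like this. All four LMS signals and the outcome below are synthetic; the "checked their marks" finding belongs to the provider Ray cites, and here it is simply baked into the toy data to show how a ranking would surface it.

```python
import random

# Toy illustration: rank candidate LMS signals by how strongly each
# one alone correlates with a synthetic attrition outcome.
random.seed(7)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

students = []
for _ in range(1000):
    s = {
        "downloaded_assignment": random.random(),
        "submitted_on_time": random.random(),
        "watched_lecture": random.random(),
        "checked_marks": random.random(),
    }
    # Synthetic outcome: driven mostly by checking marks, plus noise.
    s["attrition"] = 1 if s["checked_marks"] + 0.3 * random.random() < 0.5 else 0
    students.append(s)

signals = ["downloaded_assignment", "submitted_on_time",
           "watched_lecture", "checked_marks"]
ranked = sorted(
    signals,
    key=lambda f: abs(pearson([s[f] for s in students],
                              [s["attrition"] for s in students])),
    reverse=True,
)
print(ranked[0])  # the strongest single predictor in this toy data
```

The point is the shape of the analysis: you hand over all the candidate signals and let the correlation ranking tell you which one matters, rather than deciding in advance.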
Yeah. And the other elements of where that data is, you know, include the third-party applications that we spoke of in the last podcast, you know, the really large applications, like the Mathletics of the world, the learning eggs of the world. Um, was it Reading Eggs?
Reading Eggs.
Reading Eggs. You know, learning eggs. Learn about eggs. But no, we're going to talk about literacy this time. But there's a lot of these third-party applications that schools might be using that hold data. Some might be photographic data, um, based on the student's learning in a play scenario, for example, for early learning. Um, so, you know, the metadata that you could bring out of that about the learning would be quite interesting as well.
And then also the integrated productivity platforms that schools will be using, things like the Microsoft 365 suite, that kind of connect together your communication tools, your email, um, your daily productivity tools like your Word and your PowerPoint and your Excel. Um, the telemetry about knowing what you're doing yourself personally through, say, the MyAnalytics tool. So every week I get that data fed to me personally to tell me, you know, how productive I've been, how effective my meetings have been, how I should communicate more with my manager, and things like that. So, there's a lot that feels quite personal to me. So, it goes from, you know, that entire wealth of that platform to the other platforms that I'm in.
Well, and then you're going to get, and we'll come back to this in a later podcast, that thing around the ethics, the creepy line as I call it. You know, where is it okay to use data? I wear a Fitbit.
Um, and I religiously am looking at my Fitbit data, making sure that I've moved enough each day. Uh, sometimes I will take the dog out for a walk at 9:00 to hit my 10,000-step target.
But I do that for me, not for anybody else. If my boss made me wear a Fitbit, I'd probably have a completely different attitude.
And so if you think about the MyAnalytics that comes into Office,
Yeah.
Uh, and that weekly report you get that tells you: are you getting enough focus time? Are people reading the emails you send to them? How are you responding to emails they send to you? Um, using it as a tool to help me do my job better is something I'm cool with. Using it as a tool to beat me with, you know, is something I'm less cool with. And it's always been that situation in analytics in education. But as we start to use AI more, we're going to have much more contact with data and the consequences of data, directly, one-on-one. So, we've got to think about how the user might see how we build some of these analytic systems.
Yeah. And what about these other data sets that we bring in? Because you did an interesting project, didn't you, up in Queensland.
Yeah. So going back to that relationship, yeah, correlation and causation, between education data and other things. So I'd also been doing some work plotting the relationship between NAPLAN scores for schools and some of the ABS statistics, because there's a really deep, rich vein of data, but it's published at aggregate level, at suburb level. So you can take, for example, and say, well, show me the relationship between parental education, so how many people in the suburb have got degrees, and what happens to NAPLAN scores. And what I found was, using some of the public data, using just parents' education and employment rates, I could explain 65% of the difference in NAPLAN scores between schools, just from public data, not any education data. So, you know, when we think about our sources of data, it's not just the data that we can collect and store at an education level. It might be some of the public data as well, helping us do better.
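That "explain 65% of the difference" figure is an R-squared from a regression. A minimal sketch with invented numbers (the percentages and scores below are not real ABS or NAPLAN data, and the real analysis used more predictors):

```python
# Fit a one-variable least-squares line and report R-squared, the
# fraction of the variance in school scores "explained" by a public
# predictor. All numbers here are invented for illustration.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))   # slope
    a = my - b * mx                           # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

degrees = [10, 20, 30, 40, 50, 60]        # % of parents with degrees (made up)
scores = [420, 455, 470, 510, 530, 540]   # mean school score (made up)
print(round(r_squared(degrees, scores), 2))
```

An R-squared of 0.65 would mean that 65% of the school-to-school variation in scores is accounted for by the predictors, which is exactly the claim being made about the public data.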
And lots of schools, I know lots of Catholic dioceses and things, will be looking at community data as well, data that they get from the parishes, you know. And universities get information from all kinds of different sources as well. So, um, it's really interesting when you start to correlate that data together.
But the challenge then is you've got so much data.
Yeah.
Potentially, how to use it. And that's why you've got to get down to a focused conversation about what is the business problem that we're trying to solve, you know. And I think often people kind of forget that problem a little bit. But, you know, in the case of the stuff that I was trying to do around NAPLAN, the business problem I was trying to solve was: how can we identify the schools that might provide a good-practice example to others, because they're beating their prediction of where they should be on NAPLAN. Rather than it being the school with the highest score, it's actually the school that maybe has a lower score, but would be predicted to get an even lower score. Or, um, you know, taking a look at students, the prediction around student retention. So, you know, let's dig into that a little bit, because one in five students in Australia drop out. So that's either one in five don't graduate year 12, or one in five drop out of university in the first year.
And so then you look at that from a well how do we keep student
retention? How do we keep students in education until they achieve
their goal? And how do you use those data sets? So that's a
business problem. That isn't a theoretical maths problem. It isn't
a this would be interesting to know. You start with the business
problem which is how do we help students to succeed and part of
that success is keeping them to the end.
Yeah. And keeping them across systems as well, from primary school to secondary school, in that particular system that they're in.
Yeah.
And so then your conversation around that becomes how do we help solve the business problem, not how do we do the theoretical maths problem.
Yeah. And one of the key ones that we've always talked about across education is that personalization element, and that's always been the tricky one. Learning management systems and school information systems have always tried to promise that. We've never really hit that panacea, because of the fact that the data has been disparate. So actually, um, you know, looking at where that data is, what we can do for the future, but then also what strategies we can put in place. So, for example, what interventions we can use around literacy, around well-being, around suicide awareness; all those indicators can pull down to give us more personalized information, not only academic performance but also the well-being of the children in our care, I suppose.
And just going back a little bit to our conversation in the first episode of the podcast about why now: part of the reason for the why-now conversation is that the tools that we have to help us do this work are much more accessible to more people in the organization. So if I take an example about student retention: I worked with an organization about 18 months, two years ago, around predicting dropout of students in TAFE, and you only needed eight pieces of data to be able to predict, with 92% accuracy, which students were going to drop out.
Wow.
Now, the technology you needed two years ago to be able to do that: you needed some data science skills.
Well, now you don't, because the tools are moving so fast. Um, you know the Titanic example, don't you? Because I remember you showing me that ages ago, about building a machine learning model to predict who would and wouldn't survive the Titanic.
Well, I replicated that in half an hour a couple of weeks ago, just taking the data and putting it into a tool that could do all of that. And so that's that challenge of the skills you need to be able to analyze the data. That challenge is getting easier and easier. The gap is still: what is the business problem we're trying to solve?
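A dropout predictor of the kind Ray describes could be built along these lines. This is a hedged sketch, not the actual TAFE model: two invented features (attendance rate and average mark) and a synthetic dropout rule stand in for the eight real pieces of data, with a tiny logistic regression trained by gradient descent.

```python
import math
import random

# A toy logistic-regression dropout predictor on synthetic students.
# Features and the ground-truth rule are invented for illustration.
random.seed(0)

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def make_student():
    attendance = random.random()  # attendance rate, 0..1
    avg_mark = random.random()    # average mark, 0..1
    # Synthetic ground truth: weak attendance plus weak marks -> dropout.
    dropped_out = 1 if attendance + avg_mark < 0.8 else 0
    return [1.0, attendance, avg_mark], dropped_out  # bias term + features

data = [make_student() for _ in range(400)]

weights = [0.0, 0.0, 0.0]
for _ in range(1000):  # plain stochastic gradient descent
    for features, label in data:
        p = sigmoid(sum(w * f for w, f in zip(weights, features)))
        for i, f in enumerate(features):
            weights[i] += 0.1 * (label - p) * f

accuracy = sum(
    (sigmoid(sum(w * f for w, f in zip(weights, features))) > 0.5) == (label == 1)
    for features, label in data
) / len(data)
print(accuracy)  # on this toy separable data, well above 0.9
```

This is the sort of model that used to need a data scientist and can now be produced by drag-and-drop tools; the hard part that remains is choosing the features and the business question, not the fitting.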
Yeah. Yeah. And I suppose what we can do in the next podcast is start to look at what best practice might be in those areas, um, and really unpick what that should look like, and what the data estate could be, what tools you could use, and how you could actually use that in the real world.
And we should also though Dan talk about what you can't do because
it isn't all sunshine and utopia.
Yeah, true.
It's also, you know, things that are tricky, like the bullying example. Um, and the reason to talk about the things you can't do is, A, to not waste time trying to tackle problems that are difficult to solve when there are plenty that are easier to solve that can have an immediate benefit.
And the second bit is: there are times when you can't tackle a problem now because you simply don't have the data. And so you might make a decision, which is: this is really important to us, and therefore we're going to start collecting the data so that in two years' time we can move on to tackling this problem. Bullying is a good example. If you're not collecting the data in a way that's going to help you to predict bullying in the future, then maybe understanding what you can't do, and why you can't do it, is the lever to then say: we're going to make a change to our practice so that we can make this prediction in the future.
Yeah. And I suppose coming up with that list, going right back to the beginning, to kind of come up with a list of what you actually need, what the business needs out of the system. What are the key high-level business objectives?
Yeah. And keeping that retained throughout the conversation so that
you know when you've achieved that goal.
Um, now, the problem is we can go quite deep and sciency with the machine learning and the predictive analytics. So maybe we should switch across next time round to something a bit less sciency and a bit more customer-centric. So maybe talk about conversational interfaces: chatbots, robots, whatever you want to call them. That bit about, well, how do you deliver services in that way? Let's talk about that next, and then we'll link the two topics together.
Fantastic. Looking forward to it.
Okay. See you in a couple of weeks, Dan.
see you soon.