Jul 8, 2020
In this episode Dan talks through the machine learning process. What steps do we need? What data do we collect? And why does thinking about alcohol make this easier?
________________________________________
TRANSCRIPT For this episode of The AI in Education Podcast
Series: 3
Episode: 5
This transcript was auto-generated. If you spot any important errors, do feel free to email the podcast hosts for corrections.
Hi, welcome to the AI in Education Podcast. I'm Dan, and Lee's going to join us a little bit later to add some of his flavor to machine learning and give us some more insights into how this works. What I thought we'd do in today's episode is actually look at machine learning first of all and think about the steps we need to create a machine learning model. So this is, I suppose, looking under the hood at machine learning. Let's start at the basics. The world is filled with lots of data.
Whether that's music, video, documents or spreadsheets, there's data everywhere, and it's being collected from IoT devices and all sorts of things. What machine learning brings to us is the promise of analyzing that data and getting meaning from it. Data is being collected everywhere nowadays, from our cars to our refrigerators to the fitness trackers on our watches to our mobile phones. There's data coming in from a lot of different points, and the more data that comes in, the more difficult it is to process it manually. What we've had to do over time, I suppose, is manually write rules to adapt systems based on the data we can process ourselves. But as data becomes more and more abundant, it gets harder to manually create those rules. So don't think of machine learning as a dark art, this thing you need a maths degree for. There are machine learning specialists and data scientists out there, but all it is is a set of tools and technologies that you can use in your organization to answer questions with data. There's a lot of data, like we said, being generated not only by people but by devices and machines, and this is going to continue to grow. We've manually written rules to adapt these systems up to now, but the volume of this data is surpassing our ability to manage it well. So we've got to find a way to manage it automatically.
So let's think of some examples. Image tagging: if you're using social media platforms like Facebook, image tagging is available, and it's interesting because a lot of that is automated these days. There's a lot of automatic tagging of images that happens, and that's using machine learning. Then there are things like Netflix or Spotify that recommend playlists to us and recommend the next video to watch. That's all machine learning algorithms in the back end knowing what our preferences are. If we prefer action movies, for example, they take what other people think is good and share those insights with us, to recommend the next movie to watch or the next audio track or band to listen to. Search is another good example. Using Bing or Google, behind those searches are machine learning algorithms, and they're looking at the text of what you're searching for and adjusting your results based on the interests you have, where you are, your location, really trying to think about what you entered and give you good results. For example, Java is a good one: are you looking for a coffee, a programming language, or a country to go to on holidays? They adjust and optimize those search results based on machine learning. There are also lots of uses in fraud detection, and machine learning applies across multiple industries, including healthcare, say for skin cancer detection, or self-parking in your car. There's a heap of uses for machine learning. So what's the process? Let's look at an example.
Say we want to set up a model to tell the difference between beer and wine; we can use machine learning to help us with that. If we're going to use machine learning for this, we obviously need some variables to test, and with beer and wine there are multiple variables we could test. For this example, let's say alcohol content and the actual color of the drink. How would you record those? Color would be measured through the wavelength of light, and alcohol percentage would probably be measured using a hydrometer to find out how strong a drink is. So if you know how strong a drink is and what color it is, you should have two good indicators of what that drink would be. There are plenty of other variables we could use, but for this we'll just use those two features of the drinks. Okay, let's look at the steps of machine learning here.
Step one, we get the data from the bottle store: we've got our bottles of wine and beer. We'd then use a spectrometer to collect the wavelength data to get the color, and a hydrometer to work out the alcohol percentage of each one, and we record those in a table. Nice and easy.
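If you want to follow along in code, here's a minimal sketch of what that table might look like in Python. The values are made up for illustration, not real measurements:

```python
# Each sample: (color as a light wavelength in nm, alcohol % by volume, label).
# Illustrative values only, not real measurements.
samples = [
    (570.0, 4.8, "beer"),
    (620.0, 13.5, "wine"),
    (575.0, 5.2, "beer"),
    (630.0, 12.0, "wine"),
    (580.0, 4.2, "beer"),
    (615.0, 14.0, "wine"),
]
```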
Step two is to prepare that data. We don't want the order in which we collected these things, say if we did all the beers first, to mess up our training, so we'd randomize our data. We want to make sure it's random, and then we can create some visualizations to help see any relationships. For example, if we've collected more wine than beer, we want to know that, because we don't want any bias to appear in our training model. Once we've collected that data, we then split it into two parts, usually something like a 70/30 rule or an 80/20 rule, where your training data might be 70% and the other 30% would be test and evaluation data. You want to keep a clean set of data to test and evaluate your model at the end; you don't want to use training data for that, because the model already knows the answer to that data. So we put that test data aside and save it for later. Now we've got our set of training data, all prepared and randomized.
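Continuing the sketch, the shuffle and split could look like this. I've also added a crude feature-scaling step, which the transcript doesn't mention, because the toy training loop further down converges painfully slowly on raw values; real preparation pipelines normally do something similar:

```python
import random

random.seed(0)            # repeatable shuffle for the example
random.shuffle(samples)   # so the order we collected things in can't bias training

def scale(wavelength, abv):
    # Crude scaling so both features sit roughly between 0 and 1.
    # The 550/100 and 15 constants are picked by eye for this toy data.
    return (wavelength - 550.0) / 100.0, abv / 15.0

prepared = [(*scale(w, a), label) for (w, a, label) in samples]

split = int(len(prepared) * 0.7)   # a 70/30 train/test split
train_data = prepared[:split]
test_data = prepared[split:]       # set aside, untouched, until evaluation
```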
The next thing we do is choose a model. Now, there are lots of models out there already created by Microsoft and others for different types of data. There are models that are good for sequences, models that are good for text, models that are good for music, video and images. But for this, we can use a simple model that compares two variables. So we pick our model, and there's a heap out there that you can use.
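For our two-feature problem, the model really can be as simple as a straight line through the graph. Here's one way to sketch it in plain Python, rather than any particular library's API:

```python
def predict(m, b, color, abv):
    """Our whole 'model' is a straight line y = m*x + b in feature space:
    alcohol percentage on the x-axis, color wavelength on the y-axis.
    Points above the line we'll call wine, points below it beer."""
    return "wine" if color > m * abv + b else "beer"
```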
Okay, step four. Now we've chosen the model, we've got to get the training part done. I'll try to explain that quite simply, but essentially, imagine you've got an X and Y axis on a graph and you can now start to plot your data. On the X axis, say, you've got the alcohol percentage, and on the Y axis you've got the color as a wavelength. You plot the different sample drinks and they start to scatter across your graph. What the training does is try to separate them out by putting a sloped line in, and at each training step it tries to move that line to get a best fit, so that you know which type of drink is on either side of the line: which ones are beer and which ones are wine. The slope of that line comes from the formula we all learned in school, y = mx + b, and training adjusts the slope of the line for best fit. You'd run through multiple steps. The first step might be wildly inaccurate, just a random line through the middle of the data, for example, and it'll adjust until it gets more and more accurate and has a really good representation, with all of the wines on one side of the line and the beers on the other. The slope adjusts with each training step, so you end up with a really accurate model. We know then, when we plot the next piece of data, that if it lands on the upper side of the line it'll be wine, for example, and on the lower side it'll be beer. Hopefully you can visualize that. It's hard in a podcast, I know, but the idea is that you plot the data, it scatters across the graph, and then you add your model. In this case it's a simple formula and a slope which separates the two sets of data, beer and wine, and that's adjusted in training steps so you can put new data in.
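Continuing the sketch, here's a toy version of that training loop. It nudges the slope and intercept whenever the line gets a sample wrong, a perceptron-style update under our made-up assumptions, not the exact optimizer a real library would use:

```python
m, b = 0.0, 0.0   # start with an arbitrary line (here, flat along the x-axis)
lr = 0.1          # learning rate: how far to nudge the line on each mistake

for step in range(100):                     # many small training steps
    for color, abv, label in train_data:
        target = 1 if label == "wine" else -1
        score = color - (m * abv + b)       # positive means "above the line"
        if target * score <= 0:             # misclassified: move the line
            m -= lr * target * abv
            b -= lr * target
```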
Step five is when you evaluate this. We grab the data we set aside earlier, put it in, and see if the answers that come out are correct. We test against data that hasn't been used for training, which is supposed to represent, I suppose, the data in real life. And like I said, whether you go 80/20 or 70/30 depends on the size of the data set you've got: with a lot of data you might use an 80/20 split, but with a smaller data set you might use 70/30 so you still have enough data set aside to properly test your model.
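Evaluation, in our running sketch, is just asking the trained line about the data we set aside back at step two:

```python
# Score the model against the held-back test data it has never seen.
correct = sum(
    predict(m, b, color, abv) == label
    for color, abv, label in test_data
)
print(f"accuracy on held-back data: {correct / len(test_data):.0%}")
```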
Once you've done that and you know it's pretty good, then your data scientists come into it, and depending on the algorithms and models you're using, you can do what's called parameter tuning. Now, I don't know a lot about this, but essentially you can go in and tune other variables. For example, the data sets we've collected for the beers and wines haven't included some of the edge cases: what happens if something comes in that's purely black, or purely white, or water? What about some of these outliers? So it's about tuning some of these parameters so that you can really make that model more and more accurate.
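One very simple way to picture that tuning, sticking with our sketch, is a small grid search over the knobs of the training loop itself. This is my own illustration rather than anything from the episode, and in practice you'd score candidates against a separate validation set, not the training data:

```python
def train(data, lr, steps):
    # The same toy training loop as before, wrapped up so we can
    # re-run it with different hyperparameters.
    m, b = 0.0, 0.0
    for _ in range(steps):
        for color, abv, label in data:
            target = 1 if label == "wine" else -1
            if target * (color - (m * abv + b)) <= 0:
                m -= lr * target * abv
                b -= lr * target
    return m, b

def accuracy(m, b, data):
    return sum(predict(m, b, c, a) == lbl for c, a, lbl in data) / len(data)

best_score, best_params = -1.0, None
for lr in (0.01, 0.1, 1.0):          # hyperparameters we pick by hand,
    for steps in (10, 100, 1000):    # not learned from the data
        score = accuracy(*train(train_data, lr, steps), train_data)
        if score > best_score:
            best_score, best_params = score, (lr, steps)
print("best (lr, steps):", best_params, "score:", best_score)
```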
Step seven is actually using it to answer some questions. This is when you realize the value of all the steps we've done so far. Is this drink wine or beer? We can determine that by capturing the wavelength and the alcohol percentage of a particular drink and putting them into our model. That's when the fun starts, and you can actually use it on real data.
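In the sketch, that payoff is one line: measure a brand-new drink, scale it the same way as the training data, and ask the model:

```python
# A new, unlabeled drink off the shelf (made-up measurements again).
new_color, new_abv = scale(575.0, 4.9)
print(predict(m, b, new_color, new_abv))   # hopefully prints "beer"
```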
So let's just go through those seven steps again. You gather the data in. You then prepare that data: make sure it's in the format you need, and make sure it's randomized and separated out between training data and your actual test data. Then you choose a model depending on the type of data you're pointing the machine learning at, whether that's images or audio or text or numbers. Then you go through a training process, using your training algorithms and training steps to actually train that model. It gets more and more accurate, a bit like driving a car: when you start to learn to drive, the more real-world data you get in, the more times you reverse park, the better you get at it. So it learns and learns and learns and trains and becomes more and more accurate. After you've done that, you use your evaluation data, which you took out at the preparation stage, to test your algorithm against some independent data. Then you can do some hyperparameter tuning, which is tweaking some of those variables around the edges, and then you can do some real-world predictions and some modeling. So hopefully that's taken us under the hood of what machine learning is. It isn't a dark art; it's actually tools and technologies that we can use to make sense of data automatically when we've got lots and lots of data coming in. I really hope that's helped, and I'm going to bring Lee back into the studio to get his thoughts on some of the machine learning concepts and some real-world examples. So, what do you think, Lee?
Hey, Dan, it was great. Great story, great way to explain it. I've got to ask why you went to alcohol as your first point of preference for a model to work out the two things.
Maybe that says more about you than anything else.
Yeah, so it's that process, isn't it? I think when we think about machine learning, it seems like a dark art, but actually it's a logical process of steps that people have got to go through.
Yeah. Look, when I listened to where you took the story, that was the bit that stood out for me: it is a bit mystical. We kind of think machine learning is this black-box thing that computers do that nobody else understands, and there is some depth and complexity to it, for sure. But the reality is, when you break it down into those seven steps as you did, and create that journey from getting data, to evaluating the data, to making a decision based on the data, it's pretty simple in many ways. And it's a great example of how, when you walk through those seven steps, anyone in education or in any domain could quickly start to learn a bit about this process. And that's what we want to get out of this, isn't it?
Yeah. And interestingly, I suppose you just alluded to it, and I haven't mentioned this at all yet, but you said right in the middle of that sentence that it was about how they use that data at the end, those results of the machine learning. And I suppose it's about being able to go back and justify the decisions you've made. If it's telling me that a student, or a particular type of oil, or whatever it might be, has an impurity in it, due to machine learning, then you have to be able to step back through it, because you might make a big million-dollar decision, or a decision based on a student's life or somebody's health, on that piece of machine learning. So,
Yeah.
That's a really interesting point as well, being able to go back and kind of say, well, this is how I came up with that decision, right?
Well, absolutely. And it's interesting when you think about the seven steps you talked about. I'll refer back to the middle of that block, which is, I think, step two, data preparation, step three, choose the model, and then step four, training. If you think about what that actually is, the preparation of the data, choosing a model, and training, that almost should be a cyclical process, because that's the bit where you introduce the data to the problem. And data, as we've talked about in other conversations, has these biases and these challenges. Then there's the model, and maybe next week we need to really unpack that model piece and think about the way that models influence how data is used to create the outcome. Because that's the training process: a model trained in one way, plus the data you feed it, creates the outcome and the evaluation, as it were, that you link to. So I think there's a bit to be unpacked there, isn't there, in that kind of complex middle bit?
Yeah, definitely. So let's catch up in the next podcast episode then and really try to unpack this a bit further. What do you think?
I'll pull all that quantum stuff from the episode before out of my head and let's fill it full of machine learning stuff. Let's do it, Dan. Look forward to it.
Fantastic. Cheers.
Thanks, Dan.