Dec 2, 2020
Today Lee and Dan interview Auda Eltahla, Medical Genomics Research at Microsoft. We look at the vast quantities of data and the use of AI to crack the genome codes and develop new vaccines.
Useful links: DNA Storage - Microsoft Research
________________________________________
TRANSCRIPT For this episode of The AI in Education Podcast
Series: 3
Episode: 13
This transcript was auto-generated. If you spot any important errors, do feel free to email the podcast hosts for corrections.
Hi, welcome to the AI in Education podcast. How are you, Lee?
I'm good, Dan. I'm good, Dan. It feels like we haven't been
together for a while. It's uh good to be back on the air.
It sure is. Well, today we've got a fantastic episode, focusing on
something we haven't done before, which is research, the computing
around that area with artificial intelligence, and the way high
performance compute and machine learning are affecting the way
researchers are working. And I suppose it's quite apt in this current
environment with COVID. So before we get into the fascinating topic of
research, let's have a listen to what's top of mind in the news. Have
you seen anything recently?
Yeah, absolutely, Dan. And I know we made that commitment, didn't we,
an episode or two ago, to think more about what's going on in the
news, because there's so much happening. So yeah, I've got a couple
of things. In fact, one that I know you will know very well, being in
your role with the education guys, was that we announced the winners
of the Microsoft AI for Good Schools Challenge just recently. Were you
there, Dan? Did you get involved?
I wasn't. I followed all the blog posts and things, and I helped
some of the schools and mentored them through. But yeah, some
amazing ideas came out of the students and the schools across
Australia. Did any impress you when you were judging?
Look, yeah, so many of them. I mean, broadly speaking, I was lucky. I
got to judge the Western Australia contingent of the two divisions. So
division one is students in years 7 to 9, division two students 10 to
12. And you know what fascinates me, and we'll talk about the winners
in a second, I think, is you see these things, Dan, and you realize
these are just, with respect, kids. These are kids who are still
forming, they're thinking about the world and they're in school, but
they know so much and they think so much and they understand so much
about what is important to people. Because the idea is AI for good.
It's about how does AI help people? It's just fascinating to think
about the way that they see the world in a unique way.
I know. And for me, like the younger kids as well, you know, when
I'm looking through some of the finalists here for the division
one, the younger kids, um, you know, hugs for epilepsy, you know,
an AI enabled teddy bear, shark detector with bubbles to save shark
nets and, you know, drones under the water to fire bubbles at
sharks to put them off. Like phenomenal like, you know, allergy
watches like really like cutting edge stuff and and really kind of
you you could see they care a lot about things that are happening
in the environment.
It's that I think that's the thing. It's that caring and I've said
this before and I said it when I was on the interviewing panel with
with a judging panel is you just get this sense that you know for
all the things we hear about the world being a pretty abysmal place
sometimes with all the bad things going on. You see these kids and
their thoughts are really in just about doing good about how could
we just help somebody? And it doesn't have to be a big sector of
society. But it's just, you know, how do we help a small group of the
world to be slightly better off using technology? So, I think it's
we're we're in a good place, Dan. You and I can retire well in a
world looked after by by smart.
I know. One of the ones that jumped out at me as well was Nitrobot.
I'm reading one of the articles, and, you know, the description was:
using AI to analyze images of river systems collected by drone, in an
attempt to identify high concentrations of nitrogen and farms with
excessive fertilizer runoff, to allow targeted intervention to protect
the Great Barrier Reef. I'm like, "Wow, that is phenomenal."
It really really is. You know, this isn't, you know, and I I think
this AI for good program is going from strength to strength.
Oh, look, it is. And we I know that I mean the team that did it, J
Mackerel and Travis and Changemakers, you amazing team created
something amazing. But we should recognize the winners. That was
the news. Sorry to the point. So, and I think what's really
interesting this year is um
I don't remember last year, but the winners this year, so there were
two. The division one winner, um, I can't pronounce it correctly,
alapolite, was from Ravenswood School for Girls in Sydney, and it was
a way for people with physical disabilities or special needs to order
clothes that fit. Now, I've seen before some fashion companies that
have done special clothes, because people with, say, muscular
dystrophy or cerebral palsy can struggle to get clothes, so they have
automatic buttons and things. But these girls thought about this idea
that technology can scan the body and understand what's unique about a
person. They thought about an individual's problem, not generically,
you know, how do we make clothing that's suitable for less people. So
that was great. But the one that really stuck out and I don't know
if you saw it was actually um it made the news. It was on Channel 7
News and was, you know, made a major media impact. Uh was a young
girl from a um I forget the school she was from now. I apologize,
but her name was Christine Dougl
from Seven Hills High School. Yeah. Christine Duck who is who
herself
is um she's visually impaired
and she's uh she wanted to play games, you know, and I can
appreciate that. So, she came up with this whole idea of using AI
and sort of haptic gloves and VR and tools to help her and others
like her to uh to be included, you know, to be a part of the
experience of playing games. And, you know, we've always talked
about as a company the importance of games as great connectors,
you know, that bring people together, enables us to kind of
normalize life and and society. So just amazing that someone who,
you know, lives in that world, thought about the problem in such a
big way and and a really deserving winner. So that was great news
to see those out there and great news that the whole challenge
is going to go global next year. I think we talked about this at
the event. This has been so successful in Australia
that it's going to happen globally and that's just a fantastic
piece of news.
Oh, sure it is. I know. Anything else jump out at you?
Uh yeah, just a couple of quick things. So just this week, the Human
Rights Commission of Australia, so Edward Santow, someone who we've
worked with quite closely over time, published a paper on recognizing
and preventing AI biases. Now, this is kind of an interesting topic,
because we're always talking about AI ethics and bias and how we must
always think about everybody. But what I liked about their paper is it
says, okay, yes, you've got to do these things, but it also kind of
showed you how to do it. Now, you know, it doesn't cover every
scenario, but it helps us step a little bit out of the "this is
important" into the "okay, this is how we do it". So we'll put a link
in the show notes, but it's definitely a great paper to go and have a
read. I've had a look at it, and there are some contributions from a
range of people, CSIRO and Human Rights. So that was great to see
published.
And the last one from me, Dan, for the news this week. It's a little
bit dated, actually, it was a couple of weeks back. But I wanted to
shine a light on a little bit of technology that we've just made
available called Lobe, L-O-B-E. Now, Lobe addresses that problem of
when you're trying to train a machine learning model. That's kind of
hard to do, because it requires you to understand (a) the data you're
looking at, the data sets, and (b) the tools you're working with.
Um, and so Lobe is kind of a simplification of that data
classification, data labeling, and data training approach. So it
simplifies it by taking the coding out of it. You know, we talk a lot
about low code, no code approaches to using technology. This is an
example of how we're trying to think about that in the machine
learning world. So again, we'll put a link in the show notes, but we
announced that, I think, just at the end of October. A great little
tool for everybody to be able to understand and
start.
Yeah, I'm looking at the article now as as you're talking and
they're talking about like a backyard beekeeper utilizing machine
learning to kind of uh organize and and kind of manage the stocks
there of honey and things. It's fantastic.
It's it's it's like that citizen data science model, but actually
when you get to the website listeners and you're looking at the the
beautiful picture on the Did you see the video play when you landed
on the website?
I know. It's it's a beautiful picture.
Yeah, it really is. And using like visual recognition and
different, you know, looking for hornets are going into nests and
things. It's fantastic.
It is. Yes, it's fascinating. So, that look there's lots more news,
but that's the stuff that I thought was top of mind and uh we
should get on with the conversation.
Let's speak to, um, we've got Auda Eltahla coming to speak to us today
from Microsoft. Now, he's got an interesting background. So,
welcome Auda. Tell us a little bit about yourself and your
background and how you got into the role you are currently.
Thank you. Thank you, Dan, and thank you, Lee, uh, for having me on
this podcast. I'm really, uh, really excited actually about this.
Um, uh, my I'll tell you my background. I come from a from a
research background. So, uh, I actually spent about 10 years uh,
doing research both uh, you know, I did my masters, my PhD, and I I
was a full-time researcher for a bit before joining Microsoft. Um,
and I was classically trained as a as a a biologist actually. So I
was in the lab analyzing DNA and RNA uh and then taking it to
taking that data to the computer and starting analyzing what does
it actually mean? What does it what does it mean when we stretch
that DNA um and how can we understand diseases? So I was focused on
diabetes um and then um I was actually transitioned a little bit to
understanding uh the viruses. So this is of course a very hot topic
in 2020.
Um, but I did my PhD on these tiny viruses called RNA viruses, which
are very simple packages of nucleic acid. So they're basically a small
protein bubble, and inside them they have this genetic code. And of
course everyone now is familiar with SARS-CoV-2, the virus that causes
the COVID disease. These viruses are just fascinating. They're such
tiny packages of nucleic acid, but once they get in the body they
actually start controlling the cell, and then they start making more
and more of that virus. And what we were doing before I joined
Microsoft was actually taking that virus and starting to understand
the genetic sequence. So I guess most people are familiar now with
DNA, basically the molecule that controls how the cell functions, in
every cell of your body. Viruses do have that component. They do have
genetic material inside them. And what we did in the lab is we took
those viruses, we started taking that nucleic acid inside the virus
and we started analyzing it. We started seeing what does it actually
produce in terms of protein? How does it control the cell? And how do
we as humans respond to those viruses? And it's all within, you know,
data. It's in the world of data.
And can I ask a question on that, because it's fascinating. I suppose
when you're talking there, there's all these parallels with data
science. I know we call it data science, but I suppose some of the
terminology, like sampling and hypothesis and things like that, all
comes from that science background. So this is really exciting. But
when you're talking about collecting that data, how does a scientist
go about it? Forgive my ignorance here: you're looking under the
microscope and you're looking at a protein. Where does that process
start, with that data collection point? Are you writing things down,
or are you collecting samples through other means?
Yeah. So that's a very good question, and it's actually something
that's faced now by basically everyone that does genomic research. So
we invented a way to sort of get that DNA sequence out. And if you
know, Dan, DNA or RNA is composed of basically four letters. The
building blocks are A, C, G and T, and it's all about how these are
organized and how long they are that controls what proteins are being
made. And then that essentially translates to, eventually, what a
virus actually looks like or what a cell would do. So in the past, I
would say, 10 to 15 years, the technology that came into understanding
or reading that genetic material has been revolutionized, really. And
with that, the cost of actually reading that genetic material has
dropped significantly, you know, from tens of thousands to now you can
read a whole human genome for a thousand dollars or even less.
right
um so but with that technology
yeah
came, you know, more in-depth reading of that genetic material. So you
can actually now read, you know, every cell in your body, Dan, has
three billion bases of genetic material, every single cell. And now we
have the technology to actually go and read not only Dan's genetic
material but actually go into individual cells and read the genetic
material of that individual cell. So you can imagine, when we were in
the lab, all of this data was coming to us from these massive
machines, you know, multi-million dollar machines,
and it was coming to us in text files. So all of a sudden we were
faced with this. We're lab scientists in biology, and a whole new
science came into existence called bioinformatics. So this is the
field of being able to interpret that data. What do you do with these
massive text files, you know, gigabytes and terabytes of data? And all
of a sudden we biologists needed to understand and actually interpret
this data. So this is where data science essentially came in. Sorry,
go ahead.
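Editor's sketch, not from the episode: a rough back-of-the-envelope calculation of why these text files reach gigabytes and beyond. It assumes one text character per base, and the 30x coverage figure (each position being read about 30 times) is an illustrative assumption rather than a number quoted by the speakers.

```python
# Rough size estimate for sequencing data stored as plain text.
GENOME_BASES = 3_000_000_000  # about three billion bases, as quoted in the episode
BYTES_PER_BASE_TEXT = 1       # one ASCII character (A, C, G or T) per base
COVERAGE = 30                 # ASSUMPTION: each position read ~30 times; not from the episode


def text_size_gb(bases: int, coverage: int = 1) -> float:
    """Size in gigabytes (10**9 bytes) of `bases` letters read `coverage` times."""
    return bases * coverage * BYTES_PER_BASE_TEXT / 1e9


single_copy_gb = text_size_gb(GENOME_BASES)
with_coverage_gb = text_size_gb(GENOME_BASES, COVERAGE)

print(f"One copy of the genome as text: {single_copy_gb:.0f} GB")
print(f"At {COVERAGE}x coverage: {with_coverage_gb:.0f} GB")
```

Real sequencing files also carry per-base quality information and read identifiers, so actual sizes would likely be larger than this estimate.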
No, no, it's really fascinating stuff, Auda. And I've never quite
understood one principle, so I don't want to take us too far off the
AI story, but what you're talking about is the fact that we can
sequence DNA. We now understand something that is so simple in
principle: four proteins, I think, is that correct? I remember the
movie Gattaca, that's the only way I remember the four proteins. And
we understand that, and you talk about the complexity of the breakdown
of any of those four pieces and the complexity of the cells in Dan's
body. So we understand all that, yet we still struggle as a
civilization with diseases like cancer and other things that are
cell-level diseases. What's the gap? How is it that we can learn so
much about DNA breakthroughs, yet diseases like cancer still elude us
in terms of understanding how they work?
Yeah, that's another very good question Lee and I I think this is
um in in our optimism when we had uh you know the first human
genome sequence the human genome project uh which was an
international collaboration you know multi labs all around the
world contributed to that sequencing. We thought we were going to
crack it. We thought once we once we're able to read the DNA um of
the entire like the whole genome and we know essentially what the
building blocks and how that goes from DNA to RNA to protein which
is the process of the cellular function. It goes DNA RNA and then
protein. Um I think most scientists uh in trying to simplify uh you
know life and essentially and trying to understand understand it.
We thought we were going to solve it. That's it. Read DNA. We're
going to understand all diseases. Cancer is gone. Uh, diabetes is
solved. Um, it turns out that DNA is actually just a one piece of
the puzzle. So, turns out that RNA is a whole other piece. Turns
out that protein is a whole other piece. Um, and essentially going
from DNA to RNA, there's a whole lot of regulation. Just because
you have the same DNA doesn't mean that you will get the same RNA.
And this is where where where we start getting into um uh you know
controlling of the genome. How do you how do you control uh how
does a cell control the genome going from DNA to RNA and to
protein? And the same thing goes from RNA to protein. So how when
just because you have the same RNA doesn't mean you have the same
protein
And so there's a whole lot of complexity about how that process
essentially goes from DNA to protein and function, and vice versa. So
then we started discovering that actually some of those proteins go
back to control the DNA, and some of the RNAs go back to control the
DNA. So I think that's really an exciting new field. Actually, another
thing that, just before joining Microsoft, we started doing in the lab
is realizing, actually, Lee, just because we can sequence your genome
doesn't mean that every cell in your body has the same genome, and
even if they have the same DNA, it doesn't necessarily mean that they
have the same RNA.
So one of the most exciting fields that I started working on before
joining Microsoft is actually being able to go into say your blood
um um cells Lee and being able to isolate these cells one at a time
and actually sequence their genome one at a time and you'd be
surprised about the massive heterogeneity we call it which is the
diversity of these cells essentially just you're looking at you
could almost look at different people when you look at um
individual cells
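Editor's sketch, not from the episode: the DNA to RNA to protein flow described above, reduced to a toy example. The codon table below contains only five entries from the standard genetic code, the example sequence is made up, and none of the regulation the speaker describes is modelled.

```python
# Tiny, deliberately incomplete codon table (entries from the standard genetic code).
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": "STOP",
}


def transcribe(dna_coding_strand: str) -> str:
    """DNA -> RNA: for the coding strand this amounts to replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(rna: str) -> list[str]:
    """RNA -> protein: read three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "???")  # "???" = not in the toy table
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


dna = "ATGTTTGGCGCTTAA"  # made-up example sequence
rna = transcribe(dna)
protein = translate(rna)

print(rna)
print("-".join(protein))
```

The point made in the conversation is that this neat one-way picture is only part of the story: the same DNA does not guarantee the same RNA, and the same RNA does not guarantee the same protein.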
Wow. And I suppose this is massive data, right? What this is jumping
out at is that the data needs to further this field are just, you're
talking about terabytes of text files there. But if every cell in the
body is going to be slightly different, or even vastly different, then
the sampling of that data, and the data that we're going to bring in,
you know, we can only solve that with technology.
Absolutely. Absolutely. And how much you sample is dependent on
technology. I mean, only technology can give you that resolution,
where people are now starting to talk about a cell atlas, essentially
sequencing cells from all over the body to be able to build the map of
individual cells in the body.
Um but but you're absolutely right. I mean the challenge that comes
with that I mean it's it's in one side it's very exciting because
we can now get all this data on another side it's actually really
challenging like what do we do with all this data? How do we even
interpret all of these uh different genomes at the same time. So
this is where we started um using AI. Actually AI is is is very
useful in that space. I mean in the simplest form um we were doing
um supervised clustering and unsupervised clustering. So this is
what we started doing in the lab. You essentially without making
any assumptions you can do you can put all these cells the data
from all these cells together and then you can tell a computer
actually I want you to cluster these cells basically separate them
based on what you think or what the algorithm thinks there's
similarities between them. So this is this is a form of machine
learning called unsupervised clustering. And what it does is, say I
sequence cells from your lung and also from your different muscle
tissues and blood, it's surprising how, by putting that together and
letting AI take care of it, you can actually see how different cells
tend to cluster together. So it separates the blood cells from the
lung cells, and even within the blood cells, different types of blood
cells.
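Editor's sketch, not from the episode: a minimal k-means routine as one example of unsupervised clustering. The speakers do not say which algorithm was used, so k-means is a stand-in, and the "cells" below are invented two-number profiles (hypothetical levels of two marker genes) rather than real data.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters without being told what the groups are."""
    # Deterministic start: first point, then the point farthest from chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Hypothetical cells: (marker gene A level, marker gene B level).
cells = [(9.0, 1.0), (1.0, 8.0), (8.5, 1.5), (1.5, 9.0), (9.2, 0.8), (0.8, 8.5)]
labels = kmeans(cells, k=2)
print(labels)
```

No tissue names are given to the algorithm; the two groups emerge from similarity alone, which is the behaviour described in the conversation. Real single-cell analyses work with thousands of genes per cell and typically use dedicated tooling.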
Wow.
Hey, so Auda, I have a question for you, and I didn't prepare this one
for you, so I'm sorry if I'm going to throw you under the bus, but
there's a project at Microsoft that I'm not sure you've come across,
called project pal, which is this idea of using DNA encoding structure
to store data.
And the idea, the principle, is that DNA encoding is so densely
packed, so tightly coupled, that you're able to store so much data in
a very small piece of DNA. So the principle is that if we use that
same model, could we store so much more data in an entirely more
compressed way?
I get the idea of it, but I don't really understand the principles of
it. Can you, for me, for my simple brain, explain how DNA is such a
dense encoding of data approach?
Absolutely. Absolutely. So I think, honestly, this is one of the most
exciting projects coming, and I really can't wait to see that come
into the market and people actually leveraging that technology. The
idea is really quite simple. If you think about classical data and how
it's coded, it's ones and zeros, right? So the possibilities, you
know, are that you get one zero, or one one, zero zero, one. If you
start going to DNA, you all of a sudden have A, C, G and T. So you can
imagine the possibilities now of actually encoding a message that you
might have with four letters instead of two are incredibly more
complex and more vast.
And with that, the technologies that I described earlier, being able
to read the DNA, came at the perfect time for this data technology,
because we can now very cheaply read that DNA. So we store the message
that we want, or the data that we want, in the form of DNA. And by the
way, DNA can be synthesized. We can get in the lab and we can actually
say, you know what, Lee, I want this message written as ACCT. I can go
into the lab and I can make that message and store it.
Um, and because it's molecular, it's very small in volume. I've
actually seen, I'm not sure if you've seen that, Lee, but there's a
picture of, you know, a Walmart warehouse converted into a DNA tube,
essentially, that you can hold in your hand,
which is fascinating
but and it's hard to get your head around it but it's amazing to
think that that's the compression ratio if you want to use a simple
simple way of thinking about it
And I can tell you, when working in the lab, you know, mostly people
that work in the lab handle invisible solutions. They basically just
transfer one solution to the other. So we don't see DNA. It's just so
incredibly small and compact, and the amount of data that could be
coded in that is incredibly
vast.
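Editor's sketch, not from the episode: the "four letters instead of two" idea, written as a toy encoder that maps every two bits to one base. The particular mapping (00 to A, 01 to C, 10 to G, 11 to T) is an arbitrary assumption and is not presented as the scheme used by any real DNA storage project; practical schemes would also need error correction and other safeguards.

```python
# ASSUMED mapping: each pair of bits becomes one DNA letter.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(message: bytes) -> str:
    """Turn bytes into a DNA-letter string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in message)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a DNA-letter string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # four bases per byte
print(decode(strand))  # round-trips to the original message
```

With four possible letters, each position carries two bits instead of one, so a byte fits in four bases. The density discussed in the conversation comes mostly from how physically small each base is, not from this factor of two alone.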
So, so when we've got, so let's go back onto the story then. We've
got this data. We've we've seen how it's been collected. We know
there's vast amounts of it. We realize now that it's not just about
that DNA, it's about the fact you got to sample like an entire body
which has got billions and billions and I don't even know how many
cells in in inside the body. So, we need to sample that again and
we can use AI to almost fill in the gaps. So, what are the
technologies, then, that you find useful, or you're seeing being used
to kind of support that research element going forward? Things like
high performance compute and things?
Absolutely. This is actually where my link to Microsoft came into
play. So I naturally started working with these massive data sets. I
started learning the skills to analyze these data sets. But then all
of a sudden, with these huge data sets, we were like, wait, we can't
actually do this analysis on our laptops. If you can imagine these
huge text files and the huge sequences that we needed to analyze,
these have to be loaded into memory. So all of a sudden we started
talking about high performance computing. We needed machines with vast
amounts of memory to be able to load this data there. And then once
you've got this sequence, you essentially try to do one of two things.
You could try to compare it to other sequences out there. So, you
know, we have reference sequences, and data has become more and more
available. So you can try to compare that massive text file,
essentially, that you received
to sequences that are available on the web.
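Editor's sketch, not from the episode: the simplest possible version of comparing a sample against a reference sequence, checking letter by letter. Both sequences are invented, and the position-by-position comparison is a simplifying assumption; real comparisons must also handle inserted and deleted bases and run on far larger inputs.

```python
def mismatches(sample: str, reference: str) -> list[tuple[int, str, str]]:
    """Return (position, reference base, sample base) wherever the two differ."""
    if len(sample) != len(reference):
        raise ValueError("this toy comparison needs sequences of equal length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = "ACGTACGTAC"  # made-up reference sequence
sample = "ACGTTCGTAA"     # made-up sample with two changed letters

differences = mismatches(sample, reference)
for position, ref_base, sample_base in differences:
    print(f"position {position}: reference {ref_base}, sample {sample_base}")
```

Scaling this idea from ten letters to billions, across many samples, is where the memory and core counts discussed next come in.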
and and and when when you're talking about those things just to
give us a bit of sense of perspective here when you're talking
about these large text files and the amount of memory the what what
are we what are we looking at what are we looking at here? What
kind of real figures are we looking at?
Yeah. So, I mean, the typical desktop that you would have, it might
have, I don't know, 16 gigs, 32 gigs of RAM. That all of a sudden was
incredibly small for trying to analyze these files. So all of a sudden
we started talking about, you know, 100 gigs of RAM and 200 gigs of
RAM and even more. And we needed 64 cores or even higher. So we
started going into the high performance compute. Most universities, in
Australia certainly, but I think around the world, would have a
cluster of hardware compute. You know, this is high performance
compute for people that just need big machines, essentially. But then
even that wasn't enough for us, and in fact, you can imagine the
entire university is trying to leverage that hardware. So we started
actually going into national computing infrastructure. So most
countries that focus a lot on research would have a computing
infrastructure that's accessible to researchers,
but once again all of the researchers around that around Australia
for sure are trying to access this national computer
infrastructure. So, one of the challenges that we faced is you know
we have to go into those uh into those machines and just wait.
Sometimes we had a conference to present our results at, or we had a
paper or, say, a grant. As you know, Dan, as a researcher you have to
analyze your data, publish, then present at a conference, apply for a
grant to get the money, and do that again. So we definitely have those
deadlines. So when it came to applying for a grant or publishing a
paper, we really needed access to these resources right then and
there. And with research, of course, time is really money, because
someone else could just find that discovery, and it happens all the
time. In fact, it happened to us twice when I was a researcher.
No.
So this is where the cloud has become a really attractive solution,
because the resources are right there, especially the high performance
computing. I can just go upload my data, turn on that machine, get as
many machines as I want with as much RAM and as many cores as I want,
and shut them down, and all of a sudden I've got my results. So
this is where we started learning a little bit about cloud
computing. Of course, renting a virtual machine is the simplest
form of of cloud computing. But once you get into that world, all
of a sudden you discovered, hold on, I can actually leverage um um
you know, massive data warehousing. I could store uh not only my
sequence data set, but the patients that I'm collecting u the
characteristics of the patients. And, you know, there's this
misconception that cloud is not secure enough for storing this data.
But that's completely untrue. In fact, storing that data on a PC, on
your desktop when you're at uni, is way less secure than putting it in
the cloud.
So, Auda, I mean, obviously that's a lot of what you do in your job
here at Microsoft now, helping universities and researchers understand
that difference. So as you've switched over from being on the research
side to being now on the technology side, what's your view of the
state of research adoption of cloud-scale compute, of any cloud, of
any technology? Is it getting there? Are we moving forward? Is
progress being made, do you think?
I think absolutely there is progress. I do think it's slower than in
other spaces I've seen. Um
is that is that here in Australia only or um globally?
No, I think it's a worldwide problem. And it's mostly because, I would
say, there's a few news articles that were published about the
sensitivity of the cloud and issues about leaking, you know, whether
it's leaking images or leaking data, all around the world. When you're
a researcher and you're in charge of patient information, you really
feel responsible for that information. And, you know, we try as much
as possible to de-identify that information, but as you know, Lee,
that's still prone to error, right? So you can still make mistakes.
Um, and and of course, you're also getting funding from the
government, you know, taxpayers dollars. So, you really feel like
you're responsible for all this data. Uh, so one hand, you're
trying to protect the patients that you're that you're um trying to
study and on the other hand, you're trying to, you know, save on
the on the the cost and also just making sure that you don't lose
this data. So, we felt very comfortable just putting our data on
little hard drives and putting in the drawer and at least we can
see them right there.
Yeah. It's that idea that you've got it in your hand so therefore
it's secure which of course we know is not always the case but
that's the mindset. Yeah.
And I was going to say, I think combining that with what we were
hearing about the cloud and the vulnerabilities in the cloud, or at
least what we saw, that sort of just mixed together. We felt a little
bit, you know, I think scared of the cloud. If we just put our data on
the cloud, we're going to lose control of that data. We don't know
where it lives. And we don't know who has access to it.
I suppose when we when we look at finance, the financial
institutions of the world, you know, it was all about who could do
that operation the quickest, you know, who was closest to the
exchange. I want to put money on this particular stock or share,
you know, nanoconds before somebody else. I want to buy it quickly
and take it out quickly. So, it sounds like the same race in
science. So, two things I'd like to ask you there. Firstly, I
suppose when you're doing the research and you talked about
presenting back there quickly your papers you know is there a
legitimacy when you've done a lot of the analysis with compute so
you know if you go into you know in front of a board and you say
I've done this analysis and I found X and I did that using
artificial intelligence for example do they then say well you need
to be legitimate in what type of machine learning you've used or
you know you know because I could present any data back to somebody
using a machine learning algorithm and it could be wrong. And then
also, you know, the second angle to that is: are our universities
and things then looking at an array of different companies, because
everybody's innovating to the top, I suppose, and, you know, one
day Microsoft might be doing something amazing with high
performance compute and the next thing Google might and the next
thing Amazon might. So how do we really get the legitimacy into the
results? And also, what's the selection of tools that we'd use?
Absolutely. That's a great question and I think it's absolutely
top of mind, particularly for people. I mean, reproducibility is a
huge part of science. Being able to do an experiment and publish
the results and, you know, making a massive claim is not enough
really to be accepted by the community. Someone else needs to be
able to do exactly the same experiment and get the same conclusions
if they use the same tools.
Right.
Mhm.
So, now with this explosion of data, particularly in genomic data
sets, that really goes all the way from the sample that we're
getting to the data and then the tools, as you said, that we use to
analyze this data, as well as the hardware or, in our case, you
know, the cloud resources that we use.
So there are two things, I think, to say here. The tools are
probably the most complicated thing. There are definitely efforts
to standardize the tools, particularly in bioinformatics but I'm
sure all across research, because these tools came, you know, as an
organic response to the explosion in the volume of the data sets.
You know, all of a sudden we scientists were faced with these
massive data sets and we had to invent the tools, and if we knew a
little bit of Python or R we could just make these tools on the go.
So I think
classically, at least when genomics as a field became a little bit
more widely available, people were not trained as, you know,
programmers or software engineers. They were just people that knew
Python and developed amazing software. But then came the problem
of, hold on, actually how do we make sure that these tools are
available, and whether the different researchers are actually using
the same tools, because if you use different tools you will get
different results.
So there were efforts by, you know, the likes of the Broad
Institute in the United States, and in fact Microsoft has actually
worked with them to get these tools sort of, you know, contained in
a service. We call it the Microsoft Genomics service, but there's
actually work being done to expand on that. And that's just
essentially a set of parameters or a set of tools that were
stitched together so that, if you use them, you should get the same
results. So that's the one thing about the tools. Now with the
hardware and the software and, you know, the packaging of it,
there's certainly now talk about what version of this software
you're using, or what hardware or how much RAM you're using, and
does that actually make a difference. And so, you know, with that
combination there's that effort of containerization. I think all
the big cloud providers provide these containerization tools and
ways to execute them. I think that has massively helped the
scientific community, because you can actually take that set of
tools, put it in a container, and if I say I used this hardware,
you should be able to get the same results.
I think it's a really interesting area, Auda, and if I'm thinking
about this from an AI perspective, I think a lot about that
responsible, ethical, transparent, accountable AI world that, you
know, we want to make sure is part of the world we live in. And you
think about research work and the fact that research is
transparent and peer reviewable and repeatable and it needs to be
proven by multiple people to be validated as a good outcome. Yet
one of the problems of AI is we create a model, give it data, and
then we allow it to learn from the two to create a different
outcome. So what's your view on, you know, is AI a good ethical way
to create peer reviewable, transparent scientific research? Because
they seem to be almost at loggerheads, in that one expects
continuous similar results and one expects a changing outcome based
on the data you feed it.
Yeah. And I think, you know, at the end of the day AI is an
algorithm that, you know, ingests data and gives you predictions,
or presents data somehow in a way that you just didn't see before.
And I do think it's slowly coming in, particularly in the genomics
space. We are now seeing tools that are leveraging AI. Classically
most of the tools were just high performance computing: they
leveraged high memory and just crunched numbers. Now that AI is
coming into play, there are some tools that are gaining popularity.
And I think the nice thing with science is that most of this is
open source. I'm actually very proud that at Microsoft we've
endorsed open source to the level that we have, you know, GitHub
being acquired by Microsoft. And we as scientists publish most of
our source code, mostly on GitHub. So not only do you have access
to the data, which has now become a requirement: if you actually
need to publish your science, most journals now require you to make
your data available. So you go from scratch, give us that data and
give it to the rest of the scientists. But actually the tools and
the algorithms that you use to analyze this data also have to be
made available.
So it's about disclosure of everything then. It's putting the tools
and the data out there, not just the findings.
Absolutely.
And, you know, there is a massive incentive for scientists as well
to make their tools available. I think, you know, it comes all
through scientific history that the tools used for science are
actually more popular than what most people think, because, you
know, when a tool is easy to use and widely available, then, you
know, scientists go by citations. So all of a sudden you see all
these scientists all around the world using your tool, referencing
your work, and that's great validation that it's actually a good
and, you know, polished tool. So, making that available, I think
particularly for AI, I'm very optimistic about it.
And the last question for me then. You mentioned tools there. You
know, I suppose we're trying to democratize a lot of AI and bring
out things like, you know, auto machine learning, AutoML, and, you
know, lots of those kind of drag and drop interfaces. The
fascination for me, when you mentioned Python and R earlier on, is
why did we start using Python and R? Because I used to teach kids
that at school, and Python for me was like one of the worst
programming languages because you can, like, declare a variable and
then change it halfway through and it just wasn't very structured.
So why did we land on Python and R, and are there any other tools
that are democratizing this field for you?
Yeah, no, it's a fascinating topic and I think, you know,
throughout my time in research I saw the popularity of Python only
going up. I have to say I didn't know about it until a few years
into actually analyzing my genomic data sets. But I really think it
comes from the field of data science in general. I think it was
very powerful. It was open, first of all. It was very powerful to
be able to crunch numbers and build models, and I think that's
exactly what the scientists wanted at the time. I think being able
to package your code into, you know, these little modules in Python
was hugely popular. As you know, Dan, now it's become like the
building block of most machine learning algorithms, all these, you
know, little packages in Python. In saying that, there is also the
world of R. With all this data, statistical analysis just came
naturally into play, and I think all the statistical packages that
were available in these languages just made them a natural
selection for these types of research.
And what about the other, you know, interfaces that are coming up?
Is that helping, you know, scientists and researchers, with the
drag and drop interfaces and, you know, all of these kind of
things?
Yeah, I mean, I definitely think it is helping because, you know,
not everyone's going to be a hardcore machine learning scientist or
a data scientist, especially because we do need people that
understand the biology, and you can imagine how complex it is to be
able to understand the biology as well as the data science field.
So you do need that breakdown, and I think that making machine
learning accessible to the different skill levels is incredibly
useful. I think of myself: I'm trained, as I said, in the lab. So I
was in the lab preparing and handling samples and DNA, but getting
into the machine learning space with that drag and drop, I think,
is incredibly useful. I can all of a sudden actually build machine
learning algorithms that I couldn't do before. And I think the
barrier to entry to learn Python, you know, properly and build your
algorithms in Python is too high. So I definitely wouldn't have
been able to do the same experiments if I didn't use these drag and
drop tools. In saying that, I think for the hardcore data
scientists, being able to leverage, you know, the Azure Machine
Learning platform does enable you to focus on your code: write your
code, don't worry about the underlying hardware. So Azure Machine
Learning actually gets you to focus a little bit more on what you
should focus on, which is the science.
Yeah.
You shouldn't really worry about, like, how long is the machine
running? Did I actually turn it on and off? And then where are the
algorithms that I just built? Where are they stored? How do I
deploy them outside? Those are all boring IT questions, you know,
for a scientist who's focused on building an algorithm to help
their research. Azure Machine Learning actually focuses you more on
what you want to do.
Well, this has been fascinating today. Thank you so much for
sharing your research thoughts and the technologies behind that. It
really opened my eyes to a lot of these technologies in the real
world as well.
I have so many more questions I could be asking, Auda. You're going
to get emails from me now on all sorts of stuff. So thank you for
opening our eyes to just how much you have.
Yeah. Thank you so much. Thank you so much for having me. All the
best.
It's been fun.
Thank you. Thanks, Auda.