Welcome to the AI in Education podcast with Dan Bowen and Ray Fleming. It's a weekly chat about Artificial Intelligence in Education for educators and education leaders. Also available through Apple Podcasts and Spotify. "This podcast is co-hosted by an employee of Microsoft Australia & New Zealand, but all the views and opinions expressed on this podcast are their own."

Dec 2, 2020

Today Lee and Dan interview Auda Eltahla, Medical Genomics Research at Microsoft. We look at the vast quantities of data and the use of AI to crack the genome codes and develop new vaccines. 

Useful links: DNA Storage - Microsoft Research

 

________________________________________

TRANSCRIPT For this episode of The AI in Education Podcast
Series: 3
Episode: 13

This transcript was auto-generated. If you spot any important errors, do feel free to email the podcast hosts for corrections.

 

 

 

 

Hi, welcome to the AI education podcast. How are you, Lee?
I'm good, Dan. I'm good, Dan. It feels like we haven't been together for a while. It's uh good to be back on the air.
It sure is. Well, today we've got a fantastic episode on something we haven't done before: research, and the computing around that area with artificial intelligence, and the way high performance compute and machine learning are affecting the way researchers work. I suppose it's quite apt in this current environment with COVID. So before we get into the fascinating topic of research, let's have a listen to what's top of mind in the news. Have you seen anything recently?
Yeah, absolutely, Dan. And I know we made that commitment, didn't we, an episode or two ago, to think more about what's going on in the news, because there's so much happening. So yeah, I've got a couple of things. In fact, one that I know you will know very well, being in your role in education, was that we announced the winners of the Microsoft AI for Good Schools Challenge just recently. Were you there, Dan? Did you get involved?
I wasn't. I followed all the blog posts and things, and I helped some of the schools and mentored them through. But yeah, some amazing ideas came out of the students and the schools across Australia. Did any impress you when you were judging?
Look, yeah, so many of them. Broadly speaking, I was lucky: I got to judge the Western Australia contingent of the two divisions. So division one is students in years 7 to 9, division two is students in years 10 to 12. And what fascinates me, and we'll talk about the winners in a second, is that you see these things, Dan, and you realize that, with respect, these are kids. These are kids who are still forming themselves; they're thinking about the world and they're in school, but they know so much and they think so much and they understand so much about what is important to people. Because the idea is AI for good. It's about how does AI help people? It's just fascinating to think about the way that they see the world in a unique way.
I know. And for me, like the younger kids as well, you know, when I'm looking through some of the finalists here for the division one, the younger kids, um, you know, hugs for epilepsy, you know, an AI enabled teddy bear, shark detector with bubbles to save shark nets and, you know, drones under the water to fire bubbles at sharks to put them off. Like phenomenal like, you know, allergy watches like really like cutting edge stuff and and really kind of you you could see they care a lot about things that are happening in the environment.
I think that's the thing. It's that caring. I've said this before, and I said it when I was on the judging panel: you just get this sense that, for all the things we hear about the world being a pretty abysmal place sometimes with all the bad things going on, you see these kids and their thoughts are really just about doing good, about how could we just help somebody. And it doesn't have to be a big sector of society. It's just, you know, how do we help a small group of the world to be slightly better off using technology? So I think we're in a good place, Dan. You and I can retire well in a world looked after by smart...
I know. One of the ones that jumped out at me as well was Nitrobot. I'm reading one of the articles, and the description was using AI to analyze images of river systems collected by drone in an attempt to identify high concentrations of nitrogen and farms with excessive fertilizer runoff, allowing targeted intervention to protect the Great Barrier Reef. I'm like, "Wow, that is phenomenal." It really, really is. And I think this AI for good program is going from strength to strength.
Oh, look, it is. And I know the team that did it, J Mackerel and Travis and Changemakers, an amazing team, created something amazing. But we should recognize the winners. That was the news. Sorry, to the point. And I think what's really interesting this year, and I don't remember last year, but the winners this year: there were two. The division one winner, and I can't pronounce it correctly, alapolite, was from Ravenswood School for Girls in Sydney, and it was a way for people with physical disabilities or special needs to order clothes that fit. Now, I've seen before some fashion companies that have done special clothes for people with, say, muscular dystrophy or cerebral palsy who struggle to get clothes on; they have automatic buttons and things. But these girls thought about this idea that technology can scan the body and understand what makes a person unique. They thought about an individual's problem, not generically about how we make clothing that's suitable for fewer people. So that was great. But the one that really stuck out, and I don't know if you saw it, actually made the news. It was on Channel 7 News and made a major media impact. It was a young girl from, I forget the school she was from now, I apologize, but her name was Christine Dougl
From Seven Hills High School.
Yeah. Christine Duck, who herself is visually impaired, and she wanted to play games, you know, and I can appreciate that. So she came up with this whole idea of using AI and haptic gloves and VR and tools to help her and others like her to be included, to be a part of the experience of playing games. And we've always talked as a company about the importance of games as great connectors that bring people together and enable us to kind of normalize life and society. So it's just amazing that someone who lives in that world thought about the problem in such a big way, and a really deserving winner. So that was great news to see those out there, and great news that the whole challenge is going to go global next year. I think we talked about this at the event. This has been so successful in Australia that it's going to happen globally, and that's just a fantastic piece of news.
Oh, sure it is. I know. Anything else jumped up to you?
Yeah, just a couple of quick things. So just this week, the Australian Human Rights Commission, so Edward Santow, someone who we've worked with quite closely over time, published a paper on recognizing and preventing AI biases. Now, this is kind of an interesting topic because we're always talking about AI ethics and bias and how we must always think about everybody. But what I liked about their paper is it says, okay, yes, you've got to do these things, but it also kind of showed you how to do it. Now, it doesn't cover every scenario, but it helps us step a little bit out of the "this is important" into the "okay, this is how we do it." So we'll put a link in the show notes, but it's definitely a great paper to go and have a read. I've had a look at it. And there are some contributions from a range of people, CSIRO and the Human Rights Commission. So that was great to see published.
And the last one from me, Dan, for the news this week. It's a little bit dated, actually; it was a couple of weeks back. But I wanted to shine a light on a little bit of technology that we've just made available called Lobe, L-O-B-E. Now, Lobe addresses that problem of when you're trying to train a machine learning model. That's kind of hard to do because it requires you to understand (a) the data you're looking at, the data sets, and (b) the tools you're working with. And so Lobe is kind of a simplification of that data classification, data labeling and training approach, taking the coding out of it. You know, we talk a lot about low code, no code approaches to using technology. This is an example of how we're trying to think about that in the machine learning world. So again, we'll put a link in the show notes, but we announced that, I think, just at the end of October. It's a great little tool for everybody to be able to understand and get started.
Yeah, I'm looking at the article now as as you're talking and they're talking about like a backyard beekeeper utilizing machine learning to kind of uh organize and and kind of manage the stocks there of honey and things. It's fantastic.
It's like that citizen data science model. But actually, listeners, when you get to the website and you're looking at the beautiful picture on the... Did you see the video play when you landed on the website?
I know. It's it's a beautiful picture.
Yeah, it really is. And using visual recognition, you know, looking for hornets going into nests and things. It's fantastic.
It is. Yes, it's fascinating. So, that look there's lots more news, but that's the stuff that I thought was top of mind and uh we should get on with the conversation.
Let's do that. We've got Auda Eltahla coming to speak to us today from Microsoft. Now, he's got an interesting background. So, welcome, Auda. Tell us a little bit about yourself and your background and how you got into the role you're currently in.
Thank you. Thank you, Dan, and thank you, Lee, uh, for having me on this podcast. I'm really, uh, really excited actually about this. Um, uh, my I'll tell you my background. I come from a from a research background. So, uh, I actually spent about 10 years uh, doing research both uh, you know, I did my masters, my PhD, and I I was a full-time researcher for a bit before joining Microsoft. Um, and I was classically trained as a as a a biologist actually. So I was in the lab analyzing DNA and RNA uh and then taking it to taking that data to the computer and starting analyzing what does it actually mean? What does it what does it mean when we stretch that DNA um and how can we understand diseases? So I was focused on diabetes um and then um I was actually transitioned a little bit to understanding uh the viruses. So this is of course a very hot topic in 2020.
But I did my PhD on these tiny viruses called RNA viruses, which are very simple packages of nucleic acid. They're basically a small protein bubble, and inside them they have this genetic code. And of course everyone now is familiar with SARS-CoV-2, the virus that causes COVID. These viruses are just fascinating. They're such tiny packages of nucleic acid, but once they get in the body they actually start controlling the cell, and then they start making more and more of that virus. And what we were doing before I joined Microsoft was actually taking that virus and starting to understand the genetic sequence. I guess most people are familiar now with DNA, basically the molecule that controls how the cell functions, in every cell of your body. Viruses do have that component. They do have genetic material inside them. And what we did in the lab is we took those viruses, we took that nucleic acid inside the virus, and we started analyzing it. We started seeing what does it actually produce in terms of protein? How does it control the cell? And how do we as humans respond to those viruses? And it's all within data. It's in the world of data.
And can I can I ask a question on that then because it's fascinating. and and I suppose like when you you're talking there there's all these parallels around data science and I know we call it data science but I suppose some of the terminology like sampling and and and you know hypothesis and things like that it all comes from that science background so this is really exciting but when you're talking about collecting that data how how does a scientist go about you know you're looking you forgive my ignorance here you're looking under the microscope and you're looking at a protein how do you get where does that process start with that data collection point are you Are you writing things down or are you collecting samples through other means?
Yeah. So that's a very good question, and it's actually something that's faced now by basically everyone that does genomic research. So we invented a way to sort of get that DNA sequence out. And if you know, Dan, the DNA or RNA is composed of basically four letters. The building blocks are A, C, G and T, and it's all about how these are organized and how long they are; that controls what genes, what proteins are being made. And then that essentially translates to what a virus eventually looks like or what a cell would do. So in the past, I would say, 10 to 15 years, the technology that came into understanding or reading that genetic material has been revolutionized, really. And with that, the cost of actually reading that genetic material has dropped significantly, from tens of thousands of dollars to now, when you can read a whole human genome for a thousand dollars or even less.
right
um so but with that technology
yeah
came, you know, more in-depth reading of that genetic material. Every cell in your body has three billion bases of genetic material, every single cell. And we now have the technology to go and read not only Dan's genetic material, but actually go into individual cells and read the genetic material of that individual cell. So you can imagine what that requires. When we were in the lab, all of this data was coming to us from these massive machines, multi-million dollar machines, and it was coming to us in text files. So all of a sudden we were faced with this: we're lab scientists in biology, and a whole new science came into existence called bioinformatics. This is the field of being able to interpret that data. What do you do with these massive text files, gigabytes and terabytes of data? All of a sudden we biologists needed to understand and actually interpret this data. So this is where data science essentially came in. Sorry, go ahead.
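A note for readers on the file sizes mentioned above. The following is only a back-of-envelope sketch, not a figure from the episode: it assumes one byte of plain text per base and a hypothetical 30x coverage (each base read about thirty times), and it ignores quality scores and read headers, which real sequencing files also carry.

```python
# Back-of-envelope sizes for sequencing data stored as plain text.
# Assumptions (not from the episode): 1 byte per base, 30x coverage.

BASES_PER_HUMAN_GENOME = 3_000_000_000  # "three billion bases", as quoted above
BYTES_PER_BASE_AS_TEXT = 1              # one ASCII character (A, C, G or T)
ASSUMED_COVERAGE = 30                   # hypothetical: each base read ~30 times


def text_size_gb(num_bases: int, coverage: int = 1) -> float:
    """Return the size in gigabytes (10**9 bytes) of bases stored as text."""
    return num_bases * coverage * BYTES_PER_BASE_AS_TEXT / 1e9


single_copy_gb = text_size_gb(BASES_PER_HUMAN_GENOME)
raw_reads_gb = text_size_gb(BASES_PER_HUMAN_GENOME, ASSUMED_COVERAGE)

print(f"One genome as text: about {single_copy_gb:.0f} GB")
print(f"Raw reads at {ASSUMED_COVERAGE}x coverage: about {raw_reads_gb:.0f} GB")
```

Under these assumptions a single sample already reaches tens of gigabytes, which is consistent with the "gigabytes and terabytes" described once many samples or individual cells are sequenced.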
No, no, it's really fascinating stuff, Auda. I've never quite understood one principle, and I don't want to take us too far off the AI story. But you're talking about the fact that we can sequence the DNA. We now understand something that is so simple in principle: four proteins, I think, is that correct? I remember the movie Gattaca, that's the only way I remember the four proteins. And we understand that, and you talk about the complexity of the breakdown of those four pieces and the complexity of the cells in Dan's body. So we understand all that, yet we still struggle as a civilization with diseases like cancer and other things that are cell-level diseases. What's the gap? How is it that we can learn so much about DNA, with all these breakthroughs, yet diseases like cancer still elude us in terms of understanding how they work?
Yeah, that's another very good question, Lee. I think in our optimism, when we had the first human genome sequence, the Human Genome Project, which was an international collaboration where multiple labs all around the world contributed to that sequencing, we thought we were going to crack it. We thought once we're able to read the DNA of the whole genome, and we know essentially what the building blocks are and how that goes from DNA to RNA to protein, which is the process of cellular function, it goes DNA, RNA and then protein, we were going to solve it. I think most scientists, in trying to simplify life and trying to understand it, thought: that's it. Read DNA. We're going to understand all diseases. Cancer is gone. Diabetes is solved. It turns out that DNA is actually just one piece of the puzzle. It turns out that RNA is a whole other piece. It turns out that protein is a whole other piece. And going from DNA to RNA, there's a whole lot of regulation. Just because you have the same DNA doesn't mean that you will get the same RNA. And this is where we start getting into control of the genome. How does a cell control the genome going from DNA to RNA and to protein? And the same thing goes from RNA to protein. Just because you have the same RNA doesn't mean you have the same protein, and so there's a whole lot of complexity about how that process goes from DNA to protein and function, and vice versa. Then we started discovering that actually some of those proteins go back to control the DNA, and some of the RNAs go back to control the DNA. So I think that's really an exciting new field. Another thing that I started doing in the lab just before joining Microsoft is realizing, actually, Lee, just because we can sequence your genome doesn't mean that every cell in your body has the same genome, and even if they have the same DNA, it doesn't necessarily mean that they have the same RNA. So one of the most exciting fields that I started working on before joining Microsoft is being able to go into, say, your blood cells, Lee, isolate these cells one at a time, and sequence their genome one at a time. And you'd be surprised about the massive heterogeneity, we call it, which is the diversity of these cells. You could almost be looking at different people when you look at individual cells.
Wow. And I suppose this is massive data, right? What jumps out at me is that the data needs to further this field are huge. You were talking about terabytes of text files there. But if every cell in the body is going to be slightly different, or even vastly different, then the sampling of that data and the data that we're going to bring in, you know, we can only solve that with technology.
Absolutely. Absolutely. And how much you sample is dependent on technology. I mean, only technology can give you that resolution, where people are now starting to talk about a cell atlas, essentially sequencing cells from all over the body to be able to build the map of individual cells in the body.
But you're absolutely right. I mean, the challenge that comes with that: on one side it's very exciting because we can now get all this data; on the other side it's actually really challenging. What do we do with all this data? How do we even interpret all of these different genomes at the same time? So this is where we started using AI. AI is very useful in that space. In the simplest form, we were doing supervised clustering and unsupervised clustering. So this is what we started doing in the lab. Essentially, without making any assumptions, you can put the data from all these cells together and then tell a computer, I want you to cluster these cells, basically separate them based on what the algorithm thinks the similarities between them are. So this is a form of machine learning called unsupervised clustering. Say I sequence cells from your lung and also from your different muscle tissues and blood. It's surprising how, by putting that together and letting AI take care of it, you can actually see how different cells tend to cluster together. So it separates the blood cells from the lung cells, and even within the blood cells, different types of blood cells.
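A note for readers: the unsupervised clustering described above can be sketched in a few lines. The episode does not name an algorithm, so this example uses k-means as a stand-in, and the cells, the two genes and their expression values are made up purely for illustration. Real single-cell analyses work with thousands of genes and dedicated libraries.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns one cluster index per point."""
    # Deterministic start: take the first point, then repeatedly add the
    # point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each cell joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


# Made-up expression levels of two hypothetical genes (gene_a, gene_b) per cell.
cells = [
    (9.1, 0.8), (8.7, 1.2), (9.4, 1.0),  # resemble one cell type
    (1.0, 8.9), (0.7, 9.3), (1.3, 9.1),  # resemble another
]
labels = kmeans(cells, k=2)
print(labels)
```

No cell-type labels are given to the algorithm; the first three cells and the last three end up in separate groups only because their measurements are similar, which is the "without making any assumptions" point made above.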
Wow.
Hey, so Auda, I have a question for you, and I didn't prepare this one for you, so I'm sorry if I'm going to throw you under the bus. There's a project at Microsoft, I'm not sure if you've come across it, called project pal, which is this idea of using the DNA encoding structure to store data. The principle being that DNA encoding is so densely packed, so tightly coupled, that you're able to store so much data in a very small piece of DNA. So if we use that same model, could we store so much more data in an entirely more compressed way? I get the idea of it, but I don't really understand the principles of it. Can you, for my simple brain, explain how DNA is such a dense approach to encoding data?
Absolutely. Absolutely. So I think honestly this is one of the most exciting projects coming, and I really can't wait to see it come into the market and people actually leveraging that technology. The idea is really quite simple. If you think about classical data and how it's coded, it's ones and zeros, right? So the possibilities are that you get one, zero, or one, one, zero, zero, one. If you go into DNA, you all of a sudden have A, C, G and T. So you can imagine the possibilities now of encoding a message with four letters instead of two are incredibly more complex and more vast.
And with that, the technologies that I described earlier, being able to read the DNA, came at the perfect time for this data technology, because we can now very cheaply read that DNA. So we store the message or the data that we want in the form of DNA. And by the way, DNA can be synthesized. We can go into the lab and say, you know what, Lee, I want this message written in A, C, G and T. I can go into the lab and I can make that message and store it.
And because it's molecular, it's very small in volume. I'm not sure if you've seen it, Lee, but there's a picture of a Walmart warehouse converted into a DNA tube, essentially, that you can hold in your hand.
which is fascinating
but and it's hard to get your head around it but it's amazing to think that that's the compression ratio if you want to use a simple simple way of thinking about it
And I can tell you, when working in the lab, mostly people that work in the lab handle invisible solutions. They basically just transfer one solution to the other. We don't see DNA; it's just so incredibly small and compact, and the amount of data that could be coded in that is incredibly vast.
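A note for readers: the "four letters instead of two" idea can be made concrete. With four symbols, each base can carry two bits, so one byte fits in four bases. The sketch below assumes an arbitrary mapping (A=00, C=01, G=10, T=11) chosen only for illustration; it is not the scheme used by any particular DNA storage project, and practical systems add error correction and other constraints on top of the bare idea.

```python
BASES = "ACGT"  # assumed mapping for illustration: A=00, C=01, G=10, T=11


def encode(data: bytes) -> str:
    """Turn each byte into four bases, two bits per base, high bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Reverse encode(): every four bases become one byte again."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


message = b"Hi"
strand = encode(message)
print(strand)                    # prints CAGACGGC
assert decode(strand) == message
```

Two bytes of text become eight bases, and decoding the strand recovers the original bytes exactly.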
So let's go back onto the story then. We've got this data. We've seen how it's been collected. We know there's vast amounts of it. We realize now that it's not just about that DNA; it's about the fact you've got to sample an entire body, which has got billions and billions, I don't even know how many, cells inside it. So we need to sample that, and we can use AI to almost fill in the gaps. So what are the technologies, then, that you find useful or that you're seeing being used to support that research element going forward? Things like high performance compute?
Absolutely. This is actually where my link to Microsoft came into play. So I naturally started working with these massive data sets. I started learning the skills to analyze these data sets. But then all of a sudden, with these huge data sets, we were like, wait, we can't actually do this analysis on our laptops. If you can imagine these huge text files and the huge sequences that we needed to analyze, these have to be loaded into memory. So all of a sudden we started talking about high performance computing. We needed machines with vast amounts of memory to be able to load this data. And then once you've got this sequence, you essentially try to do one of two things. You could try to compare it to other sequences out there. We have reference sequences, and data is becoming more and more available. So you can try to compare that massive text file that you received to sequences that are available on the web.
and and and when when you're talking about those things just to give us a bit of sense of perspective here when you're talking about these large text files and the amount of memory the what what are we what are we looking at what are we looking at here? What kind of real figures are we looking at?
Yeah. So the typical desktop that you would have might have, I don't know, 16 gigs or 32 gigs of RAM. That all of a sudden was incredibly small for trying to analyze these files. So we started talking about 100 gigs of RAM and 200 gigs of RAM and even more, and we needed 64 cores or even higher. So we started going to high performance computing. Most universities, in Australia certainly, but I think around the world, would have a cluster of compute hardware. This is high performance compute for people that just need big machines, essentially. But then even that wasn't enough for us, and in fact, you can imagine the entire university is trying to leverage that hardware. So we started going to national computing infrastructure. Most countries that focus a lot on research would have a computing infrastructure that's accessible to researchers,
but once again, all of the researchers around Australia, for sure, are trying to access this national computing infrastructure. So one of the challenges that we faced is we had to go onto those machines and just wait. Sometimes we had a conference to present our results at, or we had a paper or, say, a grant. As you know, Dan, as a researcher you have to analyze your data, publish, present at a conference, apply for a grant to get the money, and do that again. So we definitely had those deadlines. When it came to applying for a grant or publishing a paper, we really needed access to these resources right then and there. And with research, of course, time really is money, because someone else could just make that discovery, and it happens all the time. In fact, it happened to us twice when I was a researcher.
No.
So this is where the cloud became a really attractive solution, because the resources are right there, especially the high performance computing. I can just go upload my data, turn on that machine, get as many machines as I want with as much RAM and as many cores as I want, and shut them down, and all of a sudden I've got my results. So this is where we started learning a little bit about cloud computing. Of course, renting a virtual machine is the simplest form of cloud computing. But once you get into that world, all of a sudden you discover, hold on, I can actually leverage massive data warehousing. I could store not only my sequence data set, but also the characteristics of the patients that I'm collecting. And there's this misconception that the cloud is not secure enough for storing this sensitive data. But that's completely untrue. In fact, storing that data on a PC, on your desktop when you're at uni, is way less secure than putting it in the cloud.
So Auda, that's really, I mean, obviously that's a lot of what you do in your job here at Microsoft now, helping universities and researchers understand that difference. So as you've switched over from being on the research side to being on the technology side, what's your view of the state of research adoption of cloud-scale compute, of any cloud, of any technology? Is it getting there? Are we moving forward? Is progress being made, do you think?
I think absolutely there is progress. I do think it's slower than in other spaces I've seen.
is that is that here in Australia only or um globally?
No, I think it's a worldwide problem. And it's mostly because, I would say, there were a few news articles published about the sensitivity of the cloud and issues about leaking, whether it's leaking images or leaking data, all around the world. When you're a researcher and you're in charge of patient information, you really feel responsible for that information. We try as much as possible to de-identify that information, but as you know, Lee, it's still prone to error, right? You can still make mistakes. And of course, you're also getting funding from the government, taxpayers' dollars. So you really feel like you're responsible for all this data. On one hand, you're trying to protect the patients that you're trying to study, and on the other hand, you're trying to save on the cost and also make sure that you don't lose this data. So we felt very comfortable just putting our data on little hard drives and putting them in the drawer, where at least we could see them right there.
Yeah. It's that idea that you've got it in your hand so therefore it's secure which of course we know is not always the case but that's the mindset. Yeah.
And I was going to say, I think combining that with what we were hearing about the cloud and the vulnerabilities in the cloud, or at least what we saw, that sort of just mixed together. We felt a little bit, you know, I think scared of the cloud: if we put our data on the cloud, we're going to lose control of that data. We don't know where it lives, and we don't know who has access to it.
I suppose when we look at finance, the financial institutions of the world, it was all about who could do that operation the quickest, who was closest to the exchange. I want to put money on this particular stock or share nanoseconds before somebody else. I want to buy it quickly and take it out quickly. So it sounds like the same race in science. So, two things I'd like to ask you there. Firstly, you talked about presenting your papers back quickly. Is there a question of legitimacy when you've done a lot of the analysis with compute? If you go in front of a board and you say, I've done this analysis and I found X, and I did that using artificial intelligence, for example, do they then say, well, you need to justify what type of machine learning you've used? Because I could present any data back to somebody using a machine learning algorithm and it could be wrong. And the second angle to that is that our universities are looking at an array of different companies, because everybody's innovating to the top, I suppose. One day Microsoft might be doing something amazing with high performance compute, and the next thing Google might, and the next thing Amazon might. So how do we really get legitimacy into the results? And also, what's the selection of tools that we'd use?
Absolutely. Now, that's a great question, and I think it's absolutely top of mind, particularly for people in research. I mean, reproducibility is a huge part of science. Being able to do an experiment and publish the results and make a massive claim is not enough, really, to be accepted by the community. Someone else needs to be able to do exactly the same experiment and get the same conclusions if they use the same tools. Right?
Mhm.
So, and now with this explosion of data, and particularly in genomic data sets, that really goes all the way from the sample that we're getting, to the data, and then the tools, as you said, that we use to analyze this data, as well as the hardware or, in our case, the cloud resources that we use. So there's two things I think to say here. The tools are probably the most complicated thing. There's definitely efforts to standardize the tools, particularly in bioinformatics, but I'm sure all across research, because these tools came, you know, in an organic response to the explosion in the volume of the data sets. All of a sudden we scientists were faced with these massive data sets and we had to invent the tools, and if we knew a little bit of Python or R we could just make these tools on the go. So I think classically, at least when genomics as a field became a little bit more widely available, people were not trained as programmers or software engineers. They were just people that knew Python and developed amazing software. But then came the problem of, hold on, actually how do we make sure that these tools are available, and whether the different researchers were actually using the same tools, because if you use different tools you will get different results. So there were efforts by the likes of the Broad Institute in the United States, and in fact Microsoft has actually worked with them to get these tools sort of contained in a service. We call it the Microsoft Genomics service, but there's actually work being done to expand on that. And that's just essentially a set of parameters, or a set of tools, that were stitched together so that if you use them you should get the same results.
So that's the one thing about the tools. Now with the hardware and the software and, you know, the packaging of it, there's certainly now talk about what version of this software you're using, or what hardware, or how much RAM you're using, and does that actually make a difference? And so with that, I think, there's that effort of containerization. I think all the big cloud providers provide these containerization tools and ways to execute them. I think that has massively helped the scientific community, because you can actually take that set of tools, package it, and if I say I used this hardware, you should be able to use the same and you should be able to get the same results.
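As a rough illustration of the versioning point above, here is a minimal Python sketch (an editorial addition, not something shown on the podcast) that records the interpreter version, operating system, machine type, and the versions of a few packages, so that two runs of an analysis can be compared. The function names `capture_environment` and `same_environment` and the package list are hypothetical, and a real container image would pin far more than this.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }


def same_environment(a, b):
    """True when two captured manifests describe identical environments."""
    return a == b


if __name__ == "__main__":
    # Hypothetical package list; replace with the tools an analysis depends on.
    manifest = capture_environment(["numpy", "pandas"])
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Publishing a manifest like this alongside the results gives another researcher something concrete to compare against before trying to reproduce an analysis.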
I think it's a really interesting area, Auda, and if I'm thinking about this from an AI perspective, I think a lot about that responsible, ethical, transparent, accountable AI world that we want to make sure is part of the world we live in. And when you're talking about research work, research is transparent and peer reviewable and repeatable, and it needs to be proven by multiple people to be validated as a good outcome. Yet one of the problems of AI is we create a model, there's data, and then we allow it to learn from the two to create a different outcome. So what's your view on, you know, is AI a good, ethical way to create peer reviewable, transparent scientific research? Because they seem to be almost at loggerheads, in that one expects continuous, similar results and one expects a changing outcome based on the data you feed it.
Yeah. And I think, I mean, at the end of the day AI is an algorithm that ingests data and gives you predictions, or presents data somehow in a way that you just didn't see before. And I do think it's slowly coming in. Particularly in the genomics space, we are now seeing tools that are leveraging AI. Classically, most of the tools were just high performance computing: they leveraged high memory and just crunched numbers. Now that AI is coming into play, there are some tools that are gaining popularity. And I think the nice thing with science is that most of this is open source. I'm actually very proud that at Microsoft we've endorsed open source to the level that we have, you know, GitHub being acquired by Microsoft. And we as scientists publish most of our source code, mostly on GitHub. So not only do you have access to the data, which has now become a requirement: if you actually need to publish your science, most journals now require you to make your data available. So you go from scratch: give us that data, and give it to the rest of the scientists. But the tools and the algorithms that you use to analyze this data also have to be made available.
It's about disclosure of everything then. It's putting the tools and the data out there, not just the findings.
Absolutely.
And, you know, there is a massive incentive for scientists as well to make their tools available. I think it comes all through scientific history that the tools used for science are actually more popular than what most people think, because when a tool is easy to use and widely available, well, scientists go by citations. So all of a sudden you see all these scientists all around the world using your tool and referencing your work. That's great validation that it's actually a good and, you know, polished tool. So making that available, I think, particularly for AI, I'm very optimistic about it.
And the last question from me then. You mentioned tools there. I suppose we're trying to democratize a lot of AI and bring out things like automated machine learning, AutoML, and lots of those kind of drag and drop interfaces. The fascination for me was when you mentioned Python and R earlier on. One question is, why did we start using Python and R? Because I used to teach kids that at school, and Python for me was like one of the worst programming languages, because you can declare a variable and then change it halfway through, and it just wasn't very structured. So why did we land on Python and R, and are there any other tools that are democratizing this field for you?
Yeah, no, it's a fascinating topic, and I think throughout my time in research I saw the popularity of Python only going up. I have to say I didn't know about it until a few years into actually analyzing my genomic data sets. But I really think it comes from data science in general. It was open, first of all. It was very powerful to be able to crunch numbers and build models, and I think that's exactly what the scientists wanted at the time. Being able to package your code into these little modules in Python was hugely popular. As you know, Dan, now it's become the building block of most machine learning algorithms, all these little packages in Python. And in saying that, there is also the world of R. With all this data, statistical analysis came naturally into play, and I think all the statistical packages that were available in these languages just made them a natural selection for these types of research.
And what about the other interfaces that are coming up? Is that helping scientists and researchers, you know, with the drag and drop interfaces and all of these kind of things?
Yeah, I mean, I definitely think it is helping, because not everyone's going to be a hardcore machine learning scientist or a data scientist, especially because we do need people that understand the biology, and you can imagine how complex it is to be able to understand the biology as well as the data science field. So you do need that breakdown, and I think that making machine learning accessible to the different skill levels is incredibly useful. I think of myself: I'm trained, as I said, in the lab. So I was in the lab preparing and handling samples and DNA, but getting into the machine learning space with that drag and drop, I think, is incredibly useful. I can all of a sudden actually build machine learning algorithms that I couldn't before. And I think the barrier to entry to learn Python properly and build your algorithms in Python is too high. So I definitely wouldn't have been able to do the same experiments if I didn't use these drag and drop tools. In saying that, I think for the hardcore data scientists, being able to leverage the Azure Machine Learning platform does enable you to focus on your code: write your code, don't worry about the underlying hardware. So Azure Machine Learning actually gets you to focus a little bit more on what you should focus on, which is the science.
Yeah.
You shouldn't really worry about, like, how long is the machine running? Did I actually turn it on and off? And then where are the algorithms that I just built? Where are they stored? How do I deploy them outside? Those are all boring IT questions, you know, for a scientist who's focused on building an algorithm to help their research. Azure Machine Learning actually focuses you more on what you want to do.
Well, this has been fascinating today. Thank you so much for sharing your research thoughts and the technologies behind that. It's really opened my eyes to a lot of these technologies in the real world as well.
I have so many more questions I could be asking, Auda. You're going to get emails from me now on all sorts of stuff. So thank you for opening our eyes to just how much you have.
Yeah. Thank you so much. Thank you so much for having me. All the best.
It's been fun.
Thank you. Thanks, Auda.