Physics-informed Machine Learning for Discovering Knowledge in Hydrology - Transcript

[00:00:00] Bridget Scanlon: Welcome to the Water Resources Podcast. I am Bridget Scanlon. In this podcast, we discuss water challenges with leading experts, including topics on extreme climate events, overexploitation, and potential solutions towards more sustainable management. 

I would like to welcome Chaopeng Shen to the Water Resources Podcast. Chaopeng is a Professor at Penn State University and leads a group called Multiscale Hydrology Processes and Intelligence. His group focuses on data driven and process-based models to support decision making across multiple scales from catchment to global scales. Applications of his work include rainfall runoff routing, ecosystem analysis, and water quality modeling.

Thank you so much, Chaopeng, for taking the time today.

[00:00:55] Chaopeng Shen:  Thank you very much, Dr. Scanlon. Very pleased to be here. Also, I would like to say thanks to Dr. Alex Sun, who connected us. 

[00:01:03] Bridget Scanlon: Right. So in this podcast, I hope we will discuss basic parameters used in artificial intelligence and machine learning and deep learning within the context of hydrology and Chaopeng will help us define various terms, describe different approaches and discuss various applications in hydrology and environmental science and then finish by discussing path forward and workforce development likely.

Awesome. Let's get started. Right. So, Chaopeng, you have been in this field for many years now, and I hear from the community a lot of different things about AI and machine learning, and we see it in self-driving electric vehicles. Everybody has a smartphone. A lot of people are using ChatGPT. Even my next door neighbor, who's a realtor, was telling me she was using it. Very useful. Yeah. So it helps us in all sorts of ways. And you have been at the forefront of the application and incorporation of AI/ML into hydrology.

Maybe you can describe a little bit of what that has been like for you and how difficult maybe it was initially. 

[00:02:23] Chaopeng Shen: That's a good point to start. So I was a process-based modeler by PhD training, and it was getting into my third, almost the fourth year, actually, third year of my tenure track clock, when I heard about this fascinating technology out of AI that could beat the traditional methods across many different competitions, and I got intrigued.

And I started looking into this, and the thing that struck me the most is that these models can actually get trained to perform certain functions completely without human supervision. That was fascinating to me, because it means, in a way, something like a reset opportunity for us to re-examine our assumptions without strong human bias. So that's what was fascinating and why I got into this.

So we actually, when I got in, it was a much different world than it is today. So TensorFlow wasn't even published. And we were actually, we started working with this esoteric language called Lua. That's the first version of Torch. I actually convinced my student, we learned Lua together. And the first version of our Long Short Term Memory (LSTM) was actually implemented in Lua.

So we wrote this LSTM to model soil moisture. And it took us almost eight months and we were going nowhere, because it wasn't exactly clear what you could get out of this, and it wasn't exactly clear what the goal was either. The goal to me back then was that I thought we could improve the predictability.

And indeed, it took us about eight months before we started to see some results. So those eight months were somewhat of a torture. But I'm glad that we made it out of that. Yeah.

[00:04:05] Bridget Scanlon: So, so that's nice to hear. I mean, now we feel like we've always had it. It's like the hedonic adaptation in economics.

Once you have it, then you think you've always had it, but it wasn't so. And so it has evolved. And that was probably like the 2017, 2018 time period or later.

[00:04:22] Chaopeng Shen: That was around, actually, that was around 2016 when we started to get results. So our first paper came out in 2017. Right. Yeah. I just started to write papers and proposals.

I mean, the initial few ones we would obviously run into a lot of questions, which are good questions. And there are some concerns, legitimate concerns. So we've been trying to address those along the way. 

[00:04:47] Bridget Scanlon: Right. So maybe we can back up a little bit, Chaopeng, because maybe some of our listeners may not be familiar with some of the terminology that's used in AI ML.

I mean, we use artificial intelligence and machine learning as if they were interchangeable terms, but Alex Sun wrote a very nice review article and kind of defined some of these terms. And I guess the definition he gave for AI was use of computers and machines to imitate human like behavior. Um, and then he mentioned that machine learning is considered a branch of AI that aims to train machines to learn and act like humans.

And then deep learning, which came along in the mid 2000s, refers to a newer generation of machine learning algorithms for extracting and learning hierarchical representations of data. So I think it may be nice to have a bit of background when we are talking about these things, so that the listeners can understand where we're coming from.

Do you agree with those definitions Chaopeng?

[00:05:48] Chaopeng Shen: I mostly agree. And deep learning was mostly about these deep neural networks, which have a large depth and a huge capacity, and they are supposed to be trained on a massive amount of data and sort of acquire their functions through training, right?

It's like you started with a blank slate and it gained all of its functions through learning from data. However, the concepts of AI keep evolving. To me, when we talk about AI and ML nowadays, I start to think of AI as more advanced machine learning and as entities that actually have some kind of intelligence, really human-like behavior, like what Dr.

Sun said. So for example, the self-driving cars you mentioned, or ChatGPT or large language models, really start to exhibit some intelligence. So I would say we can go by the textbooks, but sometimes when people say AI, they're referring to a very intelligent system.

[00:06:48] Bridget Scanlon: Right. And I guess one reason for a lot of emphasis on AI and ML these days is because we have what people call big data. And so some people say, well, what are big data? And I guess Alex's review cited the National Institute of Standards and Technology, and the NIST definition was the five V's associated with big data: volume, variety, velocity, veracity, and variability. So maybe you could just chat briefly about some of those parameters.

[00:07:26] Chaopeng Shen: So the volume is obviously that you have to have a lot of data, but there's also gotta be a big variety, and that's also important, right? So repeated data that does not carry much information is not as useful as data that carries a huge variety of information, because the model learns by looking at how, when one input changes, you get a certain response, right? So through many different kinds of combinations, the model gradually learns what this one input really means. So that variety aspect is quite important.

And then the velocity, to me, refers to how fast you can train the neural network. And if that is too slow, obviously it's not very functional, right? The veracity is talking about, I think, the accuracy of the data, the truthfulness of the data. Right. Yeah.

[00:08:19] Bridget Scanlon: So, so we have a lot of different types of data these days and I mean, even in hydrology, we're monitoring data networks and model data.

And then also we are starting to incorporate more social science data. And that may come in the form of tweets for flooding or other things. So trying to merge all of these and harmonize them, and to make them useful for forecasting and prediction, is a real challenge. So interesting times, and the more stuff we have, the more we have to do with it.

[00:08:54] Chaopeng Shen: Exactly. Like you say, there's unstructured data, and you can apply different kinds of models to it, or you can apply pre-processing techniques. So, yeah, interesting times.

[00:09:05] Bridget Scanlon: Right. So unstructured data would be things like emails and videos and crowdsourcing data and things like that. We can't afford not to look at all types of data when we are trying to constrain and understand what's happening in, for example, in a flood event or something like that.

How do you react to people who say, well, AI, ML, it's just statistics. There's nothing new. It's just basic statistics. Or others make comments like, oh, it's a black box. We can't interpret it. We don't know what the causal relationships are. But then others acknowledge that they feel like it's transformative.

How do you respond to people who have those kinds of ideas?

[00:09:43] Chaopeng Shen: So I think that's a great question to expand. I want to say that, first of all, in the future with the way I see it, it's not going to be a fancy technique. It's actually going to be like an everyday infrastructure. It's like an operating system. How we interact with data. It's like the cell phone that you use every day. It's going to be very, very common. It's nothing cool about it because you got used to it. That's how I see this. 

Secondly, I think even the current AI is way more advanced than statistics. In fact, there's a funny saying, some students would say, you can learn AI without knowing much about stats, right?

Because the fundamental stats they use at the user level are somewhat limited. But the statistical equations, in my mind, can model no more than second or third order relationships in the data, right? And you have multiple dependencies, and that could be complicated enough.

However, the AI can really simulate highly, highly complex interrelationships. These joint distributions like our human face, right? So how does AI produce, generate an image of a human face? And there's all these different correlations between the colors of the eyes and what that means for a race and for our facial construct.

So those are highly complex interrelationships that AI can capture, and I don't think it's even possible to begin to describe those things with stats. 

[00:11:12] Bridget Scanlon: I think the other aspect I was suggesting, some people say it's a black box and they're a bit allergic to that. So, yeah, 

[00:11:18] Chaopeng Shen: I mean, right now, I do not see it as a black box.

A lot of things start to make a lot of sense because you have a generic architecture that gets all of its functions from data. It's not very productive to think of it as a black box. In fact, you should see it as a very transparent box. I think the most productive way that I see it is to see it as different genes.

Think of AI as a collection of great modeling ideas surrounding learning theory, right? A collection of related ideas, and each idea gives you certain behavior, and that's a good gene. For example, one of the ideas is that you have to learn from big amounts of data so that the relationship you learn is transferable, it's robust, right?

That is a good gene. The other thing that I would say would be differentiable programming, because that's the only way you can model complex relationships and train them in a reasonable amount of time. So that's a good gene. Once we see these different genes, we can take them and use them and recombine them in different ways that we like.

So that's how I perceive AI right now. 

[00:12:32] Bridget Scanlon: Right. And I think an understanding of basic statistics wouldn't go amiss, because when I hear about some of the different algorithms in machine learning or deep learning, they're somewhat analogous to some types of basic statistics, like principal component analysis or cluster analysis.

[00:12:50] Chaopeng Shen: Right. There's definitely a strong linkage there. I agree. 

[00:12:54] Bridget Scanlon: Right. So understanding that statistics helps, but statistics oftentimes has a lot of constraints and assumptions, and it seems like AI doesn't have many of those constraints, so it can come up with ideas that we might not have thought of. So I think maybe early on with AI, there was a lot of emphasis on image recognition, like you were talking about the facial attributes and stuff like that. And then in hydrology and environmental applications, we also use image recognition, I guess, for satellite data and remote sensing of land use change and all that sort of thing.

But then another aspect would be time series analysis and prediction. You mentioned earlier LSTM, long short-term memory. Maybe you can describe those two aspects of AI.

[00:13:44] Chaopeng Shen: Absolutely. So these are the two prototypical applications that I think newcomers should get familiar with first, right?

So the first one is image recognition. That's the original deep learning: convolutional neural networks, which are built to read an image and tell you what that image is about, or identify certain pixels on that image. This is called segmentation, where it marks out certain regions on the image that belong to a certain object.

Then you can already find massive amounts of applications with this convolutional neural network. For example, in remote sensing, this is very similar to the original image recognition, right? So they use this method to identify land use or objects, landslides, or solar panels from satellite images. And in fact, they can borrow these trained networks, the networks that were pre-trained on, like, everyday pictures, these images that people take with their phones, and they can transfer knowledge from that model to a satellite image recognition or satellite segmentation model. So that transfers relatively easily. So I think that's one prototypical application. I've also seen colleagues from Switzerland who use the technology to identify microorganisms: they have a microscope, they take pictures of these microbes that flow through, and they identify what these microbes are, and after a certain amount of counting, they know what's going on with their water quality, right? So those are all very beneficial applications.
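
To make that transfer-learning idea concrete, here is a minimal sketch in Python/PyTorch, assuming torchvision is available. It starts from a ResNet pre-trained on everyday photos and swaps in a new output layer for land-cover classes in satellite patches; the class count and the dummy batch are placeholders, not any specific dataset from the conversation.

```python
# A minimal transfer-learning sketch: re-purpose an ImageNet-pretrained CNN
# for satellite land-cover classification. Class list and data are hypothetical.
import torch
import torch.nn as nn
from torchvision import models

NUM_LAND_COVER_CLASSES = 5  # assumption: e.g. water, forest, crop, urban, bare soil

# Load a ResNet pre-trained on everyday pictures (ImageNet).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor; only the new head will be trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classification layer for the new task (satellite land cover).
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_LAND_COVER_CLASSES)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch of 224x224 RGB satellite patches.
patches = torch.randn(8, 3, 224, 224)           # stand-in for real imagery
labels = torch.randint(0, NUM_LAND_COVER_CLASSES, (8,))
logits = backbone(patches)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```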

The second one is time series prediction. And this is based on like a primary task that hydrologists were tasked to do, which is like predicting these hydrological variables, how they evolve over time, right? So regarding these two, we do have like tutorials online to help people to get into this.

So with the long short-term memory, LSTM, it's essentially like a self-trained memory system. After it has seen enough data, it essentially learns a response function of either soil moisture or streamflow or water quality variables to the weather forcings and to the landscape characteristics, right?

So after you train with a lot of data, it learns how, for example, soil moisture responds to rainfall, right? When there's rainfall, soil moisture rises. So it captures those patterns and can predict such variables. And LSTM is really the first deep learning model that the hydrologic community got very familiar with, but you can also do transformers, although in the benchmarks that we've run, it's very difficult to beat LSTM.

But, of course, I think when the problem is structured differently, the results might change and when we work with LSTM, there are some benchmark problems, for example, a soil moisture benchmark problem that we provided in our tutorial. And you can try different algorithms on it where it actually takes the soil moisture that's measured around the world by a satellite, and you have also rainfall data that you collect, then you try to capture their interrelationships. It could also be in situ data that's thrown in.

And the other is streamflow data. There's the CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) dataset, where you have about 600 basins across the United States. Really, we actually have big data over the US, right? The USGS collects data at something like 9,000 stations, and the whole world maybe has 40,000 stations.

So with those data, we can train the neural network, the LSTM or whatever neural network we're using, to predict how streamflow responds to rainfall. And that is a very useful application in terms of streamflow forecasting, like flood forecasting or drought forecasting. Nowadays we also have more advanced versions with hybrid models, but LSTM was a really good entry point. Of course, now, in the current climate, you can also try to play with some generative models; gradually there are people in the community who have started doing that.
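
As a rough illustration of the LSTM setup described here (not the group's actual tutorial code), the sketch below assumes PyTorch and feeds daily forcings plus repeated static basin attributes into an LSTM that predicts a streamflow or soil moisture time series across many basins at once. All tensor shapes and data are made-up placeholders.

```python
# A minimal LSTM rainfall-runoff sketch: forcings + static attributes in,
# a daily target time series out, trained across many basins together.
import torch
import torch.nn as nn

class RainfallRunoffLSTM(nn.Module):
    def __init__(self, n_forcings=5, n_attributes=12, hidden=64):
        super().__init__()
        # Static attributes (slope, soil texture, ...) are repeated along the
        # time axis and concatenated with the forcings at every time step.
        self.lstm = nn.LSTM(n_forcings + n_attributes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, forcings, attributes):
        # forcings: (basins, days, n_forcings); attributes: (basins, n_attributes)
        attrs = attributes.unsqueeze(1).expand(-1, forcings.shape[1], -1)
        out, _ = self.lstm(torch.cat([forcings, attrs], dim=-1))
        return self.head(out).squeeze(-1)        # (basins, days) predicted flow

# Dummy batch: 32 basins, 365 days of forcings, 12 static attributes each.
model = RainfallRunoffLSTM()
forcings = torch.randn(32, 365, 5)
attributes = torch.randn(32, 12)
observed = torch.randn(32, 365)

pred = model(forcings, attributes)
loss = nn.functional.mse_loss(pred, observed)
loss.backward()                                   # gradients for all LSTM weights
```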

[00:17:45] Bridget Scanlon: Right, so that means generating models just from data, right? 

[00:17:49] Chaopeng Shen: So generative models would actually generate, for example, a time series based on certain information you tell it.

[00:17:57] Bridget Scanlon: Okay. Okay. So I think sometimes maybe we're not aware of how AI is being used, but I think the Landsat data had a lot of issues with cloud coverage.

And so now they can correct those data using AI. So I think that's really nice. And then you talked about time series and streamflow, but I think in some of your other work you incorporate maybe temperature data and other data to constrain it and to improve the forecasting. Is that correct?

[00:18:31] Chaopeng Shen: Oh, we found that for a lot of the water quality parameters, such as water temperature and dissolved oxygen and nitrate, there is actually some effort led by some collaborators, and we provided some technical assistance; we learned this together, right? We found that these water quality parameters are actually very well captured by LSTM.

So if you're talking about water temperature, for example, the performance was like perfect. Actually, the title of our paper said "exceptional performance," and it really is. It's like an R-squared of 0.99, 0.997. We would check ourselves many times before we published those metrics.

[00:19:12] Bridget Scanlon: So what did you start off with?

Then you had stream flow data and you had temperature data and then maybe explain a little bit about what you actually did. 

[00:19:20] Chaopeng Shen: So for example, water temperature, it's structured very similarly. The input is weather data, the ordinary ones, precipitation, temperature, radiation, stuff like that. And then you also have some characteristics of the watershed.

If you're trying to model, for example, we would have slope, soil texture, stuff like that. Those characterize why one basin looks different than another, right? And you put this all into the LSTM, and the target data is the time series of soil moisture or, in this case, the water quality indicators. But it's very important that we train such a network over hundreds of sites together. This is very important. If you train it over one basin, you get somewhat of a faulty model because it's overfitted to that dataset, right? So learning across the sites is critically important. After you train it, you plot the time series and compare with the observations, and of course, sometimes the model is overfitted.

It has some of these faulty or artifact fluctuations, and we can add some regularizations to control that. Some people also try to run multi-objective predictions where the same model is trying to predict multiple species. Some people reported benefits and others reported no benefit.

That's a beneficial thing to try at least. You can also use one to predict the other. In fact, I would be interested to see how attention models can be used in this kind of scenario.

[00:20:50] Bridget Scanlon: Right, right. So another aspect of working with these datasets is that some of the machine learning algorithms use supervised learning or unsupervised or weakly supervised learning.

Maybe you can describe those a little bit, and how they relate to fundamental statistics, the proxies or whatever. Is that correct?

[00:21:11] Chaopeng Shen: The AI community has an annoying amount of jargon, and that's very confusing to newcomers. I spent a lot of time learning this as I got started, so I'm happy to discuss them.

However, as I said, I think these are actually some of the good genes that we want to inherit and recombine into our domain. So supervised machine learning is essentially learning a mapping relationship, right? From some input X to Y. That's supervised machine learning. You tell it what Y is, and that relationship can be highly complex.

You can give it a deep neural network, transformers or whatever. You can also give it something as simple as a random forest when you have attribute-based data, right? So some attributes, typically low dimensional, 40, 30, or 12 attributes that lead to one or two outcomes. You can do a random forest.

You don't necessarily have to do a deep neural network. You can also do like boosted regression trees. So that's supervised machine learning, and we've been doing this for a long time. 
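
For that attribute-based case, a minimal sketch with scikit-learn might look like the following; the 12 attributes and the synthetic target are purely illustrative, not a dataset mentioned in the conversation.

```python
# A minimal supervised-learning sketch on low-dimensional attribute data:
# a random forest mapping a few dozen static attributes to one outcome.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))          # 500 sites, 12 attributes each (hypothetical)
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out sites:", rf.score(X_test, y_test))
```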

Unsupervised machine learning seeks to dig the patterns in the data, tries to see how the data are organized. For example, the simplest algorithm here would be something like principal component analysis, right?

So we're trying to find out what is the direction that can explain the most variance in the data? So in a way that direction captures the main variability, right? The main direction of change. However, in the age of this AI, the weakly supervised machine learning is even more interesting. So in this setup, this is essentially how you train things like ChatGPT or BERT, these language models, where you have a super complicated neural network.

It is forced to kind of fill in the gap. Predicting the next token can be interpreted as one particular form of filling in a gap, right? So essentially, there's no differentiation between X and Y anymore. There's only X, right? There's only a set of inputs. And you can be given a long series of inputs, or a lot of inputs with some gaps in between.

And the model is forced to leverage what is provided to it to fill in the gap, to fill in what is missing. So this is called weakly supervised, or self-supervised, because it's essentially forcing the model to capture the internal distribution patterns, what we call the joint distribution of the data, right?

In fact, just this simple task gives rise to something as powerful as ChatGPT, and the fact that this happened is purely amazing, right? I think in the field of geosciences and hydrology, we have done a lot of supervised machine learning, we have done a decent amount of unsupervised machine learning, but we're still just getting to really understand how to leverage this weakly supervised paradigm.
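
A toy sketch of that masked "fill in the gap" training, assuming PyTorch; the merged 30-feature vector and the 15% masking rate are arbitrary choices for illustration, not a recipe from the conversation.

```python
# A minimal self-supervised sketch: hide random entries of a merged hydrologic
# vector (rainfall + weather + soil moisture) and train a small network to
# reconstruct them. There is no X/Y split; the data here are synthetic.
import torch
import torch.nn as nn

n_features = 30                               # merged forcing/state vector (assumption)
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(256, n_features)          # stand-in for real merged data
    mask = torch.rand_like(x) < 0.15          # hide ~15% of the entries
    x_masked = x.masked_fill(mask, 0.0)
    recon = model(x_masked)
    # Loss is computed only on the hidden entries the model had to infer.
    loss = ((recon - x)[mask] ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```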

[00:24:06] Bridget Scanlon: So, I mean, when you mentioned supervised, you have a training data set. So do you think rainfall runoff modeling would, so the rainfall could be the input and you train it with rainfall runoff data, and then you predict the runoff. That could be an example of supervised. 

[00:24:22] Chaopeng Shen: Exactly. So when you have rainfall temperature data and the outcome is stream flow or soil moisture, that's a clear supervised machine learning problem.

And for, for unsupervised or weakly supervised, you don't have this clear identification of X and Y. I can give you any part of the data. So let's just say you merge your rainfall and your weather data and the soil moisture data, you merge them together, they form one vector called X. So you can mask certain portions of it.

So I can ask a question in reverse; essentially I say, I observed the soil moisture pattern, can you tell me what the rainfall is? As the model learns how to fill in the gaps, it should know how to answer that question. Of course, sometimes the question is posed in the opposite direction, so it doesn't know exactly what to answer.

Actually, for those ill-posed problems, it turns out that the AI would be able to give you lots of different realizations. This is very similar to when you ask ChatGPT or these AIs to give you an image of what my son likes the most, a ninja, right? So every time you ask it, it gives you a different image of a ninja, because that problem, in a way, is ill-posed, which means there are infinite possible answers to that question.

You give it a crude hint, it leads to a lot of different refined responses. So we can do something similar in hydrology, although we haven't really done too much in that direction yet. 

[00:25:53] Bridget Scanlon: I mean, maybe we haven't applied weakly supervised learning very much in hydrology, but certainly I know my colleagues use a lot of ChatGPT, and you can see everything is "pivotal" or "heightened" or "nuanced." Yeah, you start to recognize the words that it uses quite frequently.

[00:26:14] Chaopeng Shen: Yeah. And honestly, I think this is also where the definition of trying to make the machine do something similar to what humans do is going to diverge from AI because the AI is going to clearly do something that humans cannot do.

It generates so many fascinating images from very abstract instructions. Some of these images are a challenge for humans to come up with, right? So it's going to become maybe fancier.

[00:26:43] Bridget Scanlon: Right. And I was surprised last fall when I broke my wrist at the airport, when I was supposed to be going to Australia and then I couldn't type very well.

So then I was using the language recognition part of the computer. I was very impressed with it. I thought it was really amazing.

[00:26:59] Chaopeng Shen: Oh, it was. In the beginning, I thought it was just like a parrot that knows how to parrot what I'm saying, but it's not. It's really like a thinking machine.

And with the first few sentences that I asked it to write, I thought, gosh, it writes so appropriately and so gently, and it even got some of these subtleties of how humans respond to certain questions, right? I think it got the whole picture. It's really amazing. I don't know, though. After some years, I feel like it's still not there for writing anything serious. It doesn't get the depth, not yet.

[00:27:34] Bridget Scanlon: But it might get students over writer's block and things like that.

[00:27:38] Chaopeng Shen: I think eventually this is sort of going to be our operating system. It's kind of unavoidable, right? But as I tell my students, I actually don't want them to use it when they are learning, because then the AI gets trained but they do not. Also, I still think we should try to write a lot of things ourselves, so we express our own feelings, not the feelings of the AI, right?

[00:28:02] Bridget Scanlon: Right. So, I mean, if we go back a little bit, the hydrologic community is traditionally using a lot of process-based modeling to understand what's going on.

It's constrained by theory and they have input data and they calibrate the models and they have parameterization and then they get the output, whether it's streamflow, groundwater levels, or soil moisture or whatever. So, but there are lots of limitations to process-based modeling and maybe AI can overcome some of those limitations.

And sometimes people think it's either/or: it's process-based modeling or AI. But really, I think maybe the strength could come from some sort of hybrid, like when we think about the energy transition: a fossil fuel car, or an electric vehicle, or a hybrid.

[00:28:53] Chaopeng Shen: Yeah, and that transition can take a long time. 

[00:28:56] Bridget Scanlon: Right, right.

And so maybe we can help the transition by using AI to parameterize process-based modeling, incorporating process-based modeling into AI like you are trying to do. So maybe you can describe some of those linkages and how you see it evolving. 

[00:29:11] Chaopeng Shen: Right. So before I dive into that, I would like to say that sometimes, when we examine how the world changes, if you look at it from above, it seems to be going in circles, it seems to be coming back to the same place.

But if you look at it from the side, it's actually a helix. It's spiraling up, because every time it comes back, it's actually in a different context. So this is going to rhyme with some of the things I want to discuss.

So regarding this, there are lots of limitations to our original process-based models, but there is also something good about them. I'm going to come to the good ones later. I think there are many challenges. One significant challenge is bias, because we have certain assumptions, an assumed structure of the model, and when the model has a structural bias or certain deficiencies and just doesn't behave the way nature behaves, you're going to see a bias in your model results. It's almost like when you try to put a chair together, but one of the legs is longer. So you would force it; you can barely hold it together, but it's always going to give you a large error. Something is always going to be off here and there, right? You force this side to stand upright, and then the other side starts to wobble, right?

And the other thing is, we have a lot of parameters in the model, and we run into this problem of non-uniqueness. Because of the computational expense, and the way we trained our models, we cannot confront the model with lots of data. The paradigm was to confront it with data from one basin or a limited number of basins or sites, right? So you run into this problem where different parameter sets give you the same response, so you cannot differentiate between them. The uncertainty there was huge. I remember when I started my PhD, I worked with one of these models, SWAT, and I was like, wow, I could change these parameters in 10 different ways and they gave me the same results. And when I actually calibrated 10 times using different random seeds, it gave me very different results. So I didn't know what to trust, right? And the other thing is, it can also saturate: as you tune some of the parameters, the learning ability of that model quickly saturates; all it can do is absorb a limited amount of information into those parameters. So those are the limitations.

But there are also strengths. I mean, the processes are clear, and it respects the mass balance law. So you have kind of a bare-minimum guarantee of physics there. If you take it to a place where there's no data, it can function in a bare-minimum sense, right?

And also, these kinds of models tend to extrapolate better. But actually, for scientists, the most valuable thing is probably that it has full interpretability. I can understand the whole thing, and I can even explain it to stakeholders, right? So I see that there are pros and cons.

So when we move to machine learning, it's on the other side of the spectrum, right? The biggest advantage it has is that it can simultaneously learn from a massive amount of data. So the relationship that it learns is robust, right? So we can be more confident in the conclusions that we make.

And then, because of the way the neural networks are structured, it can compute very efficiently. That actually allows it to grind through all this data and train a highly sophisticated neural network with a large number of weights. So just on a basic level, in our process-based model, how many parameters do you have? Maybe 12, right? Maybe 20. And that's already considered high dimensional. But consider something like the LSTM we trained, right? It has 500,000 weights, and ChatGPT probably has a trillion, and Microsoft said they were going to have a model with 56 trillion weights, right? So they differ by so many orders of magnitude.

And along with that comes the complexity of representing these real-world functions and the ability to actually model almost any function, right? To actually absorb a lot of information from data. 

[00:33:26] Bridget Scanlon: Right. So I guess what you're talking about then is the process-based modeling. It's following theory so we can interpret it and we can understand what's happening.

And then we can probably use it more in a predictive sense because we've got the theory. It may be biased, but then a purely AI model driven by data will come up with some sort of relationship to predict what's going on, but it may not conserve mass. 

[00:33:52] Chaopeng Shen: Yeah, it may also learn the wrong relationship. So this happens all the time.

So it can potentially give you very good results for the wrong reasons. The sensitivities might be wrong, right? And the reason is that in our world all of these things are co-varying, they're changing at the same time. So when the AI learns those relationships, it's not able to pick out which one is the causal relationship, right?

Which one is the true cause, which one is the effect? It doesn't have those physical concepts, right? So especially when you take this trained model and try to predict in a region where there's not much data coverage, it's likely to make mistakes.

[00:34:40] Bridget Scanlon: Right. But then you can ingest lots and lots of data and come up with relationships at high speed, of course, depending on your computer architecture. So there are differences between process-based modeling and the purely AI model. But then I think in some of your work you were trying to combine these.

Maybe you can describe the differentiable modeling approach that you are promoting. 

[00:35:05] Chaopeng Shen: Yeah. Thanks. Thanks for the lead. So I was actually thinking about the commonality and dissimilarity between these models. And I realized that really the one thing that is different is this differentiable programming that allows AI models to train massive numbers of weights.

Okay. So, if you look at an AI model, let's say supervised machine learning, right? We're trying to capture some relationship Y = f(X), right? And we tune some weights to match the observations. But we've been doing this for a long time with process-based models, right? You would have some physical parameters sitting there and you calibrate them.

But why is it that the AI models do it much better, right? Much more robust. And that is because they have access to this technology called differentiable programming, inspired, probably propelled, by the progress in AI, to assimilate tons of data, right? So with that, you have differentiable programming, which is a way to keep track of information so we can compute the gradients and do back propagation.

So the gradients of outputs with respect to inputs inside the model are chained together. And once you have such a system, you can back propagate through that entire system, sending the gradient information back through. And you can use this gradient-based optimization to train highly parameterized neural networks.

Right. So that is the fundamental technology behind neural networks. And I realized we do not need the entire package of AI to actually have this key, what I call a gene, right? This differentiable programming being a gene. We can actually take that gene and use it along with our process-based knowledge, and have all the good things from this gene, and also marry it, in a way, with the equations that we know, with the laws that we understand, right?

So we can have the best of both worlds. So I make a joke. It's like when we want a truly beautiful thing, like let's just say a young boy wants to propose to a girl and he doesn't need to marry the entire family, right? He only needs one thing out of this, I mean, that's a bad joke, but you get where I'm going.

So the idea back then, and of course some other domains started doing this as well, was to take the differentiable programming paradigm and make our models differentiable. Once our models are differentiable, they can be coupled to neural networks, and these neural networks can serve to model a missing link in the whole chain, right?

It can model a missing relationship. And that missing relationship can be a parameterization relationship, right? So it can model, let's say, how do I generate the parameters that these models need? It could also be a missing process, like how does rainfall translate into runoff, right?

Or how does water translate into ET, or resistance parameters, right? So we can learn that missing relationship and simultaneously grind through a massive amount of data. And all those good characteristics come along as long as you have this differentiable programming gene.
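
As a hedged illustration of this hybrid, differentiable idea (not the group's actual differentiable parameter learning code), the sketch below writes a tiny one-bucket water balance model in PyTorch and lets a small neural network map basin attributes to its two physical parameters, so gradients flow from the streamflow error back through the physics into the network. Every name and shape here is a made-up placeholder.

```python
# A minimal differentiable-modeling sketch: a neural network supplies the
# parameters of a simple bucket model, and everything trains end to end.
import torch
import torch.nn as nn

class ParamNet(nn.Module):
    """Maps static basin attributes to the bucket model's two parameters."""
    def __init__(self, n_attributes=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_attributes, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, attrs):
        raw = self.net(attrs)
        capacity = torch.nn.functional.softplus(raw[:, 0]) + 1.0   # storage capacity, > 1 mm
        k = torch.sigmoid(raw[:, 1])                               # recession coefficient in (0, 1)
        return capacity, k

def bucket_model(precip, pet, capacity, k):
    """Differentiable bucket: runoff = overflow + linear-reservoir release."""
    storage = torch.zeros_like(capacity)
    flows = []
    for t in range(precip.shape[1]):
        storage = torch.clamp(storage + precip[:, t] - pet[:, t], min=0.0)
        overflow = torch.clamp(storage - capacity, min=0.0)
        storage = storage - overflow
        release = k * storage
        storage = storage - release
        flows.append(overflow + release)
    return torch.stack(flows, dim=1)

param_net = ParamNet()
optimizer = torch.optim.Adam(param_net.parameters(), lr=1e-3)

# Dummy data: 16 basins, 200 days (placeholders for real forcings/observations).
attrs = torch.randn(16, 12)
precip = torch.rand(16, 200) * 10.0
pet = torch.rand(16, 200) * 3.0
observed_flow = torch.rand(16, 200) * 5.0

capacity, k = param_net(attrs)
simulated = bucket_model(precip, pet, capacity, k)
loss = nn.functional.mse_loss(simulated, observed_flow)
loss.backward()            # gradients flow through the physics back into ParamNet
optimizer.step()
```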

[00:38:21] Bridget Scanlon: Right. So I understand that you are now trying to apply this in the National Water Model, the next-generation National Water Model.

Is that correct? Chaopeng? 

[00:38:32] Chaopeng Shen: Yes. So that's a grand goal. We know the National Water Model is responsible for flood forecasting around the nation, keeping people safe, reducing damage. I think there's an opportunity to really improve it using this AI technology. So we're trying to incorporate both data-driven machine learning, like LSTM, and these differentiable models where you actually have neural networks to enhance your process-based models. In the end, we might get a next-generation National Water Model that is highly capable. With our benchmarks, we show that these differentiable models can be as strong as the purely data-driven models.

Or even better in data-sparse regions, and they can also provide you with full process clarity. So for the next generation, of course, we are trying to be one of the candidates for this model, right? I don't want to oversell what we're doing, but I think we're going to offer a lot to the American public, or the whole world.

Something that can forecast floods much better, even in a data-scarce region. And I want to add one point on differentiable programming. The differentiable models can utilize the automatic differentiation engine that current machine learning platforms provide, but you can also do it in different ways.

You can do it using what we call an adjoint operator, which runs through some mathematics to give you the gradients. Or you can do finite difference; whatever way provides those gradients would be enough. And the whole point is, once you are able to make this linkage, it's like you're providing wings to your model.

It's like what we say, adding wings to a tiger, or you can say that you can now connect a huge augmented brain to your own brain. Now you're able to do a whole lot more. So the entry ticket is this differentiable modeling, the ability to be differentiable. Right.
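
A small illustration of that last point, assuming PyTorch: the gradient of a toy simulation with respect to a parameter can come from automatic differentiation or from a finite-difference approximation, and either route supplies what gradient-based training needs.

```python
# Two ways to obtain the gradient a differentiable model needs.
import torch

def toy_model(theta):
    # Stand-in for a simulated output that depends on a parameter theta.
    return (theta ** 2 + torch.sin(theta)).sum()

theta = torch.tensor([1.5], requires_grad=True)

# 1) Automatic differentiation (what PyTorch/TensorFlow provide).
out = toy_model(theta)
out.backward()
grad_autodiff = theta.grad.item()

# 2) Finite-difference approximation of the same gradient.
eps = 1e-4
with torch.no_grad():
    grad_fd = (toy_model(theta + eps) - toy_model(theta - eps)).item() / (2 * eps)

print(grad_autodiff, grad_fd)   # the two estimates should nearly agree
```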

[00:40:30] Bridget Scanlon: Right. You started off in 2015, 2016 working on these topics. What do you think about workforce development now? I mean, are we teaching students in colleges the tools that they need to apply these techniques? We have a lot of different industries doing it, but the government agencies, are they taking it up?

Is there the workforce there? 

[00:40:59] Chaopeng Shen: To do it, yeah. From my perspective, I think there's a great shortage of AI education in non-computer-science majors, especially engineering; we need to enhance it by a lot. I think we are not doing what is needed, and a lot of departments around the nation are actually trying to make up for that deficiency.

So I think we need to enhance AI education, not only at the graduate level, which some faculty are trying to do here and there, but also at the undergraduate level. And that's a core deficiency, because the undergraduate curriculum never had this AI education. They learn some stats, I mean, they learn some basic concepts, but they don't really get any actual training to do AI, maybe because they are in engineering.

But the pattern is that AI is going to be incorporated as basic infrastructure, so engineering students do need to know it. So I see that we need to enhance our teaching at the undergrad level. We need to let the students know that this is important. And I do see some students who kind of self-select; they would think, "I cannot do AI," but actually they don't realize that

AI is probably easier than their fluid mechanics course. Yeah, really. So it's actually a generic technique you can apply to all these different domains. It's a very worthwhile thing to learn. But the other thing I want to comment on is that I've noticed some lack of diversity in the classrooms, especially in terms of gender, because if you check the background.

And I'm not blaming them, I'm pointing the finger at something other than them, okay? The source major here, computer science, has a huge gender imbalance, and that results from different reasons: because of stereotypes, because of some of these weed-out courses, because of some of the preconceived notions that students have.

So I do think that we need to bring more diversity into the classrooms in terms of gender and race, and we can achieve that. Actually, I have a grad student and she does very well; she's just awesome. I don't think there's any barrier in any way, shape, or form. And I think there's a lot of work for us to do.

[00:43:14] Bridget Scanlon: Right. Well, I really appreciate your taking the time to explain some of the fundamental concepts behind AI, ML, and deep learning, how you got into the field, how it's being applied in hydrology, and how we can have maybe a sort of hybrid, like the hybrid car, where we're combining process-based modeling with AI to optimize and get the advantages of both.

So thank you so much Chaopeng for your time. Take care. 

[00:43:41] Chaopeng Shen: Thank you a lot for the opportunity. Okay. Bye bye.

Thank you to our podcast partners: BEG, NAE, and JSG.