TD(1)

 Posts: 138
 Joined: Tue Feb 12, 2008 8:31 pm
 Contact:
TD(1)
I'll try to explain TD(1) as simply as possible.
The idea is to tune the evaluation function so that it predicts game outcomes as accurately as possible.
Depending on the game, the nature of the game outcome might be a bit different. Some Othello programs try to predict and maximize the final score. It is also possible to predict the probability of winning if the outcome is either win or loss. For games with draws, such as chess, the evaluation might return 3 outcome probabilities. TD(1) could work in any of these scenarios.
TD(1) works by playing plenty of self-play games that will serve as the training data. The more the better. It is not necessary to play very strong self-play games. For my Crazy Zero experiments, I usually use a search tree with about 400 nodes for each move. 3 or 4 plies of alpha-beta search might be enough for a traditional alpha-beta engine. It is important to avoid duplicates in the self-play games. Using a large set of random balanced starting positions might be a good way, or choosing opening moves at random.
Then, positions from this training data are sampled at random, and the parameters of the evaluation function are tuned by gradient descent to match the outcome of the game.
The principle of gradient descent consists of defining an error function, and updating weights to reduce that error.
When trying to predict the probability of winning, the error function is typically E = -log(probability_of_actual_result_according_to_evaluation_function). E is always positive. E is zero if the evaluation function predicted the result correctly, and can reach +infinity if the evaluation function predicted the wrong result with 100% certainty.
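As a small sketch in Python (with made-up numbers), the cross-entropy error for a win/loss prediction looks like this:

```python
import math

def cross_entropy_error(predicted_win_probability, game_was_won):
    # Probability that the evaluation assigned to the actual result
    p = predicted_win_probability if game_was_won else 1.0 - predicted_win_probability
    return -math.log(p)

# A confident correct prediction gives a small error:
print(cross_entropy_error(0.9, True))   # ~0.105
# A confident wrong prediction gives a large error:
print(cross_entropy_error(0.9, False))  # ~2.303
```

The error grows without bound as the predicted probability of the actual result approaches zero, which is what makes confidently wrong predictions so costly.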
When trying to predict a score, the quadratic error function is more common: E = (predicted_score - actual_score)^2. In fact, AlphaGo used the quadratic error function to learn its probability of winning. It is a bit unusual, but the quadratic error function may be used to learn the probability of winning as well.
Gradient descent works with a learning rate that indicates how much evaluation parameters will change at each update. It is usual to start learning with a high learning rate (the highest that does not produce divergence), and decay it during learning. Typical decay is exponential (for instance, divide the learning rate by 2 every hour for one day). The equation to update one parameter (w) of the evaluation function is like this:
w ← w - learning_rate * (dE/dw)
dE/dw is the gradient of the error with respect to w. It can be estimated approximately by (E(w + epsilon) - E(w - epsilon)) / (2 * epsilon). If you know how to compute the derivative, you may also replace this numerical approximation by a simpler formula.
In intuitive terms, this equation says: if increasing the evaluation parameter w a little also increases the error E, then decrease w a little. Conversely, if increasing the evaluation parameter w decreases the error E, then increase w a little.
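As a minimal Python sketch of the update rule with the central-difference gradient above (the quadratic toy error here is just a stand-in for the real error of a training position):

```python
def update_weight(w, error_at, learning_rate=0.01, epsilon=1e-4):
    """One gradient-descent step on a single parameter w.

    error_at(w) must return the error E for a given value of w;
    here it is a placeholder for evaluating a training position.
    """
    # Central-difference approximation of dE/dw
    gradient = (error_at(w + epsilon) - error_at(w - epsilon)) / (2 * epsilon)
    return w - learning_rate * gradient

# Toy example: E(w) = (w - 3)^2 is minimized at w = 3
w = 0.0
for _ in range(500):
    w = update_weight(w, lambda w: (w - 3.0) ** 2, learning_rate=0.1)
print(w)  # converges near 3.0
```

In a real tuner, `error_at` would re-evaluate a sampled position with the perturbed weight and compare the prediction to the game outcome.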
postscriptum 1: traditional hand-made heuristics for alpha-beta engines were not always designed to predict the game outcome. For instance, it is usual in chess to use an evaluation function in centipawns. In order to apply TD(1) to such an evaluation function, it is necessary to convert such a value to a probability of winning (between 0 and 1). The usual way to do it is to apply the logistic function to the centipawn evaluation:
probability_of_winning = 1.0 / (1.0 + exp(-evaluation * scale))
With scale being a well-chosen constant.
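In Python, with an illustrative scale value (the 0.004 below is only a guess; the right constant must be fitted to the engine's own evaluations):

```python
import math

def winning_probability(centipawn_eval, scale=0.004):
    # Logistic squashing of a centipawn score into [0, 1].
    # scale=0.004 is an assumed value for illustration only.
    return 1.0 / (1.0 + math.exp(-centipawn_eval * scale))

print(winning_probability(0))    # 0.5: an equal position
print(winning_probability(100))  # a bit above 0.5: up a pawn
```

A natural way to choose scale is to fit it by the same gradient descent as the other parameters.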
postscriptum 2: the weight update equation can be refined by using a different learning rate for each weight, adaptively tuned to how sensitive the error is to this weight. See for instance the Adam algorithm (https://arxiv.org/abs/1412.6980).
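A minimal single-parameter Adam update might look as follows; the hyperparameter values are the defaults from the cited paper, and the quadratic toy error is again only a stand-in:

```python
import math

def adam_step(w, gradient, state, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight.

    state holds the running moment estimates and step count.
    """
    state["t"] += 1
    # Exponential moving averages of the gradient and squared gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient ** 2
    # Bias correction for the zero-initialized moments
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = 0.0
for _ in range(5000):
    w = adam_step(w, 2 * (w - 3.0), state, learning_rate=0.01)
print(w)  # approaches the minimum of (w - 3)^2 at w = 3
```

The division by sqrt(v_hat) is what gives each weight its own effective learning rate, scaled by how large its gradients have recently been.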
Re: TD(1)
Hello there.
I don't know if this is the right place to ask my question, but I'll try anyway (feel free to remove this message if it doesn't belong). I have trouble interacting with people, either online or offline; I often do things the wrong way, drive people crazy, and get banned. As might already be obvious.
So here is my question for you. It looks more and more like I stumbled upon an alternative to gradient descent. I wasn't thrilled by the gradient descent algorithm, to say the least, and because after all I only have one life, I thought I would just do my own stuff, starting back from the 1960s perceptron and how I believe/feel the biological world works. Because I can't / don't want to interact with people, both for their well-being and mine, I don't have much of a way to get information from the outside. I wasn't even able to convince myself that my work has some truly unique and original components to it; I'm about 70% convinced of it, though. Hence I resort to asking you for advice, given that you have explored a lot of subjects and certainly know how real-life stuff works, from the [Ethical / Intellectual-Objective / Realistic-Economic] 3D space vector perspective.
I still have a lot of work to do before I'm happy with my set of software principles, so I'm not yet fully ready, but I believe I might be in the coming 12 months if I keep working steadily. At that point I expect to eventually have a way to learn mostly continuous medium-dimensional functions [exactly like a deep neural network not using convolutions, but a direct input/output mapping]. It's still hypothetical, and I will most certainly go on as far as I can get, whether it earns me something or not. But I'm wondering: what could I do to earn something out of it? I don't need much from life, mostly eating and shelter. Still, having a bit of money, maybe 500 000 euros, might be a good thing; it would help me continue to be free to do my own things, and relieve the risk of becoming homeless.
So, let's say I have my universal approximator: how would I convince someone (let's say you) that it has value, without showing the exact process the main algorithm uses (in case it performs well enough to help me reach my 500 000 euros life goal somehow)? Do you think there is a way for an outsider without diplomas or any connections to contribute to academia, and how would you go about it? Given that I don't think I can directly interact with people on a regular basis, and that I still want to make money out of my work if it is at all possible.
I live and was born in France.
PS: I came across your thesis on the swimmers. I believe it answers part of the question I asked here (how to convince someone that my system is capable of approximating a function). On top of that, it gives me a basis for the simulator, plus a point of comparison.
2002 wrote: Unlike the 3-segment instabilities of Figure 7.5, the 5-segment swimmer did not manage to recover from this crisis, and its neural network blew up completely. What happened exactly is still a mystery.

Re: TD(1)
A way to get recognition for a good machine-learning system is to win competitions. Some don't require that you disclose your code or method. Kaggle has competitions with big money prizes:
https://www.kaggle.com/competitions
Re: TD(1)
Rémi Coulom wrote: ↑Tue Jan 19, 2021 4:56 pm
A way to get recognition for a good machine-learning system is to win competitions. Some don't require that you disclose your code or method. Kaggle has competitions with big money prizes:
https://www.kaggle.com/competitions

Yes, I did have a look at it. Did you ever win any money out of it, or do you personally know any people who did? It looks a bit like a blend between bitcoin mining and the lottery from where I stand right now. I guess it would fit my need to avoid contact with people, though. But then current models are well understood and optimized, and competitors are able to use whatever hardware they want, if I understood correctly. Also, they work in teams, with a hard deadline. So even if I were to build a setup, I don't really see any future where it would be worth the try.

Currently I have the computing power of a laptop at my disposal, and even then I would only use one single core. More in line with what you had to work with in 2002, I suppose, than with current standards. I do have access to a family computer that has a 2080, but I really don't see myself convincing them to let me install Linux and then run it for days, nor do I believe that time spent with CUDA would be very well invested right now.

I'm under the impression that building a poker bot would have more hope of winning cash than competing on Kaggle. It seems more accessible, too. Although I wonder if online poker still exists, and if so for how long, given that a gaming computer can clearly run a bot that would beat most humans. Also, even then it's not clear whether I could build a bot easily with reasonable hardware. It seems like a lot of work and risk. I guess it would take me at least 3 years to get somewhere. I most certainly would first try to build a bot for a simplified poker-like game (with orders of magnitude less branching), in order to assess my chances.
I'm very intrigued by your swimmers and the driving competitions that existed when you wrote your thesis, though. Do you know if that kind of competition built around small-scale robust control still exists? If I could try my algorithms on such a competition while using a single core, it would be in line with my current goal of proving that my system works, at least to myself, and maybe to others. And it might also be fun. I could even send some code if that is required, if at least it gets me somewhere. Not doing anything at all is certainly worse in the end than losing my ideas to some ethically weak-minded people out there.

Re: TD(1)
The successor to RARS is TORCS:
http://www.berniw.org/trb/
It is more complex, but also more interesting. I made a TORCS driver during my thesis, too. It was fun.
Re: TD(1)
Rémi Coulom wrote: ↑Tue Jan 19, 2021 8:26 pm
The successor to RARS is TORCS:
http://www.berniw.org/trb/
It is more complex, but also more interesting. I made a TORCS driver during my thesis, too. It was fun.

Now that looks exactly like the kind of thing I might try. Thanks a lot! I'll be sure to post some updates if I'm able to do something showable. Is there a place on this forum where a TORCS bot discussion would fit?

Re: TD(1)
Feel free to create your TORCS thread in the General Discussion (or here if you use tree search, but it is unlikely).
Re: TD(1)
You wrote this at the end of your paper ESANN2004.pdf.
High-Accuracy Value-Function Approximation with Neural Networks Applied to the Acrobot wrote: A challenge for the future would be to try to design algorithms that combine the data-efficiency of local linear approximators with the approximation ability of feedforward networks.
Has progress been made on this topic? How exactly would you measure/compare the data-efficiency of the new algorithm?

Re: TD(1)
I have not kept up-to-date with this research topic, but I know it has been studied a lot. I found this with a quick Google search:
https://sites.google.com/site/dataefficientml/
Self-supervised learning is a related recent trend. You can probably find a lot more by searching the web.