
I am a beginner in reinforcement learning.
Following on from my previous question, I'm asking again because I'm still not sure about a few things.

I am trying to make a solver for an original board game (the search space is very wide like Go).

  1. I'm thinking of implementing it using DQN. When training, should I train on self-play game data from other solvers (for example, a MiniMax-based one) that I have saved to a file? Or should I adjust the network on the fly while playing, like a playout?

  2. I also thought shogi would be a helpful reference, so I looked into how shogi programs are implemented. However, shogi has excellent teacher data to begin with, while this is an original game with no teacher data, so I don't know how to implement it. I'm fairly sure I will use reinforcement learning, but I don't know what kind of data is needed or what kind of processing is required, so I would appreciate any advice. It would also help if you could point me to useful competitions and important keywords.

There is still room to think about whether a machine-learning approach is the right one at all, but first of all I would like to understand conceptually what kind of processing is needed.

  • Answer # 1

      

    Using DQN

    DQN seems to be used mainly for video games that cannot be handled by look-ahead search such as MiniMax or Monte Carlo methods.

      

    Because this is an original game, there is no teacher data, and I don't know how to implement it.

    The idea of reinforcement learning is that, when you have no teacher data, you learn from the final result.
    For example, play a game and record every position that appeared during it.
    If you win, give all of those positions the "correct" label and train on them.
    If you lose, do the reverse.
    Not every position in a won game is actually good, but by repeating this you can find winning patterns.
    The nature of the data is different, but it is roughly the same as plain supervised learning.
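
    As a concrete picture of that loop, here is a minimal Python sketch. The game interface (encode(), play(), is_over(), result(), current_player) and choose_move are hypothetical names assumed for illustration, not part of any particular library.

        def self_play_game(game, choose_move):
            """Play one game and label every position with the final outcome.

            `game` and `choose_move` are hypothetical placeholders: any board
            representation with encode()/play()/is_over()/result() will do.
            """
            history = []
            while not game.is_over():
                history.append((game.encode(), game.current_player))
                game.play(choose_move(game))   # random at first, search-guided later

            z = game.result()                  # +1 if player 0 won, -1 if player 0 lost
            # Label every position with the result from the viewpoint of the player
            # to move: win -> +1 ("correct"), loss -> -1 (the reverse).
            return [(pos, z if player == 0 else -z) for pos, player in history]

    Repeating this over many games yields (position, label) pairs that can be trained on just like an ordinary supervised-learning dataset.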

    I don't know exactly what the game your solver targets looks like, but I'll touch on the implementation a little more, assuming it is a board game like Othello, chess, or shogi.

    1. Use the tanh function for the output layer.
    An output of -1 means your win rate is 0%; an output of 1 means it is 100%.
    It gives a stronger gradient than the sigmoid function, so it is easier to train.
    However, as the weights grow the output saturates to only -1 or 1, so using the L2 norm (weight decay) or BatchNormalization is recommended.
    AlphaGoZero uses both. (A code sketch covering points 1-3 follows after this list.)

    2. Have the network learn the look-ahead evaluation value produced by MiniMax (similar to the Q value in DQN).
    The goal is to be able to evaluate positions correctly with less search.
    Training happens either every time a move is played or at the end of each game.
    In AlphaGoZero, learning is performed every 8 searches using the Monte Carlo method (the PUCT algorithm).
    (The teacher data size is 8 * 19 * 19, and the total number of searches per move is 1,600.)
    At this point the position is also rotated randomly, so the network is taught that a rotated position is still the same position.

    3. Learn from the final result. (The correct-answer label is -1 for a loss and 1 for a win.)
    Training happens after every game, or after every fixed number of games.
    AlphaGoZero performs mini-batch training using all positions from every 500,000 games.
    (The optimization method is Momentum, the mini-batch size is 2,048, and the number of iterations is 1,000.)
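
    To make points 1-3 concrete, here is a minimal PyTorch sketch under my own assumptions (the board size, network shape, and every name below are mine, not taken from AlphaGoZero): a value network with BatchNormalization in the body and a tanh output, L2 regularization via weight decay, random-rotation augmentation, and one Momentum mini-batch update on the -1/+1 outcome labels.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        BOARD = 9  # assumed board size; replace with your game's dimensions

        class ValueNet(nn.Module):
            """Small value network: BatchNorm in the body, tanh on the output (point 1)."""
            def __init__(self, channels=32):
                super().__init__()
                self.conv1 = nn.Conv2d(1, channels, 3, padding=1)
                self.bn1 = nn.BatchNorm2d(channels)
                self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
                self.bn2 = nn.BatchNorm2d(channels)
                self.head = nn.Linear(channels * BOARD * BOARD, 1)

            def forward(self, x):
                x = F.relu(self.bn1(self.conv1(x)))
                x = F.relu(self.bn2(self.conv2(x)))
                return torch.tanh(self.head(x.flatten(1)))  # -1 = certain loss, +1 = certain win

        def augment(boards):
            """Random rotation: only valid if your game is rotation-symmetric (point 2)."""
            k = int(torch.randint(0, 4, (1,)))
            return torch.rot90(boards, k, dims=(2, 3))

        def train_step(net, optimizer, boards, labels):
            """One mini-batch update on -1/+1 outcome labels (point 3)."""
            optimizer.zero_grad()
            pred = net(augment(boards)).squeeze(1)
            loss = F.mse_loss(pred, labels)
            loss.backward()
            optimizer.step()
            return loss.item()

        net = ValueNet()
        # weight_decay provides the L2 regularization; momentum matches the optimizer named above
        optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

        # Placeholder data: in a real loop, boards and labels come from self-play positions.
        boards = torch.randn(16, 1, BOARD, BOARD)
        labels = (torch.rand(16) > 0.5).float() * 2 - 1
        print(train_step(net, optimizer, boards, labels))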

    So the pattern is: learn the evaluation value obtained by look-ahead search, and also learn from the final result, each at some timing. I don't think you need to imitate this part exactly; the details can be decided by trial and error or personal preference.

    By the way, the link below reports that better results than AlphaGoZero were obtained by changing the balance between learning the Q value (the look-ahead search evaluation) and learning the game result.
    http://tadaoyamaoka.hatenablog.com/entry/2018/07/01/121411
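
    As a formula, that balance amounts to a mixed training target; the small sketch below uses my own names (lam for the mixing weight) and is not taken from the linked article.

        def mixed_target(z, q, lam=0.5):
            """z: final game result (-1/+1); q: look-ahead search evaluation of the position.
            lam = 1.0 trains only on the result, lam = 0.0 only on the Q value."""
            return lam * z + (1.0 - lam) * q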

  • Answer # 2

      

    I am trying to make a solver for an original board game (the search space is very wide like Go).

    If you are not yet sure about much of this, you should start with an existing game such as Othello.


      

    I'm thinking of implementing it using DQN, but when training, should I train on data from other solvers such as the MiniMax method saved in a file? Or should I adjust the network on the fly while playing, like a playout?

    If you don't implement MinMax yourself, you won't appreciate how good the machine-learning techniques really are.
    Unless you first try to make something stronger than MinMax, you won't end up with a game that has a genuinely clever machine-learning AI.
    AlphaGoZero is said to have surpassed AlphaGo through self-play alone, so it is possible in principle, but that was only achievable because there were many experts and a lot of know-how. For the time being you should work with teacher data and get a feel for parameter tuning; don't expect that sense to come in one shot.

      

    In addition, I thought shogi would be a helpful reference, and I looked into how it is implemented. However, shogi has excellent teacher data to begin with, while this is an original game with no teacher data, so I don't know how to implement it. I'm fairly sure I will use reinforcement learning, but I don't know what kind of data is needed or what kind of processing is required, so I would appreciate any advice. It would also help if you could point me to useful competitions and important keywords.

    DQN is itself a kind of reinforcement learning, so you can go ahead with that.

    Introductory book ↓
    http://incompleteideas.net/book/bookdraft2017nov5.pdf

    Light overview ↓