How discrete SAC differs from continuous SAC

About_work · October 13, 2023

Reinforcement Learning

Things to keep in mind

Aim to build up knowledge.
Analyze root causes through careful thought.
Compare against open-source implementations and Jinyoung's code.

What is different

policy network
- ~~Check the policy network code~~
- Change the logits to produce 25 outputs and train.
- Compare with open-source code
~~ez greedy~~
- ~~Check the code~~
~~actions_to_commands~~
- ~~Check the code~~
~~value loss~~
- ~~Check the code~~
~~actor loss~~
- ~~Check the code~~
~~temperature loss~~ (no problem found!)
- ~~Check the code~~
- ~~target entropy~~
Instead of a copy, train with a moving-average update.
Re-check PopArt
- Review how PopArt works
- Look for conceptual problems in the current code
- Compare with Jinyoung's code.
Check the learning rate in Jinyoung's old code
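
The main structural difference from continuous SAC is that a discrete policy outputs a probability for every action, so the expectations in the actor loss and the soft value can be computed exactly instead of by sampling. The sketch below illustrates that idea using the commonly used discrete SAC formulation, not the code discussed in these notes. The function names (`softmax`, `discrete_actor_loss`, `soft_state_value`), the five-action setup, and the Q-values are illustrative assumptions; `alpha = 0.4` simply reuses the value alpha converged to in these runs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrete_actor_loss(logits, q_values, alpha):
    """Exact expectation over actions of alpha * log pi(a|s) - Q(s, a)."""
    probs = softmax(logits)
    return sum(p * (alpha * math.log(p) - q) for p, q in zip(probs, q_values))


def soft_state_value(logits, target_q_values, alpha):
    """V(s') = sum_a pi(a|s') * (Q_target(s', a) - alpha * log pi(a|s'))."""
    probs = softmax(logits)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(probs, target_q_values))


# Hypothetical single state with 5 actions and a uniform policy.
logits = [0.0, 0.0, 0.0, 0.0, 0.0]
q_values = [1.0, 2.0, 3.0, 4.0, 5.0]
alpha = 0.4

actor_loss = discrete_actor_loss(logits, q_values, alpha)
value = soft_state_value(logits, q_values, alpha)
```

No reparameterization trick is needed here: the gradient flows through the action probabilities directly, which is why the policy head emits logits rather than a mean and standard deviation.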

Approach

Run without PopArt
- company desktop 0
- Branch: test/discrete-sac-without-popart
- ./nl_navigation.sh test/discrete-sac-without-popart 10152322 0
- Result
  - path rose only very slightly
  - collision dropped a little.
Train without the temperature loss
- company desktop 1
- This amounts to not regulating exploration at all.
- Branch: test/discrete-sac-without-temperature
- ./nl_navigation.sh test/discrete-sac-without-temperature 10152325 1
- Result
  - path dropped as well, and collisions increased enormously.
  - The temperature loss is necessary.
Run with learning_rate at 1/10 of the original.
- home old desktop
- Branch: test/discrete-sac-with-0.1-lr
- ./nl_navigation.sh test/discrete-sac-with-0.1-lr 10171022 0
Run with a moving average instead of a copy
- home young desktop
- Branch: test/discrete-sac-moving-average
- ./nl_navigation.sh test/discrete-sac-moving-average 10171023 1
Switch the policy network to 25 outputs and run
- Machine: not decided yet
- Branch: test/discrete-sac-logits-25-not-10
Fetch the tensorboard logs
- python scp.py 0:test/discrete-sac-with-0.1-lr 1:test/discrete-sac-moving-average
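
The "copy" versus "moving average" experiment above can be pictured with the sketch below. It assumes the comparison is about how the target network is refreshed: a hard copy overwrites the target with the online weights, while a moving average blends them a little on every step. The function names and `tau = 0.005` are hypothetical, and parameters are plain lists of floats to keep the example self-contained.

```python
def hard_update(target_params, online_params):
    """Copy: the target becomes an exact snapshot of the online parameters."""
    return list(online_params)


def soft_update(target_params, online_params, tau):
    """Moving average: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


online = [1.0, -2.0]
target = [0.0, 0.0]
tau = 0.005  # hypothetical smoothing coefficient

# After 100 soft updates the target has only moved part of the way.
for _ in range(100):
    target = soft_update(target, online, tau)

snapshot = hard_update([0.0, 0.0], online)
```

With a moving average the bootstrapped targets drift smoothly, whereas a periodic copy makes them jump each time the snapshot is refreshed.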

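Conclusions from the approach

Note: all of this should be rerun once training reaches a phase where it works better.
PopArt appears to be working correctly.
The moving average appears to work better.
The temperature loss is needed in order to bring alpha down.
- alpha is initialized so that the agent explores a lot at first.
- Even with a target of 0.98 × maximum_entropy, the starting entropy seems to be higher.
Lowering the learning rate to 0.1× has little effect. (The original 3e-4 trains faster, so the original is better.)
Switching the policy network to 25 outputs -> not done yet.
alpha converged to about 0.4 -> it seems to be exploring too much!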
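
Why the temperature loss is what lowers alpha can be seen from the direction of its gradient. The sketch below assumes the common implementation in which `log_alpha` is the learned parameter and the loss is `log_alpha * (H(pi) - target_entropy)`; the code behind these notes may differ. The five-action setup, the learning rate, and the example policies are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_step(log_alpha, probs, target_entropy, lr):
    """One gradient-descent step on log_alpha * (H(pi) - target_entropy)."""
    grad = entropy(probs) - target_entropy  # derivative w.r.t. log_alpha
    return log_alpha - lr * grad


num_actions = 5  # hypothetical
target = 0.98 * math.log(num_actions)

uniform = [0.2] * num_actions            # entropy above the target
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]  # entropy far below the target

after_uniform = temperature_step(0.0, uniform, target, lr=0.1)
after_peaked = temperature_step(0.0, peaked, target, lr=0.1)
```

A near-uniform policy has entropy above the target, so `log_alpha` decreases; a peaked policy pushes it back up. Without this loss alpha stays at its initial value, which matches the observation that exploration is then not regulated.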

Approach 2: try different values of target_entropy_ratio!

target_entropy_ratio 0.75
- company desktop 0
- Branch: test/discrete-sac-new-alpha-0.75
- ./nl_navigation.sh test/discrete-sac-new-alpha-0.75 10191819 0
target_entropy_ratio 0.5
- company desktop 1
- Branch: test/discrete-sac-new-alpha-0.5
- ./nl_navigation.sh test/discrete-sac-new-alpha-0.5 10191820 1
target_entropy_ratio 0.25
- home_young_desktop
- Branch: test/discrete-sac-new-alpha-0.25
- ./nl_navigation.sh test/discrete-sac-new-alpha-0.25 10191821 0
Fetch the tensorboard logs
- python scp.py 0:test/discrete-sac-new-alpha-0.75 1:test/discrete-sac-new-alpha-0.5 2:test/discrete-sac-new-alpha-0.25
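
To get a feel for what these ratios mean, the sketch below converts each one into a target entropy and into an "effective number of actions" (the exponential of the entropy). It assumes that `target_entropy_ratio` multiplies the maximum entropy `log(num_actions)`, which is the usual convention but is not confirmed by these notes; `num_actions = 5` is a hypothetical value.

```python
import math


def target_entropy(num_actions, ratio):
    """Target entropy in nats, as a fraction of the maximum log(num_actions)."""
    return ratio * math.log(num_actions)


def effective_actions(entropy_nats):
    """Number of equally likely actions that would give this entropy."""
    return math.exp(entropy_nats)


num_actions = 5  # hypothetical
ratios = (0.98, 0.75, 0.5, 0.25)

targets = {r: target_entropy(num_actions, r) for r in ratios}
effective = {r: effective_actions(t) for r, t in targets.items()}

for r in ratios:
    print(f"ratio={r:.2f}  target={targets[r]:.3f} nats  ~{effective[r]:.2f} actions")
```

Under this assumption a ratio of 0.98 asks the policy to stay almost uniform, while 0.25 lets it concentrate on roughly one or two actions.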

Approach 3: use moving average + no ez greedy as the base

discrete-sac-no-ez-greedy
- company desktop 0
- ./nl_navigation.sh test/discrete-sac-no-ez-greedy 10211058 0
discrete-sac-7-action-number
- company desktop 1
- ./nl_navigation.sh test/discrete-sac-7-action-number 10212327 1
- Fetch the tensorboard logs
  - python scp.py 0:test/discrete-sac-no-ez-greedy
discrete-sac-min-target-entropy
- home_old_desktop
- ./nl_navigation.sh test/discrete-sac-min-target-entropy 10220034 0
Fetch the tensorboard logs
- python scp.py 0:test/discrete-sac-no-ez-greedy 1:test/discrete-sac-7-action-number 2:test/discrete-sac-min-target-entropy

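Other things to check

~~self.previous action: shape (b, action_dim); is it being fed in correctly?~~
- It is the action from the immediately preceding step:
  - continuous: every dim of action_dim holds a value between -1 and +1
  - discrete: every dim of action_dim holds a value between 0 and action_number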
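
The two encodings of the previous action live on different scales, which matters if the same network input is reused for both. The sketch below shows one hypothetical way to map a discrete index onto the continuous range; it is not the `actions_to_commands` function mentioned earlier, and it assumes indices run from 0 to `action_number - 1`.

```python
def index_to_command(index, action_number):
    """Map an index in {0, ..., action_number - 1} to an evenly spaced value in [-1, 1]."""
    if action_number < 2:
        raise ValueError("action_number must be at least 2")
    if not 0 <= index < action_number:
        raise ValueError("index out of range")
    return -1.0 + 2.0 * index / (action_number - 1)


# Hypothetical example: 5 discrete choices per action dimension.
commands = [index_to_command(i, 5) for i in range(5)]

# A previous action of shape (action_dim,) with action_dim = 2.
previous_action = [3, 0]
previous_command = [index_to_command(a, 5) for a in previous_action]
```

Feeding raw indices and feeding normalized values are both workable; what matters is that the choice is consistent between data collection and training.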
