TensorFlow Multi-GPU Profiling

odobenus · September 7, 2021

DL profiling


Introduction

Setup

Launch an AWS g4dn.12xlarge instance (4 NVIDIA T4 GPUs)
Connect to the instance over SSH
Check the available conda environments


conda info --envs

Activate the virtual environment


source activate tensorflow2_latest_p37

Check GPU status


nvidia-smi

Check that multiple GPUs are visible


import tensorflow as tf
# List every GPU that TensorFlow can see.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    print('Name:', gpu.name, ' Type:', gpu.device_type)

Check the GPU device name


import tensorflow as tf
# Prints the name of a GPU device if one is available, otherwise an empty string.
print(tf.test.gpu_device_name())

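Set up parallel training on specific GPUs

  • Data parallel: each GPU holds a full replica of the model and processes its own slice of every batch
  • Synchronous: gradients are aggregated across the replicas before each weight update

import tensorflow as tf

# Restrict training to the first two GPUs.
use_gpus = ['/device:GPU:0', '/device:GPU:1']
strategy = tf.distribute.MirroredStrategy(devices=use_gpus)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Variables must be created inside the strategy scope to be mirrored across GPUs.
# build_model() and my_optimizer are not shown in this post.
with strategy.scope():
    model = build_model()
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=my_optimizer,
                  metrics=['accuracy'])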
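
To make "data parallel" and "synchronous" concrete, here is a toy sketch in plain Python, without TensorFlow. It is not how MirroredStrategy is implemented; it only illustrates the arithmetic: each replica computes a gradient on its own shard, the gradients are averaged (standing in for the all-reduce), and every replica applies the same update. The model (y = w * x), the data, the learning rate, and all function names are made up for illustration, and the shards are assumed to be equal in size.

```python
# Toy illustration of synchronous data parallelism in plain Python (no TensorFlow).
# Model: y = w * x with mean squared error. All numbers are made up.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def split(items, num_replicas):
    """Split a global batch into equal contiguous shards, one per replica."""
    shard = len(items) // num_replicas
    return [items[i * shard:(i + 1) * shard] for i in range(num_replicas)]

def sync_step(w, xs, ys, num_replicas, lr):
    """One synchronous step: per-replica gradients, averaged, one shared update."""
    x_shards = split(xs, num_replicas)
    y_shards = split(ys, num_replicas)
    replica_grads = [grad_mse(w, xs_r, ys_r)
                     for xs_r, ys_r in zip(x_shards, y_shards)]
    avg_grad = sum(replica_grads) / num_replicas  # stands in for the all-reduce
    return w - lr * avg_grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2
w_two_gpus = sync_step(0.0, xs, ys, num_replicas=2, lr=0.01)
w_one_gpu = sync_step(0.0, xs, ys, num_replicas=1, lr=0.01)
print(w_two_gpus, w_one_gpu)
```

With equal shards, the averaged per-replica gradient matches the gradient of the whole batch, so the two-replica step and the single-device step land on the same weight.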

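Launch TensorBoard


# Start TensorBoard in the background, reading logs from ./logs
tensorboard --logdir=./logs &
# When finished, find the TensorBoard process and kill it
ps -ef | grep tensorboard | grep -v grep | awk '{print $2}' | xargs kill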
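
TensorBoard only shows what the training script writes into ./logs. One possible way to produce profiling data there is the Keras TensorBoard callback with its profile_batch argument. The sketch below is an assumption about how this could be wired up, not the code used in this post: the timestamped directory naming, the helper names, and the profiled batch range (10 to 20) are all hypothetical choices. Viewing the resulting Profile tab may also require installing the tensorboard-plugin-profile package.

```python
import os
from datetime import datetime

def make_log_dir(base="./logs", now=None):
    """Return a timestamped run directory under base (naming scheme is an assumption)."""
    now = now or datetime.now()
    return os.path.join(base, now.strftime("%Y%m%d-%H%M%S"))

def make_profiling_callback(log_dir, first_batch=10, last_batch=20):
    """Build a Keras TensorBoard callback that profiles a range of batches."""
    # Imported here so make_log_dir stays usable without TensorFlow installed.
    import tensorflow as tf
    return tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        profile_batch=(first_batch, last_batch),  # batch range is an arbitrary choice
    )

log_dir = make_log_dir()
print(log_dir)
# Hypothetical usage, assuming `model` and `train_ds` already exist:
# model.fit(train_ds, epochs=1, callbacks=[make_profiling_callback(log_dir)])
```

Writing each run to its own subdirectory of ./logs keeps runs separate in the TensorBoard run selector while the tensorboard --logdir=./logs command stays unchanged.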

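SSH tunneling for TensorBoard


# Run on the local machine: forward local port 6006 to TensorBoard on the instance,
# then open http://localhost:6006 in a browser
ssh -i my-aws-key.pem -NL 6006:localhost:6006 ubuntu@ec2-public-dns

Check GPU utilization


# Refresh the nvidia-smi output every 0.5 seconds
watch -n 0.5 nvidia-smi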
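
To record GPU utilization from a script instead of watching it interactively, nvidia-smi can also emit machine-readable CSV through its --query-gpu and --format options. The following is a minimal sketch under that assumption; the helper names are hypothetical, the sample text is made up so the parser can be tried on a machine without a GPU, and fields that a driver reports as non-numeric are not handled.

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows

def read_gpu_stats():
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Made-up sample output for two GPUs, used so the parser runs anywhere.
sample = "0, 87, 10240, 15109\n1, 85, 10198, 15109\n"
stats = parse_gpu_csv(sample)
print(stats)
```

On the instance itself, calling read_gpu_stats() in a loop during training would give one reading per GPU, which makes it easy to see whether both devices passed to MirroredStrategy are actually busy.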

Code

Result

Conclusion

References
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
https://keras.io/guides/distributed_training
https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-tensorboard
