Launch an AWS g4dn.12xlarge instance (4 NVIDIA T4 GPUs)
Connect to the instance with SSH
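If you prefer to script this step, the sketch below uses boto3; the region, AMI ID, and key-pair name are placeholder assumptions (any Deep Learning AMI that ships the tensorflow2_latest_p37 environment will do), not values taken from this note.
import boto3

# Sketch only: region, AMI ID, and key-pair name are placeholders.
ec2 = boto3.client('ec2', region_name='us-east-1')
resp = ec2.run_instances(
    ImageId='ami-xxxxxxxxxxxxxxxxx',   # placeholder Deep Learning AMI ID
    InstanceType='g4dn.12xlarge',      # 4x NVIDIA T4 GPUs
    KeyName='my-aws-key',              # matches my-aws-key.pem used for SSH below
    MinCount=1,
    MaxCount=1,
)
instance_id = resp['Instances'][0]['InstanceId']

# Wait until the instance is running, then print the SSH command to connect.
ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])
desc = ec2.describe_instances(InstanceIds=[instance_id])
dns = desc['Reservations'][0]['Instances'][0]['PublicDnsName']
print('ssh -i my-aws-key.pem ubuntu@' + dns)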
Check available conda environments
conda info --envs
Activate the TensorFlow 2 virtual environment
source activate tensorflow2_latest_p37
Check GPU status
nvidia-smi
Check multi-GPU availability
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print('Name:', gpu.name, ' Type:', gpu.device_type)
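On a g4dn.12xlarge the loop above should print four GPUs (GPU:0 through GPU:3). If you want TensorFlow to see only a subset of them, one option is tf.config.set_visible_devices; the following is an assumption-level sketch, not part of the original workflow.
import tensorflow as tf

# Expose only the first two physical GPUs to TensorFlow. This must run before
# the GPUs are initialized (i.e. before building any model or running any op).
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[:2], 'GPU')
print(tf.config.list_logical_devices('GPU'))   # expect only GPU:0 and GPU:1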
Check the GPU device name
import tensorflow as tf
tf.test.gpu_device_name()
Set up parallel training with specific GPU names
import tensorflow as tf
use_gpus = ['/device:GPU:0', '/device:GPU:1']
strategy = tf.distribute.MirroredStrategy(devices=use_gpus)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
with strategy.scope():
    model = build_model()
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=my_optimizer,
                  metrics=['accuracy'])
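build_model() and my_optimizer above are placeholders. The sketch below fills them in with a throwaway Dense model and dummy data purely for illustration, and shows the usual pattern of scaling the global batch size by strategy.num_replicas_in_sync before calling model.fit.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=['/device:GPU:0', '/device:GPU:1'])

# Scale the global batch size by the replica count so each GPU sees 64 samples.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Dummy 10-class dataset standing in for real training data.
x = tf.random.normal([1024, 32])
y = tf.one_hot(tf.random.uniform([1024], maxval=10, dtype=tf.int32), 10)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch_size)

with strategy.scope():
    # Throwaway stand-in for build_model(); any Keras model works here.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

# Keras splits each global batch across the GPUs in the strategy.
model.fit(dataset, epochs=2)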
Launch TensorBoard
tensorboard --logdir=./logs &
Kill the TensorBoard process when finished
ps -ef | grep tensorboard | grep -v grep | awk '{print $2}' | xargs kill
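The --logdir flag above assumes training writes summaries to ./logs. With Keras, the usual way to produce them is a TensorBoard callback; this minimal sketch reuses the model and dataset from the MirroredStrategy example above.
import tensorflow as tf

# Write loss/metric summaries (and weight histograms) under ./logs so that
# tensorboard --logdir=./logs has something to display.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(dataset, epochs=5, callbacks=[tb_callback])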
SSH tunneling for TensorBoard (run on the local machine, then open localhost:6006 in a browser)
ssh -i my-aws-key.pem -NL 6006:localhost:6006 ubuntu@ec2-public-dns
Check GPU utilization
watch -n 0.5 nvidia-smi
References
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
https://keras.io/guides/distributed_training
https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-tensorboard