TensorFlow for Trainium instances?


Hi,

Is there more documentation/examples for TensorFlow on Trn1/Trn1n instances?

The documentation at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/index.html has docs/examples on TensorFlow for Inference (Inf1/Inf2), but not for Training (Trn1/Trn1n).

For training, only PyTorch seems to be available. Is this correct? Or is more coming soon?

Thanks!

3 Answers
AWS
EXPERT
answered 2 months ago
AWS
EXPERT
iBehr
reviewed 2 months ago
  • Thanks for answering! The particular version you refer to (2.9.1) explicitly says: "TensorFlow Neuron support for Training is coming soon." The current version, 2.18.2 (i.e. "latest" as in my link above), no longer says this explicitly, but it still lacks any link to docs/samples for Training; only Inference is covered.

Accepted Answer

@arnoud The Neuron devices are not a drop-in replacement; they work very differently under the hood. To take advantage of the devices, the code needs to call them. We've done a lot of that work for you by releasing libraries (like tensorflow-neuronx) that implement various functions on the Neuron devices, but the code above them still needs to call them in a way that they support.
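
To make that concrete, here is a minimal sketch of how tensorflow-neuronx is typically called today, which is an inference flow: you trace/compile the model ahead of time and then run the traced artifact. The toy model, shapes, and variable names below are placeholders for illustration, not something from your setup:

```python
# Minimal tensorflow-neuronx inference sketch (illustrative; model and shapes are placeholders).
# Run inside a Neuron TensorFlow environment on an Inf2/Trn1 instance.
import tensorflow as tf
import tensorflow_neuronx as tfnx

# Any traceable Keras model / tf.function can be used; a tiny dense model stands in here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
example_input = tf.random.uniform([1, 784], dtype=tf.float32)

# trace() compiles the graph with neuronx-cc ahead of time and returns a callable
# that executes on the Neuron device for inputs matching the traced shapes.
neuron_model = tfnx.trace(model, example_input)
print(neuron_model(example_input))
```

The point is that the Neuron work happens inside that trace/compile step; plain TensorFlow training code does not call into the device on its own.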

However, for TensorFlow, we haven't updated that library recently. https://awsdocs-neuron.readthedocs-hosted.com/en/v2.18.2/release-notes/tensorflow/tensorflow-neuronx/tensorflow-neuronx.html#tensorflow-neuronx-release-notes

Not all of the functions in PyTorch are available in Neuron, so that will cause a lot of compiler warnings/errors. It really depends on the code and what it is using (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/pytorch-neuron-supported-operators.html for a list of supported operators).
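
For comparison, this is roughly the pattern the PyTorch training path uses on Trn1 through torch-neuronx (which sits on top of torch-xla). It's a sketch with a placeholder model and random data, just to show where neuronx-cc gets invoked and where unsupported operators would surface:

```python
# Rough torch-neuronx training pattern (illustrative; model and data are placeholders).
import torch
import torch_xla.core.xla_model as xm  # installed alongside torch-neuronx

device = xm.xla_device()                 # Trainium NeuronCore exposed as an XLA device
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 784).to(device)
    y = torch.randint(0, 10, (32,)).to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()                       # triggers XLA graph compilation (neuronx-cc) and execution
```

Operators the compiler doesn't support show up at that compile step as warnings or fallbacks, which is why how much friction you hit depends on what the model actually uses.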

While it may be technically possible to run TensorFlow-based training on Neuron, I haven't seen any examples of it being done. If you could get it working, I suspect it would take a lot of updates to the training code.

AWS
EXPERT
answered 8 days ago

Arnoud,

Inferentia and Trainium both run the same NeuronCore-v2 architecture. The main difference between them is the device interconnects.

Despite the name, it may make sense to use a Trainium instance for inference, especially if you are using a larger model.

Anything you can do with TensorFlow on Inferentia, you can do on Trainium with the same code.

However, most of our training examples are built around PyTorch or our NeuronX Distributed or Transformers-NeuronX libraries.

Jim

AWS
EXPERT
answered 2 months ago
  • Hi Jim, thank you for your answer. It's interesting that you say you can basically use Trainium for inference and Inferentia for training, then; is it just a difference in hardware optimizations/interconnects?

    I cannot do TensorFlow training with Trn1 hardware, though. I install the correct environment (aws_neuronx_venv_tensorflow_2_10), and I can run TensorFlow, yes, but it tries to find Nvidia libraries and then just falls back to using Intel AVX. There's no neuronx-cc compiler running; it simply doesn't use the Trainium devices. Only PyTorch actually invokes the neuronx-cc compiler (and there are lots of compiler bugs, unfortunately, but that's another story).

    Any advice? Can TensorFlow be used with Neuron hardware for training? (not inference)

    Thanks! Arnoud.