Hi,
At the moment, you can use TensorFlow on Trainium via the AWS Neuron SDK.
See:
- https://awsdocs-neuron.readthedocs-hosted.com/en/v2.9.1/frameworks/tensorflow/index.html
- https://repost.aws/articles/AR5rLph4rOQae4So67dIT54g/accelerating-sagemaker-training-jobs-running-on-aws-trainium
Best,
Didier
@arnoud The Neuron devices are not a drop-in replacement; they work very differently under the hood. To take advantage of the devices, the code needs to call them. We've done a lot of that work for you by releasing certain libraries (like tensorflow-neuronx) that implement various functions on the Neuron devices, but the code above them still needs to call them in a way that they support.
However, for TensorFlow we haven't updated that library recently; see the release notes: https://awsdocs-neuron.readthedocs-hosted.com/en/v2.18.2/release-notes/tensorflow/tensorflow-neuronx/tensorflow-neuronx.html#tensorflow-neuronx-release-notes
Not all PyTorch operations are available in Neuron, so unsupported ones will cause a lot of compiler warnings/errors. It really depends on the code and what it is using. (See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/pytorch-neuron-supported-operators.html for a list of supported operators.)
While it may be technically possible to run TensorFlow-based training on Neuron, I haven't seen any examples of it being done. Even if you could get it working, I suspect it would take a lot of updates to the training code.
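For contrast, the PyTorch training path mentioned above, which does invoke the neuronx-cc compiler, might be sketched roughly like this. This is a minimal illustration, not official sample code; it assumes a Trn1 instance with torch-neuronx installed, and the model and data are placeholders:

```python
# Minimal sketch of the torch-neuronx training path on a Trn1 instance.
# torch-neuronx plugs into PyTorch via the XLA backend; moving tensors
# to the XLA device is what routes work to the NeuronCores.
import torch
import torch_xla.core.xla_model as xm  # installed as part of torch-neuronx

device = xm.xla_device()                  # maps to a NeuronCore on Trn1
model = torch.nn.Linear(10, 2).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(8, 10).to(device)          # placeholder batch
    y = torch.randint(0, 2, (8,)).to(device)
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # flushes the XLA graph; compilation goes through neuronx-cc
```

This only runs on Neuron hardware; on a machine without torch-neuronx the `torch_xla` import will fail, which is itself a quick way to tell you're not in a Neuron environment.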
Arnoud,
Inferentia and Trainium both run the same Neuron Core v2 architecture. The main difference between them is the device interconnects.
Despite the name, it may make sense to use a Trainium instance for inference, especially if you are using a larger model.
Anything you can do with TensorFlow on Inferentia you can do on Trainium with the same code.
However, most of our training examples are built around PyTorch or our NeuronX Distributed or Transformers-NeuronX libraries.
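As an illustration of the TensorFlow inference path described above, compiling a traced model for a NeuronCore with tensorflow-neuronx looks roughly like this. This is a sketch assuming an Inf2 or Trn1 instance with tensorflow-neuronx installed; ResNet50 is just a placeholder model:

```python
# Sketch: compile a TensorFlow model for NeuronCore v2 via tracing.
# Requires an Inf2/Trn1 instance with tensorflow-neuronx installed.
import tensorflow as tf
import tensorflow_neuronx as tfnx

model = tf.keras.applications.ResNet50()        # placeholder model
example = tf.random.uniform([1, 224, 224, 3])   # example input for tracing

# trace() compiles the traced graph with neuronx-cc for the NeuronCores
model_neuron = tfnx.trace(model, example)
model_neuron.save('resnet50_neuron')            # reloadable SavedModel
```

Note this is the inference workflow; as discussed elsewhere in this thread, there is no equivalent documented training workflow for TensorFlow on Neuron.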
Jim
Hi Jim, thank you for your answer. Interesting that you say you can basically use Trainium for inference and Inferentia for training, with the difference being just the hardware optimizations/interconnects?
I cannot get TensorFlow training working on Trn1 hardware, though. I install the correct environment (aws_neuronx_venv_tensorflow_2_10) and I can run TensorFlow, yes, but it tries to find Nvidia libraries and then just falls back to Intel AVX. The neuronx-cc compiler never runs; it simply doesn't use the Trainium devices. Only PyTorch actually invokes the neuronx-cc compiler (and there are lots of compiler bugs, unfortunately, but that's another story).
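One framework-independent way to check whether the NeuronCores are visible and being exercised at all is the Neuron system tools. A sketch, assuming a Trn1 instance with the Neuron tools installed:

```shell
# List the Neuron devices and any processes attached to them
neuron-ls

# Watch NeuronCore utilization while the job runs; if TensorFlow has
# fallen back to CPU/AVX, utilization will stay at zero
neuron-top
```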
Any advice? Can TensorFlow be used with Neuron hardware for training? (not inference)
Thanks! Arnoud.
Thanks for answering! The version you refer to (2.9.1) explicitly says: "TensorFlow Neuron support for Training is coming soon." The current version 2.18.2 (i.e. "latest", as in my link above) no longer says this explicitly, but it also lacks any link to training docs/samples; only inference is covered, not training.