Few hours ago Google announced his TensorFlow 0.8 that includes distributed computing support. As we already presented in this blog, distributed TensorFlow is powered by the high-performance gRPC library, which supports training on hundreds of machines in parallel according Google post. It complements the recent announcement of Google Cloud Machine Learning, which enables us to use the Google Cloud Platform.
The post also announces that they have published a distributed trainer for the Inception image classification neural network in the TensorFlow models repository. The distributed trainer also enables us to scale out training using a cluster management system like Kubernetes from Google. Furthermore, once we have trained our model, we can deploy to production and speed up inference using TensorFlow Serving on Kubernetes. Beyond distributed Inception, the 0.8 release includes new libraries for defining our own distributed models.
Using the distributed trainer, they trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs: