AWS and Facebook launch an open-source model server for PyTorch


AWS and Facebook today announced two new open-source projects around PyTorch, the popular open-source machine learning framework. The first of these is TorchServe, a model serving framework for PyTorch that will make it easier for developers to put their models into production. The other is TorchElastic, a library that makes it easier for developers to build fault-tolerant training jobs on Kubernetes clusters, including AWS’s EC2 spot instances and Elastic Kubernetes Service.
In many ways, the two companies are taking what they have learned from running their own machine learning systems at scale and are putting this into the project. For AWS, that’s mostly SageMaker, the company’s machine learning platform, but as Bratin Saha, AWS VP and GM for Machine Learning Services, told me, the work on PyTorch was mostly motivated by requests from the community. And while there are obviously other model servers like TensorFlow Serving and the Multi Model Server available today, Saha argues that it would be hard to optimize those for PyTorch.
“If we tried to take some other model server, we would not be able to quote optimize it as much, as well as create it within the nuances of how PyTorch developers like to see this,” he said. AWS has lots of experience in running its own model servers for SageMaker that can handle multiple frameworks, but the community was asking for a model server that was tailored toward how they work. That also meant adapting the server’s API to what PyTorch developers expect from their framework of choice, for example.
As Saha told me, the server that AWS and Facebook are now launching as open source is similar to what AWS is using internally. “It’s quite close,” he said. “We actually started with what we had internally for one of our model servers and then put it out to the community, worked closely with Facebook, to iterate and get feedback — and then modified it so it’s quite close.”
Bill Jia, Facebook’s VP of AI Infrastructure, also told me, he’s very happy about how his team and the community has pushed PyTorch forward in recent years. “If you look at the entire industry community — a large number of researchers and enterprise users are using AWS,” he said. “And then we figured out if we can collaborate with AWS and push PyTorch together, then Facebook and AWS can get a lot of benefits, but more so, all the users can get a lot of benefits from PyTorch. That’s our reason for why we wanted to collaborate with AWS.”
As for TorchElastic, the focus here is on allowing developers to create training systems that can work on large distributed Kubernetes clusters where you might want to use cheaper spot instances. Those are preemptible, though, so your system has to be able to handle that, while traditionally, machine learning training frameworks often expect a system where the number of instances stays the same throughout the process. That, too, is something AWS originally built for SageMaker. There, it’s fully managed by AWS, though, so developers never have to think about it. For developers who want more control over their dynamic training systems or stay very close to the metal, TorchElastic now allows them to recreate this experience on their own Kubernetes clusters.
AWS has a bit of a reputation when it comes to open source and its engagement with the open-source community. In this case, though, it’s nice to see AWS lead the way to bring some of its own work on building model servers, for example, to the PyTorch community. In the machine learning ecosystem, that’s very much expected, and Saha stressed that AWS has long engaged with the community as one of the main contributors to MXNet and through its contributions to projects like Jupyter, TensorFlow and libraries like NumPy.
AWS and Facebook today announced two new open-source projects around PyTorch, the popular open-source machine learning framework. The first of these is TorchServe, a model serving framework for PyTorch that will make it easier for developers to put their models into production. The other is TorchElastic, a library that makes…
Recent Posts
- Silo season 3: Everything we know so far about the Apple TV Plus show
- The iOS 18.4 beta brings Matter robot vacuum support
- Philips Monitors is now offering a whopping 5-year warranty on some of its displays, including a gorgeous KVM-enabled business monitor
- The secretive X-37B space plane snapped this picture of Earth from orbit
- Beyond 100TB, here’s how Western Digital is betting on heat dot magnetic recording to reach the storage skies
Archives
- February 2025
- January 2025
- December 2024
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- May 2024
- April 2024
- March 2024
- February 2024
- January 2024
- December 2023
- November 2023
- October 2023
- September 2023
- August 2023
- July 2023
- June 2023
- May 2023
- April 2023
- March 2023
- February 2023
- January 2023
- December 2022
- November 2022
- October 2022
- September 2022
- August 2022
- July 2022
- June 2022
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- March 2021
- February 2021
- January 2021
- December 2020
- November 2020
- October 2020
- September 2020
- August 2020
- July 2020
- June 2020
- May 2020
- April 2020
- March 2020
- February 2020
- January 2020
- December 2019
- November 2019
- September 2018
- October 2017
- December 2011
- August 2010