Distributed Deep Learning Made Easy with Spark 3.4
Table of Contents
- Introduction
- Distributed training
- Distributed inference
- How to use the new API for distributed inference
- How to load training data from a distributed file store
1. Introduction
The article discusses Spark 3.4, which enables easy distributed deep learning. Most deep learning frameworks were designed for single-node environments, making their distributed training and inference APIs an afterthought. Spark 3.4 addresses this: its new TorchDistributor API for PyTorch follows the existing spark-tensorflow-distributor API for TensorFlow, simplifying the migration of distributed DL model training code to Spark.
2. Distributed training
The TorchDistributor and spark-tensorflow-distributor APIs use Spark's barrier execution mode to spawn the processes of a distributed DL cluster on top of the Spark executors. Once launched, the processes running on the executors rely on the built-in distributed training APIs of their respective DL frameworks; note that these APIs do not use Spark RDDs or DataFrames for data transfer, as shown in the sketch below.
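As a concrete illustration, here is a minimal sketch of launching distributed PyTorch training with TorchDistributor. The body of the training function and its hyperparameters are illustrative placeholders, not part of the Spark API:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train(learning_rate, epochs):
    import torch.distributed as dist

    # TorchDistributor sets MASTER_ADDR, RANK, etc., so the standard
    # PyTorch process-group setup works unchanged on the executors.
    dist.init_process_group(backend="nccl")
    # ... build the model, wrap it in DistributedDataParallel, and run
    # the usual PyTorch training loop here ...
    dist.destroy_process_group()

# Spawn 4 GPU-backed training processes on top of the Spark executors.
TorchDistributor(num_processes=4, local_mode=False, use_gpu=True).run(
    train, 1e-3, 10
)
```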
3. Distributed inference
A new API builds on Spark's Pandas UDF to provide a simpler interface for DL model inference. The Pandas UDF API alone is not ideal for DL inference, since it presents data as a Pandas Series or DataFrame, while most DL frameworks expect NumPy arrays or standard Python arrays as input, often wrapped in framework-specific tensor types. A plain Pandas UDF implementation therefore has to translate the incoming Pandas data to NumPy arrays at a minimum.
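A sketch of that translation burden, assuming a hypothetical load_model helper and a model with a NumPy-based predict method:

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

@pandas_udf(FloatType())
def predict_udf(features: pd.Series) -> pd.Series:
    model = load_model("/path/to/model")    # hypothetical loader; naively
                                            # re-run for every batch
    batch = np.stack(features.to_numpy())   # Pandas -> NumPy translation
    preds = model.predict(batch)            # frameworks expect NumPy input
    return pd.Series(preds.squeeze())
```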
4. How to use the new API for distributed inference
This new API, predict_batch_udf in pyspark.ml.functions, hides the complexity of translating DL inference code to Spark. The user defines a function that uses standard DL APIs to load the model and returns a prediction function. Because the API operates on standard Spark DataFrames, the executors read data from the distributed file system and pass it to that prediction function, and any data preprocessing can be done inline with the model prediction.
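A minimal sketch of this flow with predict_batch_udf (new in Spark 3.4); the model path, the Torch model, and the column names are illustrative assumptions:

```python
import numpy as np
import torch
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

def make_predict_fn():
    # Runs once per Python worker: load the model with standard DL APIs.
    model = torch.load("/dbfs/path/to/model.pt")  # illustrative path
    model.eval()

    def predict(inputs: np.ndarray) -> np.ndarray:
        # Spark delivers batches as NumPy arrays; any preprocessing can
        # happen inline here before the model prediction.
        with torch.no_grad():
            return model(torch.from_numpy(inputs)).numpy()

    return predict

classify = predict_batch_udf(
    make_predict_fn, return_type=ArrayType(FloatType()), batch_size=64
)

# The executors read the DataFrame from the distributed file system and
# feed it to the prediction function.
df = spark.read.parquet("/dbfs/path/to/features")  # illustrative path
preds = df.withColumn("prediction", classify("features"))
```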
5. How to load training data from a distributed file store
For distributed training, the article suggests using NVTabular, a GPU-accelerated data loading and preprocessing library, to read training data directly from a distributed file store like S3 inside the training function.
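One possible sketch, assuming the PyTorch data loader from the companion merlin-dataloader package and a placeholder S3 bucket:

```python
import nvtabular as nvt
from merlin.loader.torch import Loader  # assumes merlin-dataloader installed

# NVTabular reads the Parquet shards directly from the distributed store;
# the bucket path is a placeholder.
dataset = nvt.Dataset("s3://my-bucket/train/*.parquet", engine="parquet")

# Iterate GPU-resident batches inside the PyTorch training loop.
loader = Loader(dataset, batch_size=65536)
for inputs, labels in loader:
    ...  # forward/backward pass with the DL framework of choice
```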
Learn more about this API at the 2023 Data+AI Summit session, "An API for Deep Learning Inferencing on Apache Spark."