Distributed Deep Learning Made Easy with Spark 3.4
Table of Contents
- Introduction
- Distributed training
- Distributed inference
- How to use the new API for distributed inference
- How to load training data from a distributed file store
1. Introduction
The article discusses Spark 3.4, which enables easy distributed deep learning. Most deep learning frameworks were designed for single-node environments, making their distributed training and inference APIs an afterthought. Spark 3.4 addresses this: its new TorchDistributor API for PyTorch follows the existing spark-tensorflow-distributor API for TensorFlow, simplifying the migration of distributed DL model training code to Spark.
2. Distributed training
The TorchDistributor and spark-tensorflow-distributor APIs use Spark's barrier execution mode to spawn the processes of a distributed DL cluster on top of the Spark executors. Once launched, the processes running on the executors rely on the built-in distributed training APIs of their respective DL frameworks; note that these APIs do not use Spark RDDs or DataFrames for data transfer, as shown in the sketch below.
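As a concrete illustration, here is a minimal sketch of launching distributed PyTorch training with TorchDistributor. The body of the training function and its hyperparameters are illustrative placeholders, not part of the Spark API:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train(learning_rate, epochs):
    import torch.distributed as dist

    # TorchDistributor sets MASTER_ADDR, RANK, etc., so the standard
    # PyTorch process-group setup works unchanged on the executors.
    dist.init_process_group(backend="nccl")
    # ... build the model, wrap it in DistributedDataParallel, and run
    # the usual PyTorch training loop here ...
    dist.destroy_process_group()

# Spawn 4 GPU-backed training processes on top of the Spark executors.
TorchDistributor(num_processes=4, local_mode=False, use_gpu=True).run(
    train, 1e-3, 10
)
```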
3. Distributed inference
A new API builds on Spark's Pandas UDF to provide a simpler interface for DL model inference. The Pandas UDF API alone is not ideal for DL inference, since it presents data as a Pandas Series or DataFrame, while most DL frameworks expect NumPy arrays or standard Python arrays as input, often wrapped in framework-specific tensor types. A plain Pandas UDF implementation therefore has to translate the incoming Pandas data to NumPy arrays at a minimum.
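A sketch of that translation burden, assuming a hypothetical load_model helper and a model with a NumPy-based predict method:

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

@pandas_udf(FloatType())
def predict_udf(features: pd.Series) -> pd.Series:
    model = load_model("/path/to/model")    # hypothetical loader; naively
                                            # re-run for every batch
    batch = np.stack(features.to_numpy())   # Pandas -> NumPy translation
    preds = model.predict(batch)            # frameworks expect NumPy input
    return pd.Series(preds.squeeze())
```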
4. How to use the new API for distributed inference
This new API, predict_batch_udf in pyspark.ml.functions, hides the complexity of translating DL inference code to Spark. The user defines a function that uses standard DL APIs to load the model and returns a prediction function. Because the API operates on standard Spark DataFrames, the executors read data from the distributed file system and pass it to that prediction function, and any data preprocessing can be done inline with the model prediction.
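A minimal sketch of this flow with predict_batch_udf (new in Spark 3.4); the model path, the Torch model, and the column names are illustrative assumptions:

```python
import numpy as np
import torch
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

def make_predict_fn():
    # Runs once per Python worker: load the model with standard DL APIs.
    model = torch.load("/dbfs/path/to/model.pt")  # illustrative path
    model.eval()

    def predict(inputs: np.ndarray) -> np.ndarray:
        # Spark delivers batches as NumPy arrays; any preprocessing can
        # happen inline here before the model prediction.
        with torch.no_grad():
            return model(torch.from_numpy(inputs)).numpy()

    return predict

classify = predict_batch_udf(
    make_predict_fn, return_type=ArrayType(FloatType()), batch_size=64
)

# The executors read the DataFrame from the distributed file system and
# feed it to the prediction function.
df = spark.read.parquet("/dbfs/path/to/features")  # illustrative path
preds = df.withColumn("prediction", classify("features"))
```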
5. How to load training data from a distributed file store
For distributed training, the article suggests using NVTabular, a GPU-accelerated data loading and preprocessing library, to read training data directly from a distributed file store like S3 inside the training function.
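One possible sketch, assuming the PyTorch data loader from the companion merlin-dataloader package and a placeholder S3 bucket:

```python
import nvtabular as nvt
from merlin.loader.torch import Loader  # assumes merlin-dataloader installed

# NVTabular reads the Parquet shards directly from the distributed store;
# the bucket path is a placeholder.
dataset = nvt.Dataset("s3://my-bucket/train/*.parquet", engine="parquet")

# Iterate GPU-resident batches inside the PyTorch training loop.
loader = Loader(dataset, batch_size=65536)
for inputs, labels in loader:
    ...  # forward/backward pass with the DL framework of choice
```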
Learn more about this API at the 2023 Data+AI Summit session, "An API for Deep Learning Inferencing on Apache Spark."