NVIDIA Technical Blog

Robust Scene Text Detection and Recognition: Implementation

thumbnail
  • To implement a robust scene text detection and recognition system, it is important to have control over the model for customizations. Incremental learning and fine-tuning can be performed as per specific use cases and datasets.

  • Model inference optimization is crucial for accuracy and low latency. Tools like NVIDIA TensorRT and ONNX Runtime are used for this purpose.

  • To ensure standard model deployment and execution, NVIDIA Triton Inference Server is used, which provides high-performing inference with scalability.

  • Docker container images from the NGC catalog are used for training the models, which is a hub for GPU-optimized AI and ML software.

  • Triton Inference Server utilizes multiple GPUs efficiently by creating an instance of each model on each GPU.

  • The STDR pipeline consists of three main building blocks: scene text detection, scene text recognition, and orchestration.

  • Various text detection algorithms like FCENet, CRAFT, and TextFuseNet can be used and trained/fine-tuned for specific use cases.

  • The state-of-the-art PARSeq algorithm is used for scene text recognition, and incremental learning techniques are employed to fine-tune pretrained models on custom datasets.

  • The orchestration module coordinates between the text detection and recognition modules. It preprocesses input images, performs text detection, crops out text fields, creates batches of cropped text images, and sends them one batch at a time to the text recognition module.

  • The text recognition module returns the output with confidence scores for each cropped text image within the batch.

  • NVIDIA TensorRT, ONNX Runtime, and Triton Inference Server are used for model optimization and high-performance inference serving.

  • This implementation highlights the use of deep learning algorithms and techniques for robust scene text detection and recognition. The CRAFT algorithm is used for text detection, and the PARSeq algorithm is used for text recognition.

  • The use of NVIDIA tools and frameworks ensures efficient model optimization and deployment. For more detailed information, refer to the related posts in this series.