NVIDIA Technical Blog

Developing a 172B LLM with Strong Japanese Capabilities Using NVIDIA Megatron-LM


Table of Contents

  1. Introduction
  2. LLM-jp Initiatives at GENIAC
  3. Model Architecture and Training Settings
  4. Training Efficiency and Acceleration with Megatron-LM

Introduction

Training a 172B large language model (LLM) with strong Japanese capabilities using NVIDIA Megatron-LM is a challenging but essential endeavor. This post shares insights gained from that training effort, carried out as part of the Generative AI Accelerator Challenge (GENIAC) project, which addresses the scarcity of high-performance models for Japanese language understanding.

LLM-jp Initiatives at GENIAC

The Ministry of Economy, Trade and Industry (METI) launched the GENIAC project to strengthen foundation model development capabilities in Japan. The LLM-jp 172B model, developed from February to August 2024, marked a significant milestone in Japanese model development. Researchers launched the LLM-jp initiative to accumulate and share expertise on training know-how and model efficiency, and the 172B run relied on Megatron-Core and Transformer Engine for large-scale training.

Model Architecture and Training Settings

The LLM-jp 172B model follows the Llama 2 architecture and is trained from scratch on a multilingual corpus composed predominantly of Japanese and English data. Megatron-Core v0.6 and Transformer Engine (TE) v1.4 were instrumental in the training process. Training began in BF16 plus TE and, once stability was confirmed, switched to FP8 hybrid to increase speed. With FP8 hybrid, training throughput reached 545-553 TFLOP/s, and the model has shown promising Japanese language capabilities.
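To make the precision setup concrete, the sketch below shows how FP8 hybrid training is typically enabled through Transformer Engine's PyTorch API: forward-pass tensors use the E4M3 format while backward-pass gradients use E5M2, and anything outside the autocast region stays in BF16. This is a minimal illustration of the Transformer Engine mechanism, not the actual LLM-jp training code; the layer size, batch shape, and recipe settings are placeholders chosen for the example.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# "Hybrid" FP8 recipe: E4M3 for forward-pass tensors, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Illustrative layer and batch sizes (FP8 needs dimensions that are
# multiples of 16 / 8); parameters are kept in BF16.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Forward pass under the FP8 autocast context; computation outside the
# context remains in BF16, mirroring the BF16 -> FP8-hybrid transition.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

loss = out.float().sum()
loss.backward()
```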

Training Efficiency and Acceleration with Megatron-LM

Efficient training frameworks such as Megatron-LM are vital for accelerating generative AI research. The 172B training run used Megatron-LM to explore FP8-hybrid training, which raised throughput from roughly 400 TFLOP/s to 550 TFLOP/s, a 1.4x speedup. This result highlights the effectiveness of FP8 hybrid in making large-scale pretraining more efficient.
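For intuition on what that throughput jump means in wall-clock terms, the short script below does a back-of-the-envelope comparison. The per-GPU TFLOP/s figures and the 172B parameter count come from this post; the token budget and GPU count are hypothetical placeholders, and the 6 * parameters * tokens estimate is the standard approximation for dense-transformer training FLOPs, not a figure reported here.

```python
# Back-of-the-envelope view of the BF16 -> FP8-hybrid speedup.
bf16_tflops = 400.0   # approx. sustained TFLOP/s with BF16 + TE
fp8_tflops = 550.0    # approx. sustained TFLOP/s with FP8 hybrid

speedup = fp8_tflops / bf16_tflops
print(f"FP8-hybrid speedup: {speedup:.2f}x")  # ~1.4x

# Standard ~6 * N * T estimate of total training FLOPs for a dense
# transformer with N parameters trained on T tokens (an approximation).
n_params = 172e9
tokens = 2e12          # HYPOTHETICAL token budget, for illustration only
num_gpus = 1024        # HYPOTHETICAL cluster size, for illustration only

total_flops = 6 * n_params * tokens

def training_days(per_gpu_tflops: float) -> float:
    """Wall-clock days to consume the token budget at a given throughput."""
    return total_flops / (per_gpu_tflops * 1e12 * num_gpus) / 86400

print(f"BF16: {training_days(bf16_tflops):.0f} days, "
      f"FP8 hybrid: {training_days(fp8_tflops):.0f} days")
```

Under these illustrative assumptions, the same token budget completes roughly 1.4x sooner at the higher throughput, which is where the practical value of FP8 hybrid shows up at this scale.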