Benchmark and Scaling

Many factors affect training performance. We provide some benchmark results here for reference.

Scaling to Clusters of GPUs

We conduct scaling experiments on up to 256 NVIDIA A100 40G GPUs. We thank Juelich Supercomputing Center and Ontocord for their generous support in providing the computing resources.

Benchmark 1: Fine-tuning Llama-7B

Configuration

- Train batch size (GBS): 512 when using 64 or fewer GPUs; otherwise GBS = #GPUs * 8 (for example, GBS = 1024 with 128 GPUs). A sketch of this rule follows the list.
- Micro batch size (MBS): 8
- Sequence length: 2048
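As a rough illustration of the batch-size rule above, here is a minimal Python sketch; the function name and the printed GPU counts are ours for illustration and are not part of any benchmark script.

```python
def global_batch_size(num_gpus: int) -> int:
    """Global (train) batch size used in Benchmark 1.

    512 for 64 or fewer GPUs, otherwise 8 samples per GPU.
    """
    return 512 if num_gpus <= 64 else num_gpus * 8


for n in (8, 64, 128, 256):
    print(n, global_batch_size(n))  # e.g. 128 -> 1024, 256 -> 2048
```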

[Figure: Total tokens/second vs. number of GPUs]

[Figure: Tokens/GPU/second vs. number of GPUs]
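For reference, the two plotted metrics follow from the configuration above in the usual way: total tokens/second is the global batch size times the sequence length divided by the time per optimizer step, and tokens/GPU/second divides that by the GPU count. The sketch below uses a hypothetical step time, not a measured value.

```python
def throughput(global_batch_size: int, seq_len: int, step_time_s: float, num_gpus: int):
    """Return (total tokens/second, tokens/GPU/second) for one training step."""
    total_tokens_per_s = global_batch_size * seq_len / step_time_s
    return total_tokens_per_s, total_tokens_per_s / num_gpus


# Hypothetical example: GBS=512, sequence length 2048, 10 s per step on 64 GPUs.
total, per_gpu = throughput(512, 2048, 10.0, 64)
print(f"{total:.0f} tokens/s total, {per_gpu:.0f} tokens/GPU/s")
```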

Benchmark 2: Fine-tuning Llama-70B

Configuration

- Train batch size (GBS): 1024
- Micro batch size (MBS): 4
- Sequence length: 4096

[Figure: Total tokens/second vs. number of GPUs]

[Figure: Tokens/GPU/second vs. number of GPUs]