INT8 Quantization

W8A8-INT8 Dynamic Quantization

Activations are quantized dynamically per token at runtime, while weights are quantized statically per channel offline.
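As a rough illustration of the scheme described above, the sketch below applies symmetric INT8 quantization with one scale per row. It assumes activations are laid out as [tokens, hidden] and weights as [output_dim, input_dim], so "per-token" and "per-channel" both reduce to a per-row scale. The function names `quantize_rows` and `int8_linear` are hypothetical and are not part of the tool; real kernels also work on tensors rather than Python lists.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row (illustrative only)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(x, w):
    """Compute x @ w^T with INT8 operands and rescale the integer result.

    x: activations, one row per token (assumed layout [tokens, hidden]).
    w: weights, one row per output channel (assumed layout [output_dim, input_dim]).
    """
    xq, x_scales = quantize_rows(x)  # dynamic: computed from the live activations
    wq, w_scales = quantize_rows(w)  # static: would normally be computed offline
    out = []
    for t, x_row in enumerate(xq):
        out_row = []
        for c, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            out_row.append(acc * x_scales[t] * w_scales[c])  # back to floating point
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, -0.4, 0.1], [1.0, 0.5, -0.5]]
y = int8_linear(x, w)
```

Because the activation scales are derived from each incoming token, no calibration data is needed for activations; only the weight scales are fixed ahead of time.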

Run it as follows:

python3 tools/run.py -c configs/qwen3/int8_dynamic/qwen3-0_6b_int8_dynamic.yaml

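The quantization-related parameters in this config file are:

  • name: the compression strategy. Select quantization here (written as PTQ in the example config below).

  • quantization.name: the compression algorithm. Set it to int8_dynamic.

  • quantization.bits: the quantization bit width. Set it to 8 for INT8 quantization.

  • quantization.quant_method: the quantization granularity for weights and activations. This configuration uses per-channel for weights and per-token for activations.

  • quantization.ignore_layers: the linear layers to skip during quantization.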

compression:
  name: PTQ
  quantization:
    name: int8_dynamic     # Supported: fp8_static, fp8_dynamic, int4_awq, int4_gptq, int8_dynamic
    bits: 8                # Quantization bits
    quant_method:
      weight: "per-channel"
      activation: "per-token"
    ignore_layers:         # Skip quantization for these layers
      - "lm_head"

Output Model

Each quantized linear layer stores:

  • weight: 8-bit integers, with shape [output_dim, input_dim]

  • weight_scale: the scales used for dequantization, with shape [output_dim, 1]
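To make the role of the two stored tensors concrete, here is a minimal dequantization sketch. It assumes only that weight_scale holds one scale per row of the stored weight (a column of shape [rows, 1]), so each row of integers is multiplied by its own scale. The helper name `dequantize` and the numeric values are made up for illustration.

```python
def dequantize(weight_q, weight_scale):
    """Recover approximate floating-point weights from INT8 values and per-row scales.

    weight_q:     list of rows of INT8 integers.
    weight_scale: list of single-element rows, i.e. shape [rows, 1].
    """
    return [
        [q * scale_row[0] for q in row]
        for row, scale_row in zip(weight_q, weight_scale)
    ]


# Made-up example values: two rows, three columns.
weight_q = [[127, -64, 0], [-127, 32, 5]]
weight_scale = [[0.01], [0.002]]
weight_fp = dequantize(weight_q, weight_scale)
```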

Quantization-related settings in the config file config.json:

"quantization_config": {
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "input_activations": {
        "dynamic": true,
        "num_bits": 8,
        "strategy": "token",
        "type": "int"
      },
      "output_activations": null,
      "weights": {
        "dynamic": false,
        "num_bits": 8,
        "strategy": "channel",
        "type": "int"
      }
    }
  },
  "format": "int-quantized",
  "ignore": [
    "lm_head"
  ],
  "kv_cache_scheme": null,
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}
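As a quick sanity check on an exported model, the fields above can be read back with the standard json module. The sketch below embeds a trimmed copy of the configuration as a string so that it runs on its own; in practice the text would come from the model's config.json. The helper `describe_scheme` is hypothetical and only reads keys that appear in the example above.

```python
import json

# Trimmed copy of the quantization_config shown above.
CONFIG_TEXT = """
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "input_activations": {"dynamic": true, "num_bits": 8, "strategy": "token", "type": "int"},
        "output_activations": null,
        "weights": {"dynamic": false, "num_bits": 8, "strategy": "channel", "type": "int"}
      }
    },
    "format": "int-quantized",
    "ignore": ["lm_head"],
    "quant_method": "compressed-tensors"
  }
}
"""


def describe_tensor(spec):
    """Render one tensor spec as e.g. 'int8 per-channel static'."""
    mode = "dynamic" if spec["dynamic"] else "static"
    return f'{spec["type"]}{spec["num_bits"]} per-{spec["strategy"]} {mode}'


def describe_scheme(config):
    """Summarize the weight/activation scheme of the first config group."""
    qcfg = config["quantization_config"]
    group = qcfg["config_groups"]["group_0"]
    return {
        "weights": describe_tensor(group["weights"]),
        "activations": describe_tensor(group["input_activations"]),
        "ignored": qcfg["ignore"],
    }


scheme = describe_scheme(json.loads(CONFIG_TEXT))
```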

See the vLLM INT8 documentation for the configuration requirements for loading INT8-quantized models.