# FunASR-nano: switch to unified KV-cache LLM (#2995)
This PR updates FunASR-nano inference from the prefill+decode dual-model pipeline to a single unified KV-cache model.

## Summary

Previously, FunASR-nano required two separate ONNX models:
- `llm_prefill.onnx`
- `llm_decode.onnx`

This PR switches to a single model:
- `llm.onnx`

The new pipeline uses a static KV cache with a KV-delta incremental update mechanism and relies on `cache_position` to distinguish prefill from decode steps. This significantly simplifies model/session management and reduces deployment complexity.
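
For illustration, here is a minimal sketch of how one session can serve both phases. The input/output names (`input_ids`, `cache_position`, `past_k`/`past_v`, the KV deltas) and the cache dimensions are assumptions made for this example, not the exported model's actual interface:

```python
# Sketch only: assumes llm.onnx exposes the inputs/outputs named below.
import numpy as np
import onnxruntime as ort

# One session replaces the previous llm_prefill.onnx + llm_decode.onnx pair.
sess = ort.InferenceSession("llm.onnx")

# Static KV cache: allocated once with a fixed shape and updated in place.
# All dimensions here are made up for the sketch.
num_layers, num_heads, max_cache_len, head_dim = 24, 16, 1024, 64
k_cache = np.zeros((num_layers, 1, num_heads, max_cache_len, head_dim), np.float32)
v_cache = np.zeros_like(k_cache)


def step(input_ids: np.ndarray, cache_position: np.ndarray) -> np.ndarray:
    """One forward pass; cache_position tells the model whether this is a
    multi-token prefill or a single-token decode step."""
    logits, k_delta, v_delta = sess.run(
        None,
        {
            "input_ids": input_ids,
            "cache_position": cache_position,
            "past_k": k_cache,
            "past_v": v_cache,
        },
    )
    # KV-delta update: the model returns K/V only for the new positions,
    # and they are scattered back into the static cache at cache_position.
    k_cache[:, :, :, cache_position, :] = k_delta
    v_cache[:, :, :, cache_position, :] = v_delta
    return logits


# Prefill: cache_position covers every prompt token.
prompt = np.array([[101, 202, 303, 404]], dtype=np.int64)
logits = step(prompt, np.arange(prompt.shape[1], dtype=np.int64))

# Decode: cache_position is a single index that advances by one per step.
pos = prompt.shape[1]
token = np.argmax(logits[0, -1]).reshape(1, 1).astype(np.int64)
logits = step(token, np.array([pos], dtype=np.int64))
```

In the dual-model pipeline the same distinction required loading and switching between two sessions; here the `cache_position` argument alone selects the behavior.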

## Key changes

- **Single LLM session / single model file**: `llm.onnx` replaces `llm_prefill.onnx` + `llm_decode.onnx`.

- **Unified KV-cache implementation**:
  - static KV cache layout
  - KV-delta update for decode
  - `cache_position` distinguishes prefill vs. decode behavior

- **Config changes (breaking)**:
  - `funasr_nano.llm_prefill` and `funasr_nano.llm_decode` are removed
  - use only `funasr_nano.llm` (see the config sketch after this list)

- **Not backward compatible**:
  - users must re-export models in KV-delta/unified-KV format

- **Trade-off**: slightly slower inference, but less VRAM duplication
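
To illustrate the breaking config change, here is a hypothetical before/after snippet. Only the dotted key names come from this PR; representing them as a Python dict (rather than the actual dataclass fields or CLI flags in sherpa-onnx) is an assumption for readability:

```python
# Hypothetical illustration of the breaking config change. Only the key names
# (funasr_nano.llm_prefill, funasr_nano.llm_decode, funasr_nano.llm) come from
# this PR; the paths and the dict-style representation are placeholders.

# Before: two model files, two config entries.
old_config = {
    "funasr_nano.llm_prefill": "./funasr-nano/llm_prefill.onnx",
    "funasr_nano.llm_decode": "./funasr-nano/llm_decode.onnx",
}

# After: one unified KV-cache model; the old keys are removed and
# passing them should be treated as an error.
new_config = {
    "funasr_nano.llm": "./funasr-nano/llm.onnx",
}
```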