Mirror of https://github.com/k2-fsa/sherpa-onnx.git, synced 2026-01-09 07:41:06 +08:00
This PR updates FunASR-nano inference from the prefill+decode dual-model pipeline to a single unified KV-cache model.

## Summary

Previously, FunASR-nano required two separate ONNX models:

- `llm_prefill.onnx`
- `llm_decode.onnx`

This PR switches to a single model:

- `llm.onnx`

The new pipeline uses a static KV cache + KV-delta incremental update mechanism, and relies on `cache_position` to differentiate prefill vs. decode steps. This significantly simplifies model/session management and reduces deployment complexity.

## Key changes

- **Single LLM session / single model file**: `llm.onnx` replaces `llm_prefill.onnx` + `llm_decode.onnx`.
- **Unified KV-cache implementation**:
  - static KV cache layout
  - KV-delta update for decode
  - `cache_position` distinguishes prefill vs. decode behavior
- **Config changes (breaking)**:
  - `funasr_nano.llm_prefill` and `funasr_nano.llm_decode` are deprecated/removed
  - use only `funasr_nano.llm`
- **Not backward compatible**:
  - users must re-export models in KV-delta/unified-KV format
- **Trade-off**: slightly slower, but lower VRAM duplication
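The static KV cache with KV-delta updates described above can be illustrated with a minimal NumPy sketch. This is not the code from the PR and does not call ONNX Runtime or sherpa-onnx: the class `StaticKVCache`, the helper `is_prefill`, the tensor shapes, and the rule "a step whose first position is 0 is the prefill step" are all assumptions made for illustration. Only the name `cache_position` comes from the description.

```python
import numpy as np


class StaticKVCache:
    """Hypothetical fixed-size K/V buffers for one attention layer."""

    def __init__(self, max_len: int, num_heads: int, head_dim: int):
        # The buffers are allocated once and never resized ("static" layout).
        self.k = np.zeros((num_heads, max_len, head_dim), dtype=np.float32)
        self.v = np.zeros((num_heads, max_len, head_dim), dtype=np.float32)
        self.length = 0  # number of valid positions written so far

    def update(self, k_delta, v_delta, cache_position):
        """Write only the new keys/values (the delta) at cache_position."""
        self.k[:, cache_position, :] = k_delta
        self.v[:, cache_position, :] = v_delta
        self.length = int(cache_position[-1]) + 1


def is_prefill(cache_position) -> bool:
    # Assumption for this sketch: prefill starts at position 0,
    # while a decode step starts at a later position.
    return int(cache_position[0]) == 0


cache = StaticKVCache(max_len=8, num_heads=2, head_dim=4)

# Prefill: the whole prompt (3 tokens here) is written in one call.
prompt_pos = np.arange(3)
cache.update(np.ones((2, 3, 4)), np.ones((2, 3, 4)), prompt_pos)

# Decode: each step writes a single new position into the same buffers.
step_pos = np.array([3])
cache.update(np.full((2, 1, 4), 2.0), np.full((2, 1, 4), 2.0), step_pos)
```

Because both steps share one set of buffers, a single model can serve prefill and decode, which is consistent with the single-session design the PR describes.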
File description
- ./http_server.py It defines which files to serve. Files are saved in ./web.
- non_streaming_server.py WebSocket server for non-streaming models.
- vad-remove-non-speech-segments.py It uses silero-vad to remove non-speech segments and concatenate all speech segments into a single one.
- vad-with-non-streaming-asr.py It shows how to use VAD with a non-streaming ASR model for speech recognition from a microphone.
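The idea behind removing non-speech segments and concatenating the rest can be sketched as follows. This is a minimal illustration, not the contents of vad-remove-non-speech-segments.py: it does not run silero-vad or use the sherpa-onnx API. The function `keep_speech` and the hard-coded `segments` list are hypothetical; in practice the `(start, end)` sample ranges would come from a VAD.

```python
from typing import List, Tuple

import numpy as np


def keep_speech(samples: np.ndarray, segments: List[Tuple[int, int]]) -> np.ndarray:
    """Concatenate the [start, end) sample ranges listed in segments."""
    if not segments:
        # No speech detected: return an empty array of the same dtype.
        return np.zeros(0, dtype=samples.dtype)
    return np.concatenate([samples[start:end] for start, end in segments])


sample_rate = 16000
# Two seconds of dummy audio; each sample value equals its own index.
audio = np.arange(sample_rate * 2, dtype=np.float32)

# Hypothetical VAD output: speech at 0.00-0.25 s and 1.00-1.25 s.
segments = [(0, 4000), (16000, 20000)]

speech = keep_speech(audio, segments)
```

The result is a single array holding only the speech samples, in their original order.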