How exactly do I use V4 in api.py or api_v2.py? #2306

Open
yangyuke001 opened this issue Apr 21, 2025 · 15 comments

@yangyuke001

Thanks for open-sourcing this great project [flowers]! How exactly do I use the v4 version through the API?

@dignome

dignome commented Apr 22, 2025

The information commented at the top of api_v2.py is valid. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running the webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model it should be ready to go in tts_infer.yaml.

WebAPI documentation

python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml

Launch arguments:

 -a  -  bind address, default "127.0.0.1"
 -p  -  bind port, default 9880
 -c  -  path to the TTS config file, default "GPT_SoVITS/configs/tts_infer.yaml"

Endpoints:

Inference

endpoint: /tts
GET:

http://127.0.0.1:9880/tts?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_lang=zh&ref_audio_path=archive_jingyuan_1.wav&prompt_lang=zh&prompt_text=我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可&text_split_method=cut5&batch_size=1&media_type=wav&streaming_mode=true

POST:
{
    "text": "",                    # str. (required) text to be synthesized
    "text_lang": "",               # str. (required) language of the text to be synthesized
    "ref_audio_path": "",          # str. (required) reference audio path
    "aux_ref_audio_paths": [],     # list. (optional) auxiliary reference audio paths for multi-speaker tone fusion
    "prompt_text": "",             # str. (optional) prompt text for the reference audio
    "prompt_lang": "",             # str. (required) language of the prompt text for the reference audio
    "top_k": 5,                    # int. top k sampling
    "top_p": 1,                    # float. top p sampling
    "temperature": 1,              # float. temperature for sampling
    "text_split_method": "cut0",   # str. text split method, see text_segmentation_method.py for details.
    "batch_size": 1,               # int. batch size for inference
    "batch_threshold": 0.75,       # float. threshold for batch splitting.
    "split_bucket": True,          # bool. whether to split the batch into multiple buckets.
    "speed_factor": 1.0,           # float. control the speed of the synthesized audio.
    "streaming_mode": False,       # bool. whether to return a streaming response.
    "seed": -1,                    # int. random seed for reproducibility.
    "parallel_infer": True,        # bool. whether to use parallel inference.
    "repetition_penalty": 1.35,    # float. repetition penalty for T2S model.
    "sample_steps": 32,            # int. number of sampling steps for VITS model V3.
    "super_sampling": False,       # bool. whether to use super-sampling for audio when using VITS model V3.
}

RESP:
Success: returns the wav audio stream directly, HTTP code 200
Failure: returns a json with the error message, HTTP code 400

Command control

endpoint: /control

command:
"restart": restart the service
"exit": stop the service

GET:

http://127.0.0.1:9880/control?command=restart

POST:
{
    "command": "restart"
}

RESP: none

Switch GPT model

endpoint: /set_gpt_weights

GET:

http://127.0.0.1:9880/set_gpt_weights?weights_path=GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt

RESP:
Success: returns "success", HTTP code 200
Failure: returns a json with the error message, HTTP code 400

Switch SoVITS model

endpoint: /set_sovits_weights

GET:

http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth

RESP:
Success: returns "success", HTTP code 200
Failure: returns a json with the error message, HTTP code 400
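
For example, a minimal Python client for the POST route above might look like this (a sketch; ref.wav and the prompt text are placeholders for your own reference audio and its transcript):

import requests

payload = {
    "text": "先帝创业未半而中道崩殂。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",      # placeholder reference audio
    "prompt_text": "参考音频的文本。",   # placeholder transcript of ref.wav
    "prompt_lang": "zh",
    "text_split_method": "cut5",
    "media_type": "wav",
    "streaming_mode": False,
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload)
if resp.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(resp.content)
else:
    print(resp.status_code, resp.json())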

@yangyuke001
Author

@dignome Thanks for the reply. I can see that tts_infer.yaml has been updated with the v4 model, but the audio synthesized by calling api_v2.py directly sounds very strange, as if the models are mismatched. My tts_infer.yaml is as follows:
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
v1:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
  version: v1
  vits_weights_path: GPT_SoVITS/pretrained_models/s2G488k.pth
v2:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
  version: v2
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s2G2333k.pth
v3:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v3
  vits_weights_path: GPT_SoVITS/pretrained_models/s2Gv3.pth
v4:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
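
A quick way to check which version TTS_Config actually resolves from this yaml (a minimal sketch, run from the repo root, assuming TTS_Config exposes the attributes used later in this thread):

from GPT_SoVITS.TTS_infer_pack.TTS import TTS_Config

configs = TTS_Config("GPT_SoVITS/configs/tts_infer.yaml")
print(configs.version)            # "v4" expected if the custom section is honoured
print(configs.vits_weights_path)  # should end in gsv-v4-pretrained/s2Gv4.pth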

@dignome

dignome commented Apr 23, 2025

If there really is a difference, you could most likely show it by setting a fixed/static seed value and matching the other parameters when using both api_v2.py and inference_webui_fast.py. They should produce similar results.

For best speaker reproduction you should finetune a v4 model on a dataset containing at least 10 minutes of audio samples of that speaker using webui.py, then make sure those models are present in the config specified to api_v2.py via -c <path/to/your/config.yaml>.
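
One way to check at least the API side of that comparison is to send the same /tts request twice with a fixed seed and compare the returned bytes; a sketch (ref.wav and the prompt text are placeholders), which should match if the pipeline is deterministic for a fixed seed:

import requests

payload = {
    "text": "测试文本。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",      # placeholder reference audio
    "prompt_text": "参考音频的文本。",   # placeholder transcript
    "prompt_lang": "zh",
    "seed": 42,                       # fixed seed, per the suggestion above
    "streaming_mode": False,
}

a = requests.post("http://127.0.0.1:9880/tts", json=payload).content
b = requests.post("http://127.0.0.1:9880/tts", json=payload).content
print("identical output:", a == b)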

@inktree

inktree commented Apr 23, 2025

It's normal that it sounds strange: although v4 shares v3's architecture, their sampling rates differ, so calling the API with the same parameters as for v3 is bound to go wrong. You can change the relevant parts yourself.

The handling logic I've changed for myself currently looks like this:

# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1024,
        "win_size": 1024,
        "hop_size": 256,
        "num_mels": 100,
        "sampling_rate": 24000,
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)

# --- Added V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1280,
        "win_size": 1280,
        "hop_size": 320,
        "num_mels": 100,
        "sampling_rate": 32000,  # V4 uses a 32 kHz mel
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)

    elif version in {"v3", "v4"}:  # v3 or v4?
        logger.info("Using the V3/V4 decode path (vq_model.decode_encp + CFM/vocoder)...")
        # --- V3/V4 decoding logic ---
        # 1. Pick the target sampling rate and mel function
        if model_version == "v4":
            tgt_sr = 32000
            current_mel_fn = mel_fn_v4
            logger.info(f"V4 model: using {tgt_sr} Hz sampling rate and the V4 mel function.")
        else:  # V3
            tgt_sr = 24000
            current_mel_fn = mel_fn
            logger.info(f"V3 model: using {tgt_sr} Hz sampling rate and the V3 mel function.")

The problem, though, is that pyopenjtalk is broken, so Japanese requests still end up throwing errors. Quite a headache.
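
(For reference, a hypothetical way to fold the two mel configurations above into one table keyed by version, rather than keeping two near-identical lambdas; the import path for mel_spectrogram_torch is an assumption based on where the repo defines it:)

from functools import partial
from module.mel_processing import mel_spectrogram_torch  # assumed import path

MEL_PARAMS = {
    "v3": {"n_fft": 1024, "win_size": 1024, "hop_size": 256, "num_mels": 100,
           "sampling_rate": 24000, "fmin": 0, "fmax": None, "center": False},
    "v4": {"n_fft": 1280, "win_size": 1280, "hop_size": 320, "num_mels": 100,
           "sampling_rate": 32000, "fmin": 0, "fmax": None, "center": False},
}

def get_mel_fn(version: str):
    # Bind the parameters for "v3" or "v4" onto the shared mel helper.
    return partial(mel_spectrogram_torch, **MEL_PARAMS[version])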

@xy3xy3

xy3xy3 commented Apr 23, 2025

Is there a good solution for this yet?

@wangzai23333

Even after modifying tts_infer.yaml, running api_v2.py makes version automatically fall back to v2; the generated audio presumably has the wrong sampling rate and comes out garbled.

@yangyuke001
Author

yangyuke001 commented Apr 23, 2025

@wangzai23333 @inktree Yes. I also tried changing the relevant parts of tts_infer.yaml and api_v2.py myself, but never got good output, so I'd like to ask 花儿大佬 (the maintainer) to polish up a version of api_v2.py. :)

@dignome

dignome commented Apr 24, 2025

So is your issue resolved? api_v2.py worked for you?

@YunZLu

YunZLu commented Apr 24, 2025

> Even after modifying tts_infer.yaml, running api_v2.py makes version automatically fall back to v2; the generated audio presumably has the wrong sampling rate and comes out garbled.

The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py:

version = configs.get("version", "v2").lower()

In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change this to "v4", and then file a bug.
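
A slightly more general patch than hard-coding "v4" would be to prefer the custom section before falling back to the root key; a sketch against the yaml layout shown earlier in the thread:

# In TTS_Config: look inside the custom section first, then the root level.
version = (
    configs.get("custom", {}).get("version")
    or configs.get("version", "v2")
).lower()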

@bensonbs

bensonbs commented Apr 25, 2025

api_v4.py made with ChatGPT o3

'''api_v4.py – A lightweight FastAPI wrapper around GPT-SoVITS that defaults to the V4 acoustic stack but remains backward-compatible with v1-v3.

Key fixes (2025-04-24):
• Returned audio is now packed into valid WAV/RAW/OGG/AAC bytes in both streaming and non-stream modes (previous numpy.ndarray bug fixed).
• Shared helper _pack_audio replicates api_v2 logic using soundfile / ffmpeg only when needed.
'''

import os
import sys
import io
import argparse
import subprocess
from typing import Generator, Optional, List, Tuple

import numpy as np
import soundfile as sf
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse, JSONResponse
import uvicorn

# ---------------------------------------------------------------------------
# Project paths -------------------------------------------------------------
# ---------------------------------------------------------------------------
now_dir = os.getcwd()
sys.path.extend([now_dir, f"{now_dir}/GPT_SoVITS"])

# ---------------------------------------------------------------------------
# Internal imports ----------------------------------------------------------
# ---------------------------------------------------------------------------
from GPT_SoVITS.TTS_infer_pack.TTS import TTS, TTS_Config, NO_PROMPT_ERROR
from GPT_SoVITS.TTS_infer_pack.text_segmentation_method import get_method_names as _get_cut_methods
from tools.i18n.i18n import I18nAuto

i18n = I18nAuto()
_cut_methods = _get_cut_methods()

MEDIA_TYPES = {"wav", "raw", "ogg", "aac"}
SUPPORTED_VERSIONS = {"v1", "v2", "v3", "v4"}

# ---------------------------------------------------------------------------
# Audio utils ---------------------------------------------------------------
# ---------------------------------------------------------------------------

def _ffmpeg_pack(data: np.ndarray, sr: int, codec: str) -> bytes:
    """Pipe raw PCM through ffmpeg to AAC/OGG"""
    cmd = [
        "ffmpeg",
        "-f", "s16le", "-ar", str(sr), "-ac", "1", "-i", "pipe:0",
        "-c:a", codec, "-b:a", "192k", "-vn", "-f", "adts" if codec == "aac" else "ogg", "pipe:1",
    ]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate(input=data.tobytes())
    return out


def _pack_audio(data: np.ndarray, sr: int, media_type: str) -> bytes:
    if media_type == "wav":
        buf = io.BytesIO()
        sf.write(buf, data, sr, format="WAV")
        return buf.getvalue()
    if media_type == "raw":
        return data.tobytes()
    if media_type == "aac":
        return _ffmpeg_pack(data, sr, "aac")
    if media_type == "ogg":
        return _ffmpeg_pack(data, sr, "libvorbis")
    raise ValueError(f"Unsupported media_type {media_type}")

# ---------------------------------------------------------------------------
# CLI arguments -------------------------------------------------------------
# ---------------------------------------------------------------------------
parser = argparse.ArgumentParser(description="GPT-SoVITS HTTP API (v4 default)")
parser.add_argument("-c", "--tts_config", default="GPT_SoVITS/configs/tts_infer.yaml",
                    help="Path to tts_infer.yaml")
parser.add_argument("-a", "--bind_addr", default="127.0.0.1", help="Bind address")
parser.add_argument("-p", "--port", type=int, default=9880, help="Port")
args = parser.parse_args()

config_path = args.tts_config
bind_addr   = None if args.bind_addr == "None" else args.bind_addr
port        = args.port

# ---------------------------------------------------------------------------
# Load TTS ------------------------------------------------------------------
# ---------------------------------------------------------------------------
configs = TTS_Config(config_path)
if configs.version != "v4":
    configs.update_version("v4")
print(configs)

tts_pipeline = TTS(configs)

APP = FastAPI(title="GPT-SoVITS TTS API", version="4.0")

# ---------------------------------------------------------------------------
# Pydantic schema -----------------------------------------------------------
# ---------------------------------------------------------------------------
from pydantic import BaseModel, Field

class TTSRequest(BaseModel):
    text: str = Field(..., description="Text to synthesise")
    text_lang: str = Field(..., description="Language of the input text")
    ref_audio_path: str = Field(..., description="Reference wav path")
    aux_ref_audio_paths: Optional[List[str]] = None
    prompt_text: str = ""
    prompt_lang: str = ""
    top_k: int = 5
    top_p: float = 1.0
    temperature: float = 1.0
    text_split_method: str = "cut5"
    batch_size: int = 1
    batch_threshold: float = 0.75
    split_bucket: bool = True
    speed_factor: float = 1.0
    fragment_interval: float = 0.3
    seed: int = -1
    media_type: str = "wav"
    streaming_mode: bool = False
    parallel_infer: bool = True
    repetition_penalty: float = 1.35
    sample_steps: int = 32
    super_sampling: bool = False
    model_version: Optional[str] = Field(None, pattern="^v[1-4]$")  # Field, not Query: this is a body field

# ---------------------------------------------------------------------------
# Param validation ----------------------------------------------------------
# ---------------------------------------------------------------------------

def _validate(req: dict):
    for k in ("text", "text_lang", "ref_audio_path", "prompt_lang"):
        if not req.get(k):
            return JSONResponse(status_code=400, content={"message": f"{k} is required"})
    if req["media_type"] not in MEDIA_TYPES:
        return JSONResponse(status_code=400, content={"message": f"Unsupported media_type {req['media_type']}"})
    if req["text_split_method"] not in _cut_methods:
        return JSONResponse(status_code=400, content={"message": f"Unknown text_split_method {req['text_split_method']}"})
    return None

# ---------------------------------------------------------------------------
# Core inference ------------------------------------------------------------
# ---------------------------------------------------------------------------
async def _tts_infer(payload: dict):
    # Version hot-swap -------------------------------------------------------
    version = payload.pop("model_version", None)
    if version and version in SUPPORTED_VERSIONS and version != configs.version:
        try:
            tts_pipeline.init_vits_weights(tts_pipeline.configs.default_configs[version]["vits_weights_path"])
            tts_pipeline.init_t2s_weights(tts_pipeline.configs.default_configs[version]["t2s_weights_path"])
            configs.update_version(version)
        except Exception as e:
            return JSONResponse(status_code=400, content={"message": str(e)})

    err = _validate(payload)
    if err:
        return err

    streaming = payload.get("streaming_mode", False)
    media_type = payload.get("media_type", "wav")

    try:
        gen: Generator[Tuple[int, np.ndarray], None, None] = tts_pipeline.run(payload)

        if streaming:
            def _stream():
                for sr, arr in gen:
                    # pack each fragment independently, mirroring api_v2's streaming behaviour
                    yield _pack_audio(arr, sr, media_type)
            return StreamingResponse(_stream(), media_type=f"audio/{media_type}")

        sr, arr = next(gen)
        audio_bytes = _pack_audio(arr, sr, media_type)
        return Response(content=audio_bytes, media_type=f"audio/{media_type}")

    except NO_PROMPT_ERROR as e:
        return JSONResponse(status_code=400, content={"message": str(e)})
    except Exception as e:
        import traceback; traceback.print_exc()
        return JSONResponse(status_code=500, content={"message": "tts failed", "detail": str(e)})

# ---------------------------------------------------------------------------
# Routes --------------------------------------------------------------------
# ---------------------------------------------------------------------------
@APP.post("/tts")
async def tts_post(req: TTSRequest):
    return await _tts_infer(req.dict())

@APP.get("/tts")
async def tts_get(**kwargs):
    return await _tts_infer(kwargs)

@APP.get("/control")
async def control(command: str):
    import signal
    if command == "restart":
        os.execl(sys.executable, sys.executable, *sys.argv)
    elif command == "exit":
        os.kill(os.getpid(), signal.SIGTERM)
    return {"message": "ok"}

@APP.get("/set_gpt_weights")
async def set_gpt(weights_path: str):
    try:
        tts_pipeline.init_t2s_weights(weights_path)
        return "success"
    except Exception as e:
        return JSONResponse(status_code=400, content={"message": str(e)})

@APP.get("/set_sovits_weights")
async def set_sovits(weights_path: str):
    try:
        tts_pipeline.init_vits_weights(weights_path)
        return "success"
    except Exception as e:
        return JSONResponse(status_code=400, content={"message": str(e)})

# ---------------------------------------------------------------------------
# Entrypoint ----------------------------------------------------------------
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    uvicorn.run(APP, host=bind_addr or "0.0.0.0", port=port, workers=1)
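
Usage sketch for the script above, assuming it is saved as api_v4.py in the repo root and started with python api_v4.py -a 127.0.0.1 -p 9880; model_version exercises the hot-swap branch in _tts_infer (ref.wav and the prompt text are placeholders):

import requests

resp = requests.post("http://127.0.0.1:9880/tts", json={
    "text": "测试一下v4。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",    # placeholder reference audio
    "prompt_text": "参考音频文本。",  # placeholder transcript
    "prompt_lang": "zh",
    "model_version": "v4",
    "media_type": "wav",
})
with open("v4_test.wav", "wb") as f:
    f.write(resp.content)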

@lucasmen9527

lucasmen9527 commented Apr 25, 2025

Just run python api_v2.py; api_v2 has a detailed description of the endpoints. Note that with a v4-trained model, api.py will start, but calls will fail at: audio, _ = librosa.load(filename, int(hps.data.sampling_rate))


Running with api_v2 works: just edit GPT_SoVITS/configs/tts_infer.yaml to point at your custom model.


As shown in the (omitted) screenshots, the model loads successfully. Check the request docs and send requests normally; you can test with API tools such as Apipost, Apifox, or Postman.

@yangyuke001
Author

> So is your issue resolved? api_v2.py worked for you?

not yet

@yangyuke001
Author

> The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: version = configs.get("version", "v2").lower()
>
> In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change this to "v4", and then file a bug.

It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.

@YunZLu

YunZLu commented Apr 25, 2025

> It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.

From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000. If version is v4, shouldn't the sampling rate be correct?

@yangyuke001
Author

> It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.
>
> From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000. If version is v4, shouldn't the sampling rate be correct?

But I saw the author say the V4 sampling rate is 48k:

"(4) v4 fixes the metallic-artifact noise that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid a muffled sound (v3's native output is only 24k). The author considers v4 a drop-in replacement for v3; more testing is still needed."
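
One way to settle the sampling-rate question empirically is to inspect the header of a wav returned by the API; a minimal sketch with soundfile, where out.wav stands in for a file saved from a /tts response:

import soundfile as sf

info = sf.info("out.wav")
print(info.samplerate, info.channels, info.duration)
# Per the author's note, v4 should report 48000 (v3 reports 24000); a reading
# of 32000 would suggest the 32 kHz mel-stage rate leaked through to output.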
