How exactly do I use V4 in api.py or api_v2.py? #2306

Open
yangyuke001 opened this issue Apr 21, 2025 · 15 comments

@yangyuke001

Thanks for open-sourcing this great project [flowers]! How exactly do I use the v4 version through the API?

@dignome

dignome commented Apr 22, 2025

The information commented at the top of api_v2.py is valid. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running the webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model it should be ready to go in tts_infer.yaml.

WebAPI documentation

python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml

Launch arguments:

 -a  -  bind address, default "127.0.0.1"
 -p  -  bind port, default 9880
 -c  -  path to the TTS config file, default "GPT_SoVITS/configs/tts_infer.yaml"

Endpoints:

Inference

endpoint: /tts
GET:

http://127.0.0.1:9880/tts?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_lang=zh&ref_audio_path=archive_jingyuan_1.wav&prompt_lang=zh&prompt_text=我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可&text_split_method=cut5&batch_size=1&media_type=wav&streaming_mode=true

POST:
{
    "text": "",                    # str. (required) text to be synthesized
    "text_lang": "",               # str. (required) language of the text to be synthesized
    "ref_audio_path": "",          # str. (required) reference audio path
    "aux_ref_audio_paths": [],     # list. (optional) auxiliary reference audio paths for multi-speaker tone fusion
    "prompt_text": "",             # str. (optional) prompt text for the reference audio
    "prompt_lang": "",             # str. (required) language of the prompt text for the reference audio
    "top_k": 5,                    # int. top k sampling
    "top_p": 1,                    # float. top p sampling
    "temperature": 1,              # float. temperature for sampling
    "text_split_method": "cut0",   # str. text split method, see text_segmentation_method.py for details.
    "batch_size": 1,               # int. batch size for inference
    "batch_threshold": 0.75,       # float. threshold for batch splitting.
    "split_bucket": True,          # bool. whether to split the batch into multiple buckets.
    "speed_factor": 1.0,           # float. control the speed of the synthesized audio.
    "streaming_mode": False,       # bool. whether to return a streaming response.
    "seed": -1,                    # int. random seed for reproducibility.
    "parallel_infer": True,        # bool. whether to use parallel inference.
    "repetition_penalty": 1.35,    # float. repetition penalty for T2S model.
    "sample_steps": 32,            # int. number of sampling steps for VITS model V3.
    "super_sampling": False,       # bool. whether to use super-sampling for audio when using VITS model V3.
}

RESP:
Success: returns the wav audio stream directly, HTTP code 200
Failure: returns a json with the error message, HTTP code 400

Command control

endpoint: /control

command:
"restart": restart the service
"exit": stop the service

GET:

http://127.0.0.1:9880/control?command=restart

POST:
{
    "command": "restart"
}

RESP: none

Switch GPT model

endpoint: /set_gpt_weights

GET:

http://127.0.0.1:9880/set_gpt_weights?weights_path=GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt

RESP:
Success: returns "success", HTTP code 200
Failure: returns a json with the error message, HTTP code 400

Switch SoVITS model

endpoint: /set_sovits_weights

GET:

http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth

RESP:
Success: returns "success", HTTP code 200
Failure: returns a json with the error message, HTTP code 400
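
For example, a minimal Python client for the POST route above might look like this (a sketch; ref.wav and the prompt text are placeholders for your own reference audio and its transcript):

import requests

payload = {
    "text": "先帝创业未半而中道崩殂。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",      # placeholder reference audio
    "prompt_text": "参考音频的文本。",   # placeholder transcript of ref.wav
    "prompt_lang": "zh",
    "text_split_method": "cut5",
    "media_type": "wav",
    "streaming_mode": False,
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload)
if resp.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(resp.content)
else:
    print(resp.status_code, resp.json())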

@yangyuke001
Author

@dignome Thanks for the reply. I can see that tts_infer.yaml has been updated with the v4 model, but the audio synthesized by calling api_v2.py directly sounds very strange, as if the models are mismatched. My tts_infer.yaml is as follows:
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
v1:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
  version: v1
  vits_weights_path: GPT_SoVITS/pretrained_models/s2G488k.pth
v2:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
  version: v2
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s2G2333k.pth
v3:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v3
  vits_weights_path: GPT_SoVITS/pretrained_models/s2Gv3.pth
v4:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
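
A quick way to check which version TTS_Config actually resolves from this yaml (a minimal sketch, run from the repo root, assuming TTS_Config exposes the attributes used later in this thread):

from GPT_SoVITS.TTS_infer_pack.TTS import TTS_Config

configs = TTS_Config("GPT_SoVITS/configs/tts_infer.yaml")
print(configs.version)            # "v4" expected if the custom section is honoured
print(configs.vits_weights_path)  # should end in gsv-v4-pretrained/s2Gv4.pth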

@dignome

dignome commented Apr 23, 2025

If there really is a difference, you could most likely show it by setting a fixed/static seed value and matching the other parameters when using both api_v2.py and inference_webui_fast.py. They should produce similar results.

For best speaker reproduction you should finetune a v4 model on a dataset containing at least 10 minutes of audio samples of that speaker using webui.py, then make sure those models are present in the config specified to api_v2.py via -c <path/to/your/config.yaml>.
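
One way to check at least the API side of that comparison is to send the same /tts request twice with a fixed seed and compare the returned bytes; a sketch (ref.wav and the prompt text are placeholders), which should match if the pipeline is deterministic for a fixed seed:

import requests

payload = {
    "text": "测试文本。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",      # placeholder reference audio
    "prompt_text": "参考音频的文本。",   # placeholder transcript
    "prompt_lang": "zh",
    "seed": 42,                       # fixed seed, per the suggestion above
    "streaming_mode": False,
}

a = requests.post("http://127.0.0.1:9880/tts", json=payload).content
b = requests.post("http://127.0.0.1:9880/tts", json=payload).content
print("identical output:", a == b)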

@inktree

inktree commented Apr 23, 2025

It's normal that it sounds strange: although v4 shares v3's architecture, their sampling rates differ, so calling the API with the same parameters as for v3 is bound to go wrong. You can change the relevant parts yourself.

The handling logic I've changed for myself currently looks like this:

# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1024,
        "win_size": 1024,
        "hop_size": 256,
        "num_mels": 100,
        "sampling_rate": 24000,
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)

# --- Added V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1280,
        "win_size": 1280,
        "hop_size": 320,
        "num_mels": 100,
        "sampling_rate": 32000,  # V4 uses a 32 kHz mel
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)

    elif version in {"v3", "v4"}:  # v3 or v4?
        logger.info("Using the V3/V4 decode path (vq_model.decode_encp + CFM/vocoder)...")
        # --- V3/V4 decoding logic ---
        # 1. Pick the target sampling rate and mel function
        if model_version == "v4":
            tgt_sr = 32000
            current_mel_fn = mel_fn_v4
            logger.info(f"V4 model: using {tgt_sr} Hz sampling rate and the V4 mel function.")
        else:  # V3
            tgt_sr = 24000
            current_mel_fn = mel_fn
            logger.info(f"V3 model: using {tgt_sr} Hz sampling rate and the V3 mel function.")

The problem, though, is that pyopenjtalk is broken, so Japanese requests still end up throwing errors. Quite a headache.
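
(For reference, a hypothetical way to fold the two mel configurations above into one table keyed by version, rather than keeping two near-identical lambdas; the import path for mel_spectrogram_torch is an assumption based on where the repo defines it:)

from functools import partial
from module.mel_processing import mel_spectrogram_torch  # assumed import path

MEL_PARAMS = {
    "v3": {"n_fft": 1024, "win_size": 1024, "hop_size": 256, "num_mels": 100,
           "sampling_rate": 24000, "fmin": 0, "fmax": None, "center": False},
    "v4": {"n_fft": 1280, "win_size": 1280, "hop_size": 320, "num_mels": 100,
           "sampling_rate": 32000, "fmin": 0, "fmax": None, "center": False},
}

def get_mel_fn(version: str):
    # Bind the parameters for "v3" or "v4" onto the shared mel helper.
    return partial(mel_spectrogram_torch, **MEL_PARAMS[version])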

@xy3xy3

xy3xy3 commented Apr 23, 2025

Is there a good solution for this yet?

@wangzai23333

Even after modifying tts_infer.yaml, running api_v2.py makes version automatically fall back to v2; the generated audio presumably has the wrong sampling rate and comes out garbled.

@yangyuke001
Author

yangyuke001 commented Apr 23, 2025

@wangzai23333 @inktree Yes. I also tried changing the relevant parts of tts_infer.yaml and api_v2.py myself, but never got good output, so I'd like to ask 花儿大佬 (the maintainer) to polish up a version of api_v2.py. :)

@dignome

dignome commented Apr 24, 2025

So is your issue resolved? api_v2.py worked for you?

@YunZLu

YunZLu commented Apr 24, 2025

> Even after modifying tts_infer.yaml, running api_v2.py makes version automatically fall back to v2; the generated audio presumably has the wrong sampling rate and comes out garbled.

The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py:

version = configs.get("version", "v2").lower()

In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change this to "v4", and then file a bug.
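
A slightly more general patch than hard-coding "v4" would be to prefer the custom section before falling back to the root key; a sketch against the yaml layout shown earlier in the thread:

# In TTS_Config: look inside the custom section first, then the root level.
version = (
    configs.get("custom", {}).get("version")
    or configs.get("version", "v2")
).lower()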

@bensonbs

bensonbs commented Apr 25, 2025

api_v4.py made with ChatGPT o3

'''api_v4.py – A lightweight FastAPI wrapper around GPT-SoVITS that defaults to the V4 acoustic stack but remains backward-compatible with v1-v3.

Key fixes (2025-04-24):
• Returned audio is now packed into valid WAV/RAW/OGG/AAC bytes in both streaming and non-stream modes (previous numpy.ndarray bug fixed).
• Shared helper _pack_audio replicates api_v2 logic using soundfile / ffmpeg only when needed.
'''

import os
import sys
import io
import argparse
import subprocess
from typing import Generator, Optional, List, Tuple

import numpy as np
import soundfile as sf
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse, JSONResponse
import uvicorn

# ---------------------------------------------------------------------------
# Project paths -------------------------------------------------------------
# ---------------------------------------------------------------------------
now_dir = os.getcwd()
sys.path.extend([now_dir, f"{now_dir}/GPT_SoVITS"])

# ---------------------------------------------------------------------------
# Internal imports ----------------------------------------------------------
# ---------------------------------------------------------------------------
from GPT_SoVITS.TTS_infer_pack.TTS import TTS, TTS_Config, NO_PROMPT_ERROR
from GPT_SoVITS.TTS_infer_pack.text_segmentation_method import get_method_names as _get_cut_methods
from tools.i18n.i18n import I18nAuto

i18n = I18nAuto()
_cut_methods = _get_cut_methods()

MEDIA_TYPES = {"wav", "raw", "ogg", "aac"}
SUPPORTED_VERSIONS = {"v1", "v2", "v3", "v4"}

# ---------------------------------------------------------------------------
# Audio utils ---------------------------------------------------------------
# ---------------------------------------------------------------------------

def _ffmpeg_pack(data: np.ndarray, sr: int, codec: str) -> bytes:
    """Pipe raw PCM through ffmpeg to AAC/OGG"""
    cmd = [
        "ffmpeg",
        "-f", "s16le", "-ar", str(sr), "-ac", "1", "-i", "pipe:0",
        "-c:a", codec, "-b:a", "192k", "-vn", "-f", "adts" if codec == "aac" else "ogg", "pipe:1",
    ]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate(input=data.tobytes())
    return out


def _pack_audio(data: np.ndarray, sr: int, media_type: str) -> bytes:
    if media_type == "wav":
        buf = io.BytesIO()
        sf.write(buf, data, sr, format="WAV")
        return buf.getvalue()
    if media_type == "raw":
        return data.tobytes()
    if media_type == "aac":
        return _ffmpeg_pack(data, sr, "aac")
    if media_type == "ogg":
        return _ffmpeg_pack(data, sr, "libvorbis")
    raise ValueError(f"Unsupported media_type {media_type}")

# ---------------------------------------------------------------------------
# CLI arguments -------------------------------------------------------------
# ---------------------------------------------------------------------------
parser = argparse.ArgumentParser(description="GPT-SoVITS HTTP API (v4 default)")
parser.add_argument("-c", "--tts_config", default="GPT_SoVITS/configs/tts_infer.yaml",
                    help="Path to tts_infer.yaml")
parser.add_argument("-a", "--bind_addr", default="127.0.0.1", help="Bind address")
parser.add_argument("-p", "--port", type=int, default=9880, help="Port")
args = parser.parse_args()

config_path = args.tts_config
bind_addr   = None if args.bind_addr == "None" else args.bind_addr
port        = args.port

# ---------------------------------------------------------------------------
# Load TTS ------------------------------------------------------------------
# ---------------------------------------------------------------------------
configs = TTS_Config(config_path)
if configs.version != "v4":
    configs.update_version("v4")
print(configs)

tts_pipeline = TTS(configs)

APP = FastAPI(title="GPT-SoVITS TTS API", version="4.0")

# ---------------------------------------------------------------------------
# Pydantic schema -----------------------------------------------------------
# ---------------------------------------------------------------------------
from pydantic import BaseModel, Field

class TTSRequest(BaseModel):
    text: str = Field(..., description="Text to synthesise")
    text_lang: str = Field(..., description="Language of the input text")
    ref_audio_path: str = Field(..., description="Reference wav path")
    aux_ref_audio_paths: Optional[List[str]] = None
    prompt_text: str = ""
    prompt_lang: str = ""
    top_k: int = 5
    top_p: float = 1.0
    temperature: float = 1.0
    text_split_method: str = "cut5"
    batch_size: int = 1
    batch_threshold: float = 0.75
    split_bucket: bool = True
    speed_factor: float = 1.0
    fragment_interval: float = 0.3
    seed: int = -1
    media_type: str = "wav"
    streaming_mode: bool = False
    parallel_infer: bool = True
    repetition_penalty: float = 1.35
    sample_steps: int = 32
    super_sampling: bool = False
    model_version: Optional[str] = Field(None, pattern="^v[1-4]$")  # Field, not Query: this is a body field

# ---------------------------------------------------------------------------
# Param validation ----------------------------------------------------------
# ---------------------------------------------------------------------------

def _validate(req: dict):
    for k in ("text", "text_lang", "ref_audio_path", "prompt_lang"):
        if not req.get(k):
            return JSONResponse(status_code=400, content={"message": f"{k} is required"})
    if req["media_type"] not in MEDIA_TYPES:
        return JSONResponse(status_code=400, content={"message": f"Unsupported media_type {req['media_type']}"})
    if req["text_split_method"] not in _cut_methods:
        return JSONResponse(status_code=400, content={"message": f"Unknown text_split_method {req['text_split_method']}"})
    return None

# ---------------------------------------------------------------------------
# Core inference ------------------------------------------------------------
# ---------------------------------------------------------------------------
async def _tts_infer(payload: dict):
    # Version hot-swap -------------------------------------------------------
    version = payload.pop("model_version", None)
    if version and version in SUPPORTED_VERSIONS and version != configs.version:
        try:
            tts_pipeline.init_vits_weights(tts_pipeline.configs.default_configs[version]["vits_weights_path"])
            tts_pipeline.init_t2s_weights(tts_pipeline.configs.default_configs[version]["t2s_weights_path"])
            configs.update_version(version)
        except Exception as e:
            return JSONResponse(status_code=400, content={"message": str(e)})

    err = _validate(payload)
    if err:
        return err

    streaming = payload.get("streaming_mode", False)
    media_type = payload.get("media_type", "wav")

    try:
        gen: Generator[Tuple[int, np.ndarray], None, None] = tts_pipeline.run(payload)

        if streaming:
            def _stream():
                for sr, arr in gen:
                    # pack each fragment independently, mirroring api_v2's streaming behaviour
                    yield _pack_audio(arr, sr, media_type)
            return StreamingResponse(_stream(), media_type=f"audio/{media_type}")

        sr, arr = next(gen)
        audio_bytes = _pack_audio(arr, sr, media_type)
        return Response(content=audio_bytes, media_type=f"audio/{media_type}")

    except NO_PROMPT_ERROR as e:
        return JSONResponse(status_code=400, content={"message": str(e)})
    except Exception as e:
        import traceback; traceback.print_exc()
        return JSONResponse(status_code=500, content={"message": "tts failed", "detail": str(e)})

# ---------------------------------------------------------------------------
# Routes --------------------------------------------------------------------
# ---------------------------------------------------------------------------
@APP.post("/tts")
async def tts_post(req: TTSRequest):
    return await _tts_infer(req.dict())

@APP.get("/tts")
async def tts_get(**kwargs):
    return await _tts_infer(kwargs)

@APP.get("/control")
async def control(command: str):
    import signal
    if command == "restart":
        os.execl(sys.executable, sys.executable, *sys.argv)
    elif command == "exit":
        os.kill(os.getpid(), signal.SIGTERM)
    return {"message": "ok"}

@APP.get("/set_gpt_weights")
async def set_gpt(weights_path: str):
    try:
        tts_pipeline.init_t2s_weights(weights_path)
        return "success"
    except Exception as e:
        return JSONResponse(status_code=400, content={"message": str(e)})

@APP.get("/set_sovits_weights")
async def set_sovits(weights_path: str):
    try:
        tts_pipeline.init_vits_weights(weights_path)
        return "success"
    except Exception as e:
        return JSONResponse(status_code=400, content={"message": str(e)})

# ---------------------------------------------------------------------------
# Entrypoint ----------------------------------------------------------------
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    uvicorn.run(APP, host=bind_addr or "0.0.0.0", port=port, workers=1)
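
Usage sketch for the script above, assuming it is saved as api_v4.py in the repo root and started with python api_v4.py -a 127.0.0.1 -p 9880; model_version exercises the hot-swap branch in _tts_infer (ref.wav and the prompt text are placeholders):

import requests

resp = requests.post("http://127.0.0.1:9880/tts", json={
    "text": "测试一下v4。",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",    # placeholder reference audio
    "prompt_text": "参考音频文本。",  # placeholder transcript
    "prompt_lang": "zh",
    "model_version": "v4",
    "media_type": "wav",
})
with open("v4_test.wav", "wb") as f:
    f.write(resp.content)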

@lucasmen9527

lucasmen9527 commented Apr 25, 2025

Just run python api_v2.py; api_v2 has a detailed description of the endpoints. Note that with a v4-trained model, api.py will start, but calls will fail at: audio, _ = librosa.load(filename, int(hps.data.sampling_rate))


Running with api_v2 works: just edit GPT_SoVITS/configs/tts_infer.yaml to point at your custom model.


As shown in the (omitted) screenshots, the model loads successfully. Check the request docs and send requests normally; you can test with API tools such as Apipost, Apifox, or Postman.

@yangyuke001
Author

> So is your issue resolved? api_v2.py worked for you?

not yet

@yangyuke001
Author

> The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: version = configs.get("version", "v2").lower()
>
> In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change this to "v4", and then file a bug.

It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.

@YunZLu

YunZLu commented Apr 25, 2025

> It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.

From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000. If version is v4, shouldn't the sampling rate be correct?

@yangyuke001
Author

> It seems api_v2.py can't change the sampling_rate; the audio still comes out very strange.
>
> From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000. If version is v4, shouldn't the sampling rate be correct?

But I saw the author say the V4 sampling rate is 48k:

"(4) v4 fixes the metallic-artifact noise that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid a muffled sound (v3's native output is only 24k). The author considers v4 a drop-in replacement for v3; more testing is still needed."
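
One way to settle the sampling-rate question empirically is to inspect the header of a wav returned by the API; a minimal sketch with soundfile, where out.wav stands in for a file saved from a /tts response:

import soundfile as sf

info = sf.info("out.wav")
print(info.samplerate, info.channels, info.duration)
# Per the author's note, v4 should report 48000 (v3 reports 24000); a reading
# of 32000 would suggest the 32 kHz mel-stage rate leaked through to output.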
