How exactly do I use V4 in api.py or api_v2.py? #2306
Comments
The information commented at the top of api_v2.py is valid. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running the webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model, it should be ready to go in tts_infer.yaml.

From the WebAPI documentation, start the server with `python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml`. Endpoints:

- Inference: `/tts` (POST)
- Command control: `/control` with a `command` parameter, e.g. GET http://127.0.0.1:9880/control?command=restart (no response body)
- Switch GPT model: `/set_gpt_weights` (GET)
- Switch SoVITS model: `/set_sovits_weights`, e.g. GET http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth
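For reference, a minimal client call against `/tts` might look like the sketch below; the field names follow the api_v2.py request schema, and the texts and reference-audio path are placeholders:

```python
# Hypothetical example call to the /tts endpoint of api_v2.py.
# The texts and ref_audio_path below are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:9880/tts",
    json={
        "text": "要合成的文本",            # text to synthesize
        "text_lang": "zh",
        "ref_audio_path": "ref.wav",       # placeholder reference audio
        "prompt_text": "参考音频的文本",    # transcript of the reference audio
        "prompt_lang": "zh",
        "media_type": "wav",
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)
```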
@dignome Thanks for the reply. I can see that tts_infer.yaml has been updated with the v4 model, but calling api_v2.py directly produces very strange-sounding synthesized audio, as if the model isn't matched correctly. My tts_infer.yaml is as follows:
If there really is a difference, you could most likely show it by setting a fixed/static seed value and matching the other parameters when using both api_v2.py and inference_webui_fast.py; they should produce similar results. For best speaker reproduction, finetune a v4 model against a dataset containing at least 10 minutes of audio samples of that speaker using webui.py, then make sure those models are present in the config you pass to api_v2.py with -c <path/to/your/config.yaml>.
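A minimal sketch of that fixed-seed comparison, assuming the api_v2.py /tts schema and placeholder texts/paths:

```python
# Sketch of a fixed-seed reproducibility check (texts/paths are placeholders).
import requests

payload = {
    "text": "同一句测试文本",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",
    "prompt_text": "参考音频的文本",
    "prompt_lang": "zh",
    "seed": 42,           # fixed seed so runs are comparable
    "top_k": 5,           # match the sampling params used in the webui
    "top_p": 1.0,
    "temperature": 1.0,
}
audio = requests.post("http://127.0.0.1:9880/tts", json=payload).content
with open("api_fixed_seed.wav", "wb") as f:
    f.write(audio)  # compare against the webui output for the same seed/params
```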
Sounding strange is expected: although v4 shares v3's architecture, its sampling rate is different, so calling the API with the same parameters as v3 is bound to go wrong. You can change the relevant parts yourself. My current handling logic defines a separate V4 mel function (mel_fn_v4) alongside the V3 one (mel_fn); see the completed sketch below.
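A completed version of those two definitions, as a sketch: the n_fft/hop/window/sampling-rate values below are taken from my reading of the upstream GPT_SoVITS/TTS_infer_pack/TTS.py and should be verified against your checkout.

```python
# Sketch: separate mel functions for v3 (24 kHz features) and v4 (32 kHz
# features). Parameter values mirror upstream TTS.py; verify before use.
# Assumes GPT_SoVITS/ is on sys.path, as in api_v2.py.
from module.mel_processing import mel_spectrogram_torch

# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(
    x,
    n_fft=1024,
    num_mels=100,
    sampling_rate=24000,
    hop_size=256,
    win_size=1024,
    fmin=0,
    fmax=None,
    center=False,
)

# --- added V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(
    x,
    n_fft=1280,
    num_mels=100,
    sampling_rate=32000,
    hop_size=320,
    win_size=1280,
    fmin=0,
    fmax=None,
    center=False,
)
```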
But the problem is that pyopenjtalk is broken, so Japanese synthesis still ends up throwing errors. It's a headache.
Is there a good solution for this yet?
It seems that even after modifying tts_infer.yaml, running api_v2.py makes version automatically fall back to v2, and the generated audio presumably has the wrong sampling rate and comes out broken.
@wangzai23333 @inktree Yes. I also tried changing the relevant parts of tts_infer.yaml and api_v2.py myself, without getting good output, so I'd like to ask the maintainer to publish a polished version of api_v2.py. :)
So is your issue resolved? Did api_v2.py work for you?
In GPT_SoVITS/TTS_infer_pack/TTS.py: in the current tts_infer.yaml, version is not at the root level, so the lookup misses it and the default value v2 is used. The simplest workaround is to change the default here to v4, and file a bug.
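A minimal illustration of that failure mode, assuming the shipped tts_infer.yaml nests its settings under a section such as `custom` (check your file; the key name is the assumption here):

```python
# Sketch: why the version lookup falls back to "v2". The root-level .get()
# misses the nested key, so the default wins.
import yaml

with open("GPT_SoVITS/configs/tts_infer.yaml") as f:
    cfg = yaml.safe_load(f)

version = cfg.get("version", "v2")                     # root-level lookup -> "v2"
nested_version = cfg.get("custom", {}).get("version")  # where the value actually sits
print(version, nested_version)
```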
```python
'''api_v4.py – A lightweight FastAPI wrapper around GPT-SoVITS that defaults to the V4 acoustic stack but remains backward-compatible with v1-v3.
Key fixes (2025-04-24):
• Returned audio is now packed into valid WAV/RAW/OGG/AAC bytes in both streaming and non-stream modes (previous numpy.ndarray bug fixed).
• Shared helper _pack_audio replicates api_v2 logic using soundfile / ffmpeg only when needed.
'''
import os
import sys
import io
import argparse
import subprocess
from typing import Generator, Optional, List, Tuple
import numpy as np
import soundfile as sf
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse, JSONResponse
import uvicorn
# ---------------------------------------------------------------------------
# Project paths -------------------------------------------------------------
# ---------------------------------------------------------------------------
now_dir = os.getcwd()
sys.path.extend([now_dir, f"{now_dir}/GPT_SoVITS"])
# ---------------------------------------------------------------------------
# Internal imports ----------------------------------------------------------
# ---------------------------------------------------------------------------
from GPT_SoVITS.TTS_infer_pack.TTS import TTS, TTS_Config, NO_PROMPT_ERROR
from GPT_SoVITS.TTS_infer_pack.text_segmentation_method import get_method_names as _get_cut_methods
from tools.i18n.i18n import I18nAuto
i18n = I18nAuto()
_cut_methods = _get_cut_methods()
MEDIA_TYPES = {"wav", "raw", "ogg", "aac"}
SUPPORTED_VERSIONS = {"v1", "v2", "v3", "v4"}
# ---------------------------------------------------------------------------
# Audio utils ---------------------------------------------------------------
# ---------------------------------------------------------------------------
def _ffmpeg_pack(data: np.ndarray, sr: int, codec: str) -> bytes:
    """Pipe raw mono int16 PCM through ffmpeg to AAC/OGG."""
cmd = [
"ffmpeg",
"-f", "s16le", "-ar", str(sr), "-ac", "1", "-i", "pipe:0",
"-c:a", codec, "-b:a", "192k", "-vn", "-f", "adts" if codec == "aac" else "ogg", "pipe:1",
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, _ = proc.communicate(input=data.tobytes())
return out
def _pack_audio(data: np.ndarray, sr: int, media_type: str) -> bytes:
if media_type == "wav":
buf = io.BytesIO()
sf.write(buf, data, sr, format="WAV")
return buf.getvalue()
if media_type == "raw":
return data.tobytes()
if media_type == "aac":
return _ffmpeg_pack(data, sr, "aac")
if media_type == "ogg":
return _ffmpeg_pack(data, sr, "libvorbis")
raise ValueError(f"Unsupported media_type {media_type}")
# ---------------------------------------------------------------------------
# CLI arguments -------------------------------------------------------------
# ---------------------------------------------------------------------------
parser = argparse.ArgumentParser(description="GPT-SoVITS HTTP API (v4 default)")
parser.add_argument("-c", "--tts_config", default="GPT_SoVITS/configs/tts_infer.yaml",
help="Path to tts_infer.yaml")
parser.add_argument("-a", "--bind_addr", default="127.0.0.1", help="Bind address")
parser.add_argument("-p", "--port", type=int, default=9880, help="Port")
args = parser.parse_args()
config_path = args.tts_config
bind_addr = None if args.bind_addr == "None" else args.bind_addr
port = args.port
# ---------------------------------------------------------------------------
# Load TTS ------------------------------------------------------------------
# ---------------------------------------------------------------------------
configs = TTS_Config(config_path)
if configs.version != "v4":
configs.update_version("v4")
print(configs)
tts_pipeline = TTS(configs)
APP = FastAPI(title="GPT-SoVITS TTS API", version="4.0")
# ---------------------------------------------------------------------------
# Pydantic schema -----------------------------------------------------------
# ---------------------------------------------------------------------------
from pydantic import BaseModel, Field
class TTSRequest(BaseModel):
text: str = Field(..., description="Text to synthesise")
text_lang: str = Field(..., description="Language of the input text")
ref_audio_path: str = Field(..., description="Reference wav path")
aux_ref_audio_paths: Optional[List[str]] = None
prompt_text: str = ""
prompt_lang: str = ""
top_k: int = 5
top_p: float = 1.0
temperature: float = 1.0
text_split_method: str = "cut5"
batch_size: int = 1
batch_threshold: float = 0.75
split_bucket: bool = True
speed_factor: float = 1.0
fragment_interval: float = 0.3
seed: int = -1
media_type: str = "wav"
streaming_mode: bool = False
parallel_infer: bool = True
repetition_penalty: float = 1.35
sample_steps: int = 32
super_sampling: bool = False
    model_version: Optional[str] = Field(None, pattern="^v[1-4]$")  # Query() belongs in route signatures, not Pydantic models
# ---------------------------------------------------------------------------
# Param validation ----------------------------------------------------------
# ---------------------------------------------------------------------------
def _validate(req: dict):
    for k in ("text", "text_lang", "ref_audio_path", "prompt_lang"):
        if not req.get(k):
            return JSONResponse(status_code=400, content={"message": f"{k} is required"})
    # Use .get() with defaults so GET requests that omit optional fields
    # don't raise KeyError.
    media_type = req.get("media_type", "wav")
    if media_type not in MEDIA_TYPES:
        return JSONResponse(status_code=400, content={"message": f"Unsupported media_type {media_type}"})
    text_split_method = req.get("text_split_method", "cut5")
    if text_split_method not in _cut_methods:
        return JSONResponse(status_code=400, content={"message": f"Unknown text_split_method {text_split_method}"})
    return None
# ---------------------------------------------------------------------------
# Core inference ------------------------------------------------------------
# ---------------------------------------------------------------------------
async def _tts_infer(payload: dict):
# Version hot-swap -------------------------------------------------------
version = payload.pop("model_version", None)
if version and version in SUPPORTED_VERSIONS and version != configs.version:
try:
tts_pipeline.init_vits_weights(tts_pipeline.configs.default_configs[version]["vits_weights_path"])
tts_pipeline.init_t2s_weights(tts_pipeline.configs.default_configs[version]["t2s_weights_path"])
configs.update_version(version)
except Exception as e:
return JSONResponse(status_code=400, content={"message": str(e)})
err = _validate(payload)
if err:
return err
streaming = payload.get("streaming_mode", False)
media_type = payload.get("media_type", "wav")
try:
gen: Generator[Tuple[int, np.ndarray], None, None] = tts_pipeline.run(payload)
        if streaming:
            def _stream():
                # NOTE: each chunk is packed independently; for "wav" this means
                # every chunk carries its own header. api_v2.py instead sends a
                # single WAV header followed by raw PCM chunks.
                for sr, arr in gen:
                    yield _pack_audio(arr, sr, media_type)
            return StreamingResponse(_stream(), media_type=f"audio/{media_type}")
sr, arr = next(gen)
audio_bytes = _pack_audio(arr, sr, media_type)
return Response(content=audio_bytes, media_type=f"audio/{media_type}")
except NO_PROMPT_ERROR as e:
return JSONResponse(status_code=400, content={"message": str(e)})
except Exception as e:
import traceback; traceback.print_exc()
return JSONResponse(status_code=500, content={"message": "tts failed", "detail": str(e)})
# ---------------------------------------------------------------------------
# Routes --------------------------------------------------------------------
# ---------------------------------------------------------------------------
@APP.post("/tts")
async def tts_post(req: TTSRequest):
return await _tts_infer(req.dict())
@APP.get("/tts")
async def tts_get(**kwargs):
return await _tts_infer(kwargs)
@APP.get("/control")
async def control(command: str):
import signal
if command == "restart":
os.execl(sys.executable, sys.executable, *sys.argv)
elif command == "exit":
os.kill(os.getpid(), signal.SIGTERM)
return {"message": "ok"}
@APP.get("/set_gpt_weights")
async def set_gpt(weights_path: str):
try:
tts_pipeline.init_t2s_weights(weights_path)
return "success"
except Exception as e:
return JSONResponse(status_code=400, content={"message": str(e)})
@APP.get("/set_sovits_weights")
async def set_sovits(weights_path: str):
try:
tts_pipeline.init_vits_weights(weights_path)
return "success"
except Exception as e:
return JSONResponse(status_code=400, content={"message": str(e)})
# ---------------------------------------------------------------------------
# Entrypoint ----------------------------------------------------------------
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    uvicorn.run(APP, host=bind_addr or "0.0.0.0", port=port, workers=1)
```
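Run it with the same flags as api_v2.py, e.g. `python api_v4.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml`.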
Not yet.
api_v2.py doesn't seem to let you change the sampling_rate; the audio still comes out sounding very strange.
Looking at the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py, the v4 sampling rate is already set to 32000 there, so if version is v4 the sampling rate should be correct, shouldn't it?
But I see the author saying the v4 sampling rate is 48k: "(4) v4 fixes the metallic-artifact problem that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid muffled sound (whereas v3 natively outputs only 24k). The author considers v4 a drop-in replacement for v3, though it needs more testing."
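For what it's worth, the two numbers can both be right if v4's mel features are computed at 32 kHz while its vocoder renders 48 kHz output. Either way, the robust pattern is to write audio with whatever sample rate the pipeline reports rather than hard-coding one. A sketch, assuming a TTS instance set up as in api_v2.py (parameter names follow its request schema; texts/paths are placeholders):

```python
# Sketch: trust the sample rate reported by the pipeline.
# tts_pipeline is assumed to be a TTS instance built as in api_v2.py.
import soundfile as sf

params = {
    "text": "测试文本",
    "text_lang": "zh",
    "ref_audio_path": "ref.wav",     # placeholder
    "prompt_text": "参考音频的文本",   # placeholder
    "prompt_lang": "zh",
    "return_fragment": False,
}

for sr, audio in tts_pipeline.run(params):
    print("reported sample rate:", sr)  # expect 48000 for native v4 output
    sf.write("check.wav", audio, sr)
```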
Thanks for open-sourcing this great project [flowers]. How exactly is the v4 version used through the API?