[Bug] Investigate PyGILState_Release issue for client #300
Hi @cretz. I've come across a similar problem. Here is a simple script to reproduce the bug:

```python
import asyncio
import logging

from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import (LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter)


async def connect_client(host):
    temporal_client: Client = await Client.connect(
        host,
        namespace='default',
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level='DEBUG',
                        other_level='DEBUG',
                    ),
                ),
            ),
        ),
    )
    try:
        await temporal_client.start_workflow('PackagerWorkflow', 'test', id='Pack',
                                             task_queue='default_queue')
    except WorkflowAlreadyStartedError:
        pass


def run():
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info('start app')
    asyncio.run(connect_client('localhost:7233'))
    logging.info('finish app')


if __name__ == "__main__":
    run()
```

Log from gdb:
Full stack trace of the failed thread: https://gist.github.com/TiunovNN/84c253d62f03f11270a49facee6c7b52

I suppose the root cause is that Python does not call the destructor immediately (as noted in the docs). It would be better if there were an explicit way to close the client.
I am struggling to replicate. There is a slight bug where
I don't think this would solve it if the issue is Tokio runtime cleanup. A better solution may be something like what hydro-project/hydro#699 did (linked from hydro-project/hydro#619, which linked here as having a common issue). But ideally we could get a reliable replication first so we can confirm it is fixed once we make the change. I am struggling to get this currently.
My mistake. I forgot to mention that this bug does not appear on every run, only rarely, so you have to rerun the script several times:

```
$ gdb python
(gdb) set pagination off
(gdb) break _exit
(gdb) command
> run
> end
(gdb) run script.py
```

And wait for several minutes.
The bug appears with the following configurations:
Thanks! We will increase the priority on this.
Please take into account that the client destructor might be called even during execution (not only at exit), whenever there are no references left to the object. I would recommend using a context manager as the appropriate solution in Python.
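For illustration, a minimal sketch of the shape being suggested, assuming a hypothetical explicit close (the current `Client` API does not expose one, as noted in the next comment):

```python
from contextlib import asynccontextmanager

from temporalio.client import Client


@asynccontextmanager
async def connected_client(target: str):
    """Scope client usage to a block. Purely illustrative: there is currently
    no explicit close on the SDK client, so the finally block is a no-op."""
    client = await Client.connect(target)
    try:
        yield client
    finally:
        # Hypothetical: if the SDK exposed an explicit shutdown, it would be
        # called here, e.g. `await client.close()` (not a real API today).
        pass
```

Usage would then be `async with connected_client("localhost:7233") as client: ...`, giving a deterministic point at which the underlying resources could be released.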
This is not just about the client destructor, this is about when it is dropped on the Rust side. If the client is in use in any way (Python or Rust), it is not closed/disconnected. We can't just use a context manager because the client is passed around (e.g. to a worker) and may only have Rust references retained. I think this error is specific to the Python-Rust bridge and PyO3.
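To illustrate the point about the client being passed around, a minimal sketch (using the standard `temporalio.worker.Worker`; the activity and task queue names are made up) where a context manager on the caller's scope would not bound the client's actual lifetime:

```python
import asyncio

from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def say_hello(name: str) -> str:
    return f"Hello, {name}!"


async def main() -> None:
    client = await Client.connect("localhost:7233")
    # The client handle is now shared with the worker (including Rust-side
    # references); "closing" it when this function's scope ends would break
    # the worker while it is still polling.
    worker = Worker(client, task_queue="my-task-queue", activities=[say_hello])
    await worker.run()


asyncio.run(main())
```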
Update here: I am still struggling to replicate. I suspect this is actually a Tokio runtime lifetime issue and not something specific to the client, but we are still trying to replicate reliably (running the script under gdb for me, at least on 3.10, just continually works forever).
I can replicate a similar issue by changing the code to just run a task in the background and then completing the app before waiting on it to complete, e.g.:

```python
import asyncio
import logging
import uuid

from temporalio.api.workflowservice.v1 import StartWorkflowExecutionRequest
from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter


async def connect_client(host):
    client = await Client.connect(
        host,
        namespace="default",
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level="DEBUG",
                        other_level="DEBUG",
                    ),
                ),
            ),
        ),
    )
    req = StartWorkflowExecutionRequest(
        namespace="default",
        workflow_id=f"my-workflow-id-{uuid.uuid4()}",
        identity="my-identity",
        request_id=str(uuid.uuid4()),
    )
    req.workflow_type.name = "my-workflow"
    req.task_queue.name = "my-task-queue"
    asyncio.create_task(client.workflow_service.start_workflow_execution(req))


def run():
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info("start app")
    asyncio.run(connect_client("localhost:7233"))
    logging.info("finish app")


if __name__ == "__main__":
    run()
```

This will give:
My current guess is this is because pyo3-asyncio assumes
Unfortunately https://github.com/awestlake87/pyo3-asyncio is mostly abandoned, https://github.com/PyO3/pyo3-async-runtimes is just a simple fork, and built-in PyO3 asyncio with Rust futures is still being worked on. So we would probably just end up vendoring what we haven't already from pyo3-asyncio (we already copy and slightly alter some code to use the current Tokio runtime instead of one big static global).
@cretz
Assuming our guess at the issue is correct, fixing this is daunting because it probably requires rewriting/vendoring a Rust library (pyo3-asyncio). Can you confirm this only happens on process/event-loop shutdown? The best way to avoid this is to not attempt to gracefully exit the program while the client is in use somewhere.
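To make that advice concrete, a sketch of the background-task repro above adjusted so the program does not shut down while the client call is still pending (runtime/telemetry options omitted for brevity):

```python
import asyncio
import uuid

from temporalio.api.workflowservice.v1 import StartWorkflowExecutionRequest
from temporalio.client import Client


async def main() -> None:
    client = await Client.connect("localhost:7233", namespace="default")
    req = StartWorkflowExecutionRequest(
        namespace="default",
        workflow_id=f"my-workflow-id-{uuid.uuid4()}",
        identity="my-identity",
        request_id=str(uuid.uuid4()),
    )
    req.workflow_type.name = "my-workflow"
    req.task_queue.name = "my-task-queue"
    task = asyncio.create_task(client.workflow_service.start_workflow_execution(req))
    # Key difference from the repro above: await the task so the client is no
    # longer in use when asyncio.run() tears down the event loop and the
    # process exits.
    await task


asyncio.run(main())
```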
Describe the bug

On a short-lived, client-only process, after an `execute_workflow` client call completes, one user reported getting:

Maybe this is caused by process death while waiting on a client call to complete? Try to replicate.
There is some discussion at PyO3/pyo3#1274 that predates pyo3-asyncio. Maybe I am not implementing our custom Tokio pyo3 asyncio extension properly?
It looks like we shouldn't be calling `Python::with_gil` in callbacks (i.e. not in a Python-owned thread) for any reason, so we need to work around that. But https://pyo3.rs/main/ecosystem/async-await.html#awaiting-a-rust-future-in-python shows it used in a callback.

First thing is a replication, then we can see whether a pyo3-asyncio upgrade can help.