
[Bug] Investigate PyGILState_Release issue for client #300

Open
cretz opened this issue Mar 22, 2023 · 10 comments
Labels: bug

Comments

cretz (Member) commented Mar 22, 2023

Describe the bug

On a short-lived, client-only process, after the execute_workflow client call completes, one user reported getting:

Fatal Python error: PyGILState_Release: thread state 0x***************** must be current when releasing
Python runtime state: finalizing (tstate=0x*****************)
Thread 0x****************** (most recent call first):
  <no Python frame>

Maybe this is caused by the process dying while waiting on a client call to complete? Try to replicate.

There is some discussion at PyO3/pyo3#1274 that predates pyo3-asyncio. Maybe I am not implementing our custom Tokio pyo3 asyncio extension properly?

It looks like we shouldn't be calling Python::with_gil in callbacks (i.e. not in a Python-owned thread) for any reason, so we need to work around that. But https://pyo3.rs/main/ecosystem/async-await.html#awaiting-a-rust-future-in-python shows it used in a callback.

The first thing is a replication; then we can see whether a pyo3-asyncio upgrade can help.

TiunovNN commented Sep 25, 2024

Hi @cretz. I've come across a similar problem. Here is some simple code to reproduce the bug:

import asyncio
import logging

from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import (LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter)


async def connect_client(host):
    temporal_client: Client = await Client.connect(
        host,
        namespace='default',
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level='DEBUG',
                        other_level='DEBUG',
                    ),
                ),
            ),
        ),
    )

    try:
        await temporal_client.start_workflow('PackagerWorkflow', 'test', id='Pack',
                                             task_queue='default_queue')
    except WorkflowAlreadyStartedError:
        pass


def run(self):
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info('start app')
    asyncio.run(connect_client('localhost:7233'))
    logging.info('finish app')

if __name__ == "__main__":
    run()

log from gdb:

2024-09-25T09:32:59.616938Z DEBUG hyper_util::client::legacy::connect::dns: resolving host="localhost"                                                                                                            
2024-09-25T09:32:59.657156Z DEBUG hyper_util::client::legacy::connect::http: connecting to [::1]:7233                                                                                                             
2024-09-25T09:32:59.657644Z DEBUG hyper_util::client::legacy::connect::http: connecting to 127.0.0.1:7233                                                                                                         
2024-09-25T09:32:59.657936Z DEBUG hyper_util::client::legacy::connect::http: connected to 127.0.0.1:7233                                                                                                          
2024-09-25T09:32:59.657996Z DEBUG h2::client: binding client connection                                                                                                                                           
2024-09-25T09:32:59.658088Z DEBUG h2::client: client connection bound                                                                                                                                             
2024-09-25T09:32:59.658129Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384, max_header_list_size: 16384 }
2024-09-25T09:32:59.658413Z DEBUG Connection: h2::codec::framed_write: send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5177345 } peer=Client
2024-09-25T09:32:59.658430Z DEBUG tower::buffer::worker: service.ready=true processing request                                                                                                                    
2024-09-25T09:32:59.658655Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x0), max_frame_size: 16384 } peer=Client      
2024-09-25T09:32:59.658704Z DEBUG Connection: h2::codec::framed_write: send frame=Settings { flags: (0x1: ACK) } peer=Client      
2024-09-25T09:32:59.658715Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Client                                                       
2024-09-25T09:32:59.658763Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) } peer=Client                                                                                     
2024-09-25T09:32:59.658773Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) } peer=Client    
2024-09-25T09:32:59.658843Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x1: ACK) } peer=Client                                                                                   
2024-09-25T09:32:59.658872Z DEBUG Connection: h2::proto::settings: received settings ACK; applying Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384, max_header_list_size: 16384 } peer=Client
2024-09-25T09:32:59.659016Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5 } peer=Client
2024-09-25T09:32:59.659059Z DEBUG Connection: h2::codec::framed_read: received frame=Ping { ack: false, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                 
2024-09-25T09:32:59.659073Z DEBUG Connection: h2::codec::framed_write: send frame=Ping { ack: true, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                            
2024-09-25T09:32:59.659470Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Client
2024-09-25T09:32:59.659492Z DEBUG Connection: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) } peer=Client  
2024-09-25T09:32:59.659506Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) } peer=Client
2024-09-25T09:32:59.660903Z DEBUG tower::buffer::worker: service.ready=true processing request                                                                                                                    
2024-09-25T09:32:59.661020Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) } peer=Client                                                       
2024-09-25T09:32:59.661100Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3) } peer=Client                                                                                     
2024-09-25T09:32:59.661111Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) } peer=Client                                                           
2024-09-25T09:32:59.661419Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 170 } peer=Client                                                     
2024-09-25T09:32:59.661438Z DEBUG Connection: h2::codec::framed_read: received frame=Ping { ack: false, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                        
2024-09-25T09:32:59.661444Z DEBUG Connection: h2::codec::framed_write: send frame=Ping { ack: true, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                            
2024-09-25T09:32:59.664355Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) } peer=Client                                       
INFO:root:finish app                                                                                                                                                                                              
Fatal Python error: This thread state must be current when releasing                                                                                                                                              
Python runtime state: finalizing (tstate=0x95d720)                                                                                                                                                                
                                                                                                                                                                                                                  
Current thread 0x00007ffedaff7700 (most recent call first):                                                                                                                                                       
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 140 in _write_to_self                                                                                                                                
  File "/usr/lib/python3.8/asyncio/base_events.py", line 770 in call_soon_threadsafe                                                                                                                              
                                                                                                                                                                                                                  
Thread 0x00007ffff7bfd740 (most recent call first):                                                                                                                                                               
<no Python frame>                                                                                                                                                                                                 
                                                                                                                                                                                                                  
Thread 79 "tokio-runtime-w" received signal SIGABRT, Aborted.                                            

Full stack trace of failed thread: https://gist.github.com/TiunovNN/84c253d62f03f11270a49facee6c7b52

I suppose the root cause is that Python does not call the destructor immediately, and, as written in the docs, a Client does not have an explicit "close". When CPython is finalizing, it acquires the GIL and executes the destructor of tokio::runtime::Runtime, which waits for all tasks to finish. At the same time, the Tokio runtime receives a ping message and runs Python code to handle it, which tries to acquire the GIL as well.

It would be better if there were an explicit way to close the client.

cretz (Member, Author) commented Sep 25, 2024

Here is some simple code to reproduce the bug:

I am struggling to replicate. There is a slight bug where run accepts self, but even after fixing that I didn't see the failure. Can you confirm your OS, temporalio SDK version, and Python version?

It would be better if there were an explicit way to close the client.

I don't think this would solve it if the issue is Tokio runtime cleanup. A better solution may be something like what hydro-project/hydro#699 did (linked from hydro-project/hydro#619, which linked here as having a common issue), specifically using something like atexit.register to register the runtime drop.

But ideally we would get a reliable replication first so we can confirm it is fixed once we make the change. I am struggling to get one currently.
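
For illustration, a rough sketch of that atexit idea on the Python side. The _drop_bridge_runtime hook is hypothetical; the bridge does not expose such a call today, so this is only the shape of what would be registered:

import atexit


def _drop_bridge_runtime():
    # Hypothetical hook: ask the Rust side to drop its Tokio runtime before
    # CPython begins finalizing, so no Tokio thread re-enters Python after
    # interpreter teardown has started. Stubbed out here on purpose.
    pass


# atexit callbacks run during normal interpreter exit, before most of
# finalization, which is the window we want for dropping the runtime.
atexit.register(_drop_bridge_runtime)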

TiunovNN commented Sep 25, 2024

I am struggling to replicate.

My mistake. I forgot to mention that this bug does not appear on every run, only rarely, so you have to rerun the script several times.
I used gdb in the following manner:

$ gdb python
(gdb) set pagination off
(gdb) break _exit
(gdb) command
> run
> end
(gdb) run script.py

And wait for several minutes.

Can you confirm your OS, temporalio SDK version, and Python version?

The bug appears with the following combinations:

  • Ubuntu 20.04.6 LTS + python 3.12 + temporal 1.5.0
  • Ubuntu 22.04.5 LTS + python 3.12 + temporal 1.5.0
  • Ubuntu 20.04.6 LTS + python 3.12 + temporal 1.7.1
  • Ubuntu 22.04.5 LTS + python 3.12 + temporal 1.7.1
  • Ubuntu 20.04.6 LTS + python 3.8 + temporal 1.7.1

cretz (Member, Author) commented Sep 25, 2024

Thanks! We will increase the priority on this.

TiunovNN commented Oct 2, 2024

@cretz

Specifically using something like atexit.register to register runtime drop

Please take into account that the client destructor might be called during execution (not only at exit), whenever there are no references left to the object. I would recommend a context manager as the appropriate solution for Python.
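
For example, something along these lines. This is purely a sketch of the suggestion: Client.connect is the real API, but close() is a hypothetical explicit close/disconnect that temporalio does not currently provide:

from contextlib import asynccontextmanager

from temporalio.client import Client


@asynccontextmanager
async def temporal_client(host: str):
    client = await Client.connect(host, namespace="default")
    try:
        yield client
    finally:
        # Hypothetical: an explicit close would let the underlying Tokio
        # runtime shut down deterministically here instead of at GC or
        # interpreter-finalization time.
        await client.close()  # not a real temporalio API today

Usage would then be "async with temporal_client('localhost:7233') as client: ...", so the client is released before asyncio.run returns.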

cretz (Member, Author) commented Oct 2, 2024

the client destructor might be called during execution (not only at exit), whenever there are no references left to the object

This is not just about the client destructor; it is about when the client is dropped on the Rust side. If the client is in use in any way (Python or Rust), it is not closed/disconnected. We can't just use a context manager because the client is passed around (e.g. to a worker) and may only be retained via Rust references. I think this error is specific to the Python-Rust bridge and PyO3.

cretz (Member, Author) commented Oct 15, 2024

Update here: I am still struggling to replicate. I suspect this is actually a Tokio runtime lifetime issue and not specific to the client, but we are still trying to replicate reliably (running the script under gdb, at least on 3.10, just keeps working forever for me).

cretz (Member, Author) commented Oct 15, 2024

I can replicate a similar issue by changing the code to just run a task in the background and then complete the app without waiting for it, e.g.:

import asyncio
import logging
import uuid

from temporalio.api.workflowservice.v1 import StartWorkflowExecutionRequest
from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter


async def connect_client(host):
    client = await Client.connect(
        host,
        namespace="default",
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level="DEBUG",
                        other_level="DEBUG",
                    ),
                ),
            ),
        ),
    )

    req = StartWorkflowExecutionRequest(
        namespace="default",
        workflow_id=f"my-workflow-id-{uuid.uuid4()}",
        identity="my-identity",
        request_id=str(uuid.uuid4()),
    )
    req.workflow_type.name = "my-workflow"
    req.task_queue.name = "my-task-queue"

    asyncio.create_task(client.workflow_service.start_workflow_execution(req))


def run():
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info("start app")
    asyncio.run(connect_client("localhost:7233"))
    logging.info("finish app")


if __name__ == "__main__":
    run()

This will give:

FATAL: exception not rethrown
Aborted

My current guess is that this happens because pyo3-asyncio assumes Python::with_gil will succeed, but that delegates to PyGILState_Ensure, which is documented at https://docs.python.org/3/c-api/init.html#c.PyGILState_Ensure:

Calling this function from a thread when the runtime is finalizing will terminate the thread, even if the thread was not created by Python. You can use Py_IsFinalizing() or sys.is_finalizing() to check if the interpreter is in process of being finalized before calling this function to avoid unwanted termination.

Unfortunately Py_IsFinalizing is 3.13+ and ideally we don't want the overhead of invoking sys.is_finalizing, but we may have to (and it's not that bad considering pyo3-asyncio already uses things like call_soon_threadsafe).
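
A minimal sketch of the shape of that check on the Python-visible side (the real guard would live in the Rust bridge before it re-enters Python; schedule_on_loop is just an illustrative helper):

import sys


def schedule_on_loop(loop, callback, *args):
    # If the interpreter is already finalizing, skip re-entering Python
    # rather than letting PyGILState_Ensure terminate the calling thread.
    if sys.is_finalizing():
        return
    loop.call_soon_threadsafe(callback, *args)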

Unfortunately, https://github.com/awestlake87/pyo3-asyncio is mostly abandoned, https://github.com/PyO3/pyo3-async-runtimes is just a simple fork, and built-in PyO3 asyncio support for Rust futures is still being worked on. So we would probably end up vendoring whatever we haven't already from pyo3-asyncio (we already copy and slightly alter some code to use the current Tokio runtime instead of one big static global).

i3Cheese commented

@cretz
Hello! I encountered the same issue, and I would like to help fix it. Where do I start?

cretz (Member, Author) commented Nov 20, 2024

Assuming our guess at the issue is correct, fixing this is daunting because it probably requires rewriting/vendoring a Rust library (pyo3-asyncio). Can you confirm this only happens on process/event-loop shutdown? The best way to avoid this is to not attempt to gracefully exit the program while the client is in use somewhere.
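
For the repro above, avoiding that means not leaving the background start_workflow_execution task dangling when the event loop shuts down. A sketch of the shape, where do_client_work stands in for the client call from the earlier script:

import asyncio


async def do_client_work():
    # Stand-in for client.workflow_service.start_workflow_execution(req)
    # from the repro script above.
    await asyncio.sleep(0)


async def main():
    task = asyncio.create_task(do_client_work())
    # Await (or cancel and then await) the background task before returning,
    # so nothing is still using the client when asyncio.run() tears down the
    # event loop and the interpreter starts finalizing.
    await task


if __name__ == "__main__":
    asyncio.run(main())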
