
[Bug] Investigate PyGILState_Release issue for client #300

Open
cretz opened this issue Mar 22, 2023 · 10 comments
Labels: bug

Comments

cretz (Member) commented Mar 22, 2023

Describe the bug

On a short-lived, client-only process, after the execute_workflow client call completes, one user reported getting:

Fatal Python error: PyGILState_Release: thread state 0x***************** must be current when releasing
Python runtime state: finalizing (tstate=0x*****************)
Thread 0x****************** (most recent call first):
  <no Python frame>

Maybe this is caused by the process dying while waiting on a client call to complete? Try to replicate.

There is some discussion at PyO3/pyo3#1274 that predates pyo3-asyncio. Maybe I am not implementing our custom Tokio pyo3 asyncio extension properly?

It looks like we shouldn't be calling Python::with_gil in callbacks (i.e. not in a Python-owned thread) for any reason, so we need to work around that. But https://pyo3.rs/main/ecosystem/async-await.html#awaiting-a-rust-future-in-python shows it used in a callback.

The first thing is a replication; then we can see whether a pyo3-asyncio upgrade can help.

TiunovNN commented Sep 25, 2024

Hi @cretz. I've come across a similar problem. Here is some simple code to reproduce the bug:

import asyncio
import logging

from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import (LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter)


async def connect_client(host):
    temporal_client: Client = await Client.connect(
        host,
        namespace='default',
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level='DEBUG',
                        other_level='DEBUG',
                    ),
                ),
            ),
        ),
    )

    try:
        await temporal_client.start_workflow('PackagerWorkflow', 'test', id='Pack',
                                             task_queue='default_queue')
    except WorkflowAlreadyStartedError:
        pass


def run(self):
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info('start app')
    asyncio.run(connect_client('localhost:7233'))
    logging.info('finish app')

if __name__ == "__main__":
    run()

log from gdb:

2024-09-25T09:32:59.616938Z DEBUG hyper_util::client::legacy::connect::dns: resolving host="localhost"                                                                                                            
2024-09-25T09:32:59.657156Z DEBUG hyper_util::client::legacy::connect::http: connecting to [::1]:7233                                                                                                             
2024-09-25T09:32:59.657644Z DEBUG hyper_util::client::legacy::connect::http: connecting to 127.0.0.1:7233                                                                                                         
2024-09-25T09:32:59.657936Z DEBUG hyper_util::client::legacy::connect::http: connected to 127.0.0.1:7233                                                                                                          
2024-09-25T09:32:59.657996Z DEBUG h2::client: binding client connection                                                                                                                                           
2024-09-25T09:32:59.658088Z DEBUG h2::client: client connection bound                                                                                                                                             
2024-09-25T09:32:59.658129Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384, max_header_list_size: 16384 }
2024-09-25T09:32:59.658413Z DEBUG Connection: h2::codec::framed_write: send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5177345 } peer=Client
2024-09-25T09:32:59.658430Z DEBUG tower::buffer::worker: service.ready=true processing request                                                                                                                    
2024-09-25T09:32:59.658655Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x0), max_frame_size: 16384 } peer=Client      
2024-09-25T09:32:59.658704Z DEBUG Connection: h2::codec::framed_write: send frame=Settings { flags: (0x1: ACK) } peer=Client      
2024-09-25T09:32:59.658715Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Client                                                       
2024-09-25T09:32:59.658763Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) } peer=Client                                                                                     
2024-09-25T09:32:59.658773Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) } peer=Client    
2024-09-25T09:32:59.658843Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x1: ACK) } peer=Client                                                                                   
2024-09-25T09:32:59.658872Z DEBUG Connection: h2::proto::settings: received settings ACK; applying Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384, max_header_list_size: 16384 } peer=Client
2024-09-25T09:32:59.659016Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5 } peer=Client
2024-09-25T09:32:59.659059Z DEBUG Connection: h2::codec::framed_read: received frame=Ping { ack: false, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                 
2024-09-25T09:32:59.659073Z DEBUG Connection: h2::codec::framed_write: send frame=Ping { ack: true, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                            
2024-09-25T09:32:59.659470Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Client
2024-09-25T09:32:59.659492Z DEBUG Connection: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) } peer=Client  
2024-09-25T09:32:59.659506Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) } peer=Client
2024-09-25T09:32:59.660903Z DEBUG tower::buffer::worker: service.ready=true processing request                                                                                                                    
2024-09-25T09:32:59.661020Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) } peer=Client                                                       
2024-09-25T09:32:59.661100Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3) } peer=Client                                                                                     
2024-09-25T09:32:59.661111Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) } peer=Client                                                           
2024-09-25T09:32:59.661419Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 170 } peer=Client                                                     
2024-09-25T09:32:59.661438Z DEBUG Connection: h2::codec::framed_read: received frame=Ping { ack: false, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                        
2024-09-25T09:32:59.661444Z DEBUG Connection: h2::codec::framed_write: send frame=Ping { ack: true, payload: [2, 4, 16, 16, 9, 14, 7, 7] } peer=Client                                                            
2024-09-25T09:32:59.664355Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) } peer=Client                                       
INFO:root:finish app                                                                                                                                                                                              
Fatal Python error: This thread state must be current when releasing                                                                                                                                              
Python runtime state: finalizing (tstate=0x95d720)                                                                                                                                                                
                                                                                                                                                                                                                  
Current thread 0x00007ffedaff7700 (most recent call first):                                                                                                                                                       
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 140 in _write_to_self                                                                                                                                
  File "/usr/lib/python3.8/asyncio/base_events.py", line 770 in call_soon_threadsafe                                                                                                                              
                                                                                                                                                                                                                  
Thread 0x00007ffff7bfd740 (most recent call first):                                                                                                                                                               
<no Python frame>                                                                                                                                                                                                 
                                                                                                                                                                                                                  
Thread 79 "tokio-runtime-w" received signal SIGABRT, Aborted.                                            

Full stack trace of failed thread: https://gist.github.com/TiunovNN/84c253d62f03f11270a49facee6c7b52

I suppose the root cause is that Python does not call the destructor immediately, and, as written in the docs, a Client does not have an explicit "close". When CPython is finalizing, it acquires the GIL and executes the destructor of tokio::runtime::Runtime, which waits for all tasks to finish. At the same time, the Tokio runtime receives a ping message and runs Python code to handle it, which tries to acquire the GIL as well.

It would be better if there were an explicit way to close the client.

cretz (Member, Author) commented Sep 25, 2024

Here is some simple code to reproduce the bug:

I am struggling to replicate. There is a slight bug where run accepts self, but even after fixing that I didn't see the failure. Can you confirm your OS, temporalio SDK version, and Python version?

It would be better if there were an explicit way to close the client.

I don't think this would solve it if the issue is Tokio runtime cleanup. A better solution may be something like what hydro-project/hydro#699 did (linked from hydro-project/hydro#619, which linked here as having a common issue), specifically using something like atexit.register to register the runtime drop.

But ideally we would get a reliable replication first so we can confirm it is fixed once we make the change. I am struggling to get one currently.
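
For illustration, a rough sketch of that atexit idea on the Python side. The _drop_bridge_runtime hook is hypothetical; the bridge does not expose such a call today, so this is only the shape of what would be registered:

import atexit


def _drop_bridge_runtime():
    # Hypothetical hook: ask the Rust side to drop its Tokio runtime before
    # CPython begins finalizing, so no Tokio thread re-enters Python after
    # interpreter teardown has started. Stubbed out here on purpose.
    pass


# atexit callbacks run during normal interpreter exit, before most of
# finalization, which is the window we want for dropping the runtime.
atexit.register(_drop_bridge_runtime)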

TiunovNN commented Sep 25, 2024

I am struggling to replicate.

My mistake. I forgot to mention that this bug does not appear on every run, only rarely, so you have to rerun the script several times.
I used gdb in the following manner:

$ gdb python
(gdb) set pagination off
(gdb) break _exit
(gdb) command
> run
> end
(gdb) run script.py

And wait for several minutes.

Can you confirm your OS, temporalio SDK version, and Python version?

The bug appears with the following combinations:

  • Ubuntu 20.04.6 LTS + python 3.12 + temporal 1.5.0
  • Ubuntu 22.04.5 LTS + python 3.12 + temporal 1.5.0
  • Ubuntu 20.04.6 LTS + python 3.12 + temporal 1.7.1
  • Ubuntu 22.04.5 LTS + python 3.12 + temporal 1.7.1
  • Ubuntu 20.04.6 LTS + python 3.8 + temporal 1.7.1

cretz (Member, Author) commented Sep 25, 2024

Thanks! We will increase the priority on this.

TiunovNN commented Oct 2, 2024

@cretz

Specifically using something like atexit.register to register runtime drop

Please take into account that the client destructor might be called during execution (not only at exit), whenever there are no references left to the object. I would recommend a context manager as the appropriate solution for Python.
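
For example, something along these lines. This is purely a sketch of the suggestion: Client.connect is the real API, but close() is a hypothetical explicit close/disconnect that temporalio does not currently provide:

from contextlib import asynccontextmanager

from temporalio.client import Client


@asynccontextmanager
async def temporal_client(host: str):
    client = await Client.connect(host, namespace="default")
    try:
        yield client
    finally:
        # Hypothetical: an explicit close would let the underlying Tokio
        # runtime shut down deterministically here instead of at GC or
        # interpreter-finalization time.
        await client.close()  # not a real temporalio API today

Usage would then be "async with temporal_client('localhost:7233') as client: ...", so the client is released before asyncio.run returns.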

cretz (Member, Author) commented Oct 2, 2024

the client destructor might be called during execution (not only at exit), whenever there are no references left to the object

This is not just about the client destructor; it is about when the client is dropped on the Rust side. If the client is in use in any way (Python or Rust), it is not closed/disconnected. We can't just use a context manager because the client is passed around (e.g. to a worker) and may only be retained via Rust references. I think this error is specific to the Python-Rust bridge and PyO3.

cretz (Member, Author) commented Oct 15, 2024

Update here: I am still struggling to replicate. I suspect this is actually a Tokio runtime lifetime issue and not specific to the client, but we are still trying to replicate reliably (running the script under gdb, at least on 3.10, just keeps working forever for me).

cretz (Member, Author) commented Oct 15, 2024

I can replicate a similar issue by changing the code to just run a task in the background and then complete the app without waiting for it, e.g.:

import asyncio
import logging
import uuid

from temporalio.api.workflowservice.v1 import StartWorkflowExecutionRequest
from temporalio.client import Client
from temporalio.exceptions import WorkflowAlreadyStartedError
from temporalio.runtime import LoggingConfig, Runtime, TelemetryConfig, TelemetryFilter


async def connect_client(host):
    client = await Client.connect(
        host,
        namespace="default",
        runtime=Runtime(
            telemetry=TelemetryConfig(
                logging=LoggingConfig(
                    filter=TelemetryFilter(
                        core_level="DEBUG",
                        other_level="DEBUG",
                    ),
                ),
            ),
        ),
    )

    req = StartWorkflowExecutionRequest(
        namespace="default",
        workflow_id=f"my-workflow-id-{uuid.uuid4()}",
        identity="my-identity",
        request_id=str(uuid.uuid4()),
    )
    req.workflow_type.name = "my-workflow"
    req.task_queue.name = "my-task-queue"

    asyncio.create_task(client.workflow_service.start_workflow_execution(req))


def run():
    logging.basicConfig(level=logging.DEBUG, force=True)
    logging.info("start app")
    asyncio.run(connect_client("localhost:7233"))
    logging.info("finish app")


if __name__ == "__main__":
    run()

This will give:

FATAL: exception not rethrown
Aborted

My current guess is that this happens because pyo3-asyncio assumes Python::with_gil will succeed, but that delegates to PyGILState_Ensure, which is documented at https://docs.python.org/3/c-api/init.html#c.PyGILState_Ensure:

Calling this function from a thread when the runtime is finalizing will terminate the thread, even if the thread was not created by Python. You can use Py_IsFinalizing() or sys.is_finalizing() to check if the interpreter is in process of being finalized before calling this function to avoid unwanted termination.

Unfortunately Py_IsFinalizing is 3.13+ and ideally we don't want the overhead of invoking sys.is_finalizing, but we may have to (and it's not that bad considering pyo3-asyncio already uses things like call_soon_threadsafe).
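
A minimal sketch of the shape of that check on the Python-visible side (the real guard would live in the Rust bridge before it re-enters Python; schedule_on_loop is just an illustrative helper):

import sys


def schedule_on_loop(loop, callback, *args):
    # If the interpreter is already finalizing, skip re-entering Python
    # rather than letting PyGILState_Ensure terminate the calling thread.
    if sys.is_finalizing():
        return
    loop.call_soon_threadsafe(callback, *args)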

Unfortunately, https://github.com/awestlake87/pyo3-asyncio is mostly abandoned, https://github.com/PyO3/pyo3-async-runtimes is just a simple fork, and built-in PyO3 asyncio support for Rust futures is still being worked on. So we would probably end up vendoring whatever we haven't already from pyo3-asyncio (we already copy and slightly alter some code to use the current Tokio runtime instead of one big static global).

i3Cheese commented

@cretz
Hello! I encountered the same issue, and I would like to help fix it. Where do I start?

cretz (Member, Author) commented Nov 20, 2024

Assuming our guess at the issue is correct, fixing this is daunting because it probably requires rewriting/vendoring a Rust library (pyo3-asyncio). Can you confirm this only happens on process/event-loop shutdown? The best way to avoid this is to not attempt to gracefully exit the program while the client is in use somewhere.
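
For the repro above, avoiding that means not leaving the background start_workflow_execution task dangling when the event loop shuts down. A sketch of the shape, where do_client_work stands in for the client call from the earlier script:

import asyncio


async def do_client_work():
    # Stand-in for client.workflow_service.start_workflow_execution(req)
    # from the repro script above.
    await asyncio.sleep(0)


async def main():
    task = asyncio.create_task(do_client_work())
    # Await (or cancel and then await) the background task before returning,
    # so nothing is still using the client when asyncio.run() tears down the
    # event loop and the interpreter starts finalizing.
    await task


if __name__ == "__main__":
    asyncio.run(main())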
