@jax-cn commented Dec 8, 2025

This PR introduces a new Inference Device (dev_inference) that provides an OpenAI-compatible API for local LLM inference. It also adds an SEV GPU Device (dev_sev_gpu) to support NVIDIA GPU TEE attestation, ensuring that inference workloads are secure and verifiable.

To support these features, the core HTTP handling logic has been extended to support Server-Sent Events (SSE) for streaming responses.
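
For context, an OpenAI-compatible streaming endpoint frames each token as an SSE data: event and terminates the stream with a [DONE] sentinel, roughly as follows (payload abbreviated):

data: {"choices":[{"index":0,"delta":{"content":"Paris"}}]}

data: [DONE]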

Key Changes

🚀 New Features

  • Inference Device (dev_inference.erl):
    • Implements an OpenAI-compatible API (e.g., /v1/chat/completions); a minimal handler sketch follows this list.
    • Manages the lifecycle of a local Python-based deterministic inference server.
    • Supports streaming responses via SSE.
  • GPU Attestation (dev_sev_gpu.erl and the native dev_sev_gpu scripts):
    • Added support for generating and verifying NVIDIA GPU TEE attestations within an AMD SEV-SNP environment.
    • Includes native Python scripts for interacting with the TEE environment.
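
As a rough illustration of the handler shape such a device might expose, here is a minimal sketch assuming the usual HyperBEAM convention of exporting device keys as three-argument functions over messages. The module name and body are illustrative, not the actual dev_inference.erl:

%% Illustrative sketch; not the actual dev_inference.erl.
-module(dev_inference_sketch).
-export([health/3]).

%% A device key is an exported function taking the base message, the request
%% message, and the node options, and returning {ok, Result}.
health(_Msg1, _Msg2, _Opts) ->
    {ok, #{ <<"status">> => 200, <<"body">> => <<"OK">> }}.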

🛠 Core Modifications

  • HTTP Streaming (hb_http.erl):
    • Updated reply/5 to handle stream_generator, enabling real-time token streaming for LLM responses (a streaming sketch follows this list).
    • Added proper CORS and header handling for event streams.
  • Configuration (hb_opts.erl):
    • Registered inference@1.0 and sev_gpu@1.0 devices.
    • Added default routing for /v1/.* to the local inference server.
    • Added inference_opts for model configuration (hash, name, size); these options, together with the LMDB setting below, are illustrated in the configuration sketch after this list.
  • Storage (hb_store_lmdb.erl):
    • Exposed max_readers configuration to optimize LMDB for high-concurrency read scenarios.
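
A minimal sketch of how such a streaming reply can be driven with Cowboy, which HyperBEAM's HTTP stack builds on. The function names and the {token, Data, Next} / done generator protocol here are assumptions for illustration, not the actual hb_http.erl internals:

%% Illustrative sketch only; names and the generator protocol are assumed.
stream_reply(Req, Generator) ->
    %% Open a chunked response with SSE headers and permissive CORS.
    Req2 = cowboy_req:stream_reply(200, #{
        <<"content-type">> => <<"text/event-stream">>,
        <<"cache-control">> => <<"no-cache">>,
        <<"access-control-allow-origin">> => <<"*">>
    }, Req),
    stream_loop(Req2, Generator).

stream_loop(Req, Generator) ->
    case Generator() of
        {token, Data, Next} ->
            %% Frame each token as an SSE `data:` event.
            ok = cowboy_req:stream_body(<<"data: ", Data/binary, "\n\n">>, nofin, Req),
            stream_loop(Req, Next);
        done ->
            %% Close the stream with the OpenAI-style sentinel.
            ok = cowboy_req:stream_body(<<"data: [DONE]\n\n">>, fin, Req)
    end.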
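
And a hypothetical node-options fragment illustrating the configuration and storage bullets above. Every key name (routes, inference_opts, max_readers) and the local port are educated guesses at the shape described here, not the exact hb_opts.erl schema:

%% Illustrative node options; key names and values are assumptions.
#{
    %% Route OpenAI-style paths to the local inference server.
    routes => [
        #{ <<"template">> => <<"/v1/.*">>,
           <<"node">> => #{ <<"prefix">> => <<"http://127.0.0.1:9000">> } }
    ],
    %% Describe the model to serve.
    inference_opts => #{
        <<"model_name">> => <<"qwen/qwen2.5-0.5b-instruct">>,
        <<"model_hash">> => <<"...">>,  %% integrity pin, left unspecified here
        <<"model_size">> => <<"0.5b">>
    },
    %% Raise the LMDB reader-slot limit for high-concurrency reads.
    store => #{ <<"store-module">> => hb_store_lmdb,
                <<"max_readers">> => 512 }
}.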

🧹 Maintenance

  • Lifecycle Management: Updated hb_app:stop/1 to ensure the inference server is gracefully shut down.
  • Build System: Updated rebar.config with new profiles and hooks for setting up the inference and GPU environments (an illustrative fragment follows).
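
A hypothetical rebar.config fragment showing the shape such a profile might take; the profile body and the hook script path are illustrative assumptions:

%% Illustrative only; the actual profile contents may differ.
{profiles, [
    {inference, [
        {pre_hooks, [
            %% e.g. prepare the Python environment for the inference server
            {compile, "./native/dev_inference/setup.sh"}
        ]}
    ]}
]}.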

Testing

  1. Run: Start the node with the inference profile: HB_PRINT=inference rebar3 as inference shell.
  2. Verify:
    2.1 Check /health endpoint.
curl --request GET \
  --url 'http://localhost:8734/~inference@1.0/health'

    2.2 Test a completion request to /v1/chat/completions (both streaming and non-streaming).

curl --request POST \
  --url 'http://localhost:8734/~inference@1.0/chat/completions' \
  --header 'content-type: application/json' \
  --data '{
  "model": "qwen/qwen2.5-0.5b-instruct",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
'
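
To exercise the streaming path, send the same request with "stream": true (the standard flag in OpenAI-compatible APIs); tokens should then arrive incrementally as SSE events:

curl --no-buffer --request POST \
  --url 'http://localhost:8734/~inference@1.0/chat/completions' \
  --header 'content-type: application/json' \
  --data '{
  "model": "qwen/qwen2.5-0.5b-instruct",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
'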

    2.3 Verify TEE attestation if running on supported hardware.
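
If the attestation device exposes its keys over the same path convention as the inference device, a report could be requested with something like the following; the generate key is a hypothetical name used for illustration, not an endpoint confirmed by this PR:

curl --request GET \
  --url 'http://localhost:8734/~sev_gpu@1.0/generate'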

Jax added 30 commits November 25, 2025 21:37
commit 97e92aa
Author: jax <jax@apus.network>
Date:   Thu Jul 24 03:18:30 2025 +0000

    optimize code and add files for TC

commit 9a2c4dd
Author: jax <jax@apus.network>
Date:   Wed Jul 23 13:21:16 2025 +0000

    add more comments

commit 626d356
Author: jax <jax@apus.network>
Date:   Wed Jul 23 11:40:24 2025 +0000

    add dev_sev_gpu for gpu attestation generation