Crafting OpenAI-Compatible APIs with Saar Berkovich

### Implementing a Mock OpenAI Chat API with FastAPI

In the realm of AI and chat applications, creating a mock version of an API can be an invaluable step for testing and development. This guide walks through implementing a mock OpenAI chat API with both non-streaming and streaming responses using FastAPI, a modern, high-performance web framework for building APIs with Python.

#### Modeling Our Request with Pydantic

We start by defining our data models using Pydantic, a data validation and settings management library. These models will help us structure the requests and responses our mock API will handle.

```python
from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
```
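For illustration only (the payload below is an arbitrary example, not part of the original walkthrough), parsing an incoming request body with this model fills in the defaults automatically:

```python
# Example payload a client might send; unspecified fields fall back to the defaults
payload = {
    "model": "mock-gpt-model",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = ChatCompletionRequest(**payload)
print(req.max_tokens, req.stream)  # 512 False
```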

#### Creating the FastAPI Endpoint

Next, we implement our FastAPI endpoint to handle chat completions. This endpoint will simulate the behavior of an AI chat model by echoing the last user message or indicating when no messages are present.

```python
import time
from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # echo the last message back, or explain that there was nothing to echo
    if request.messages and request.messages[0].role == 'user':
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"

    # shape the response like an OpenAI chat completion object
    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }
```

#### Testing Our Implementation

To test the non-streaming part of our mock API, we run the FastAPI app with uvicorn and use the OpenAI client library, simulating a client-server interaction.

```python
# Assuming both code blocks are in a file called main.py
# Install dependencies: pip install fastapi openai uvicorn
# Launch the server: uvicorn main:app
```
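Before wiring up the OpenAI client (shown in full later in this post), a quick in-process check with FastAPI's `TestClient` can confirm the endpoint behaves as expected. This is a minimal sketch, not part of the original walkthrough; it assumes the code above lives in `main.py`:

```python
# Hypothetical sanity-check script; exercises the endpoint without starting uvicorn
from fastapi.testclient import TestClient
from main import app  # assumes the server code above is in main.py

client = TestClient(app)
resp = client.post(
    "/chat/completions",
    json={"messages": [{"role": "user", "content": "Say this is a test"}]},
)
# prints the echoed message from the mock assistant
print(resp.json()["choices"][0]["message"]["content"])
```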

#### Implementing Streaming Responses

For a more realistic simulation, especially for computationally expensive LLM generation, streaming responses back to the client can be beneficial. We achieve this by creating a generator function and modifying our endpoint to return a `StreamingResponse` when streaming is requested.

```python
import asyncio
import json
from starlette.responses import StreamingResponse

# The generator function and the modified endpoint are sketched below
```
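The generator itself is not reproduced above, so here is a minimal sketch. It assumes we treat every word of the reply as a "token", wrap it in an OpenAI-style `chat.completion.chunk` object, and frame each chunk as a server-sent-event line; the chunk ids, model name, and one-second delay are illustrative choices, while the function name `_resp_async_generator` matches the endpoint code shown later in this post:

```python
import asyncio
import json
import time

async def _resp_async_generator(text_resp: str):
    # pretend every word is a token and emit it as an OpenAI-style chunk
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "mock-gpt-model",
            "choices": [{"delta": {"content": token + " "}}],
        }
        # SSE-style framing: each chunk goes out as a "data: ..." line
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)  # artificial delay so the streaming is visible

    # sentinel the OpenAI client recognizes as end-of-stream
    yield "data: [DONE]\n\n"
```

In the endpoint, this generator is wrapped in `StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")` whenever `request.stream` is set, as shown in the full endpoint later in this post.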

#### Testing the Streaming Implementation

After implementing the streaming functionality, we test it by simulating a client request that specifies streaming, observing the gradual output that mimics token generation.
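With the generator sketched above, the raw response body arrives incrementally as `data:` lines, roughly like this (illustrative output, one word per chunk):

```
data: {"id": 0, "object": "chat.completion.chunk", "choices": [{"delta": {"content": "As "}}], ...}
data: {"id": 1, "object": "chat.completion.chunk", "choices": [{"delta": {"content": "a "}}], ...}
...
data: [DONE]
```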

#### Putting It All Together

Finally, we consolidate our code, demonstrating a complete setup for a mock OpenAI-compatible API server that can handle both non-streaming and streaming chat completion requests.

This guide provides a foundation for simulating an AI chat service, offering a sandbox for testing, development, and understanding the intricacies of working with streaming data and API requests in a Python environment.

In the rapidly evolving world of artificial intelligence and machine learning, developers and researchers are constantly seeking ways to improve the interaction between humans and AI. One of the latest advancements in this field is the development of more responsive and interactive chat APIs, which aim to make conversations with AI as seamless and natural as possible. Today, we’re diving into an innovative approach to enhancing these interactions through streaming responses, using a mock-up example that closely mirrors the functionality provided by OpenAI’s GPT models.

Our journey begins with the implementation of a non-streaming chat completion request. Utilizing Pydantic, a data validation and settings management library, we define our request model to include essential parameters such as the model name, a list of chat messages, maximum token count, and temperature. This setup aims to replicate the API reference provided by OpenAI, focusing on the core elements necessary for a basic interaction.

```python
from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
```

Following the model definition, we proceed to create a FastAPI endpoint that handles chat completions. This endpoint checks whether the request contains any messages and echoes the last one back accordingly. If the `stream` parameter is set to `True`, the response is streamed back to the client, enhancing the user experience by providing immediate feedback as the AI generates its response.

```python
import time
from fastapi import FastAPI
from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"
    if request.stream:
        # stream the reply chunk by chunk using the generator function
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")
    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }
```

To test our implementation, we simulate a client-server interaction where the client sends a chat completion request to our FastAPI server. Using OpenAI’s Python client library as a reference, we connect to our local server and send a request, observing how the server handles both non-streaming and streaming responses.

```python
from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

# call API for a non-streaming response
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say this is a test"}],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)
```
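Run against the mock server above, this prints the echoed prompt, e.g. `As a mock AI Assistant, I can only echo your last message:Say this is a test`.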

For the streaming implementation, the endpoint above already returns a `StreamingResponse` when the `stream` parameter is `True`; on the client side, we simply pass `stream=True` and iterate over the chunks as they arrive. This allows the client to receive the response incrementally, enhancing the interactivity of the chat experience.

```python
# call API for a streaming response
stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "")
```
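Each chunk's `delta` carries only the next piece of the reply, so printing (or concatenating) the deltas in order gradually reconstructs the full echoed message, mimicking token-by-token generation.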

This approach to handling chat completions not only improves the user experience by providing immediate feedback but also showcases the flexibility and power of modern AI APIs. By leveraging streaming responses, developers can create more engaging and interactive applications that closely mimic human conversation dynamics. As we continue to explore and innovate in this space, the possibilities for enhancing human-AI interaction seem limitless.
