# Cached Instances

## Overview
By default, GenAI Monitor stores only one copy of each inference (i.e., one output per unique set of input arguments) to optimize storage and avoid redundant API calls. However, there are many scenarios in which you might want to preserve multiple unique outputs for the same input, such as:
- Exploring the variety of possible responses with non-deterministic generation
- A/B testing different outputs for the same prompt
- Building datasets with diverse completions
- Testing model consistency across multiple runs
## How It Works

### Storage Phase
When `max_unique_instances` is set to a value greater than 1:
- For each new invocation with identical inputs, GenAI Monitor checks whether the maximum number of unique instances has been reached
- If the limit hasn't been reached, the function executes normally and the new output is stored
- Once the limit is reached, GenAI Monitor switches to retrieval mode (see the sketch after this list)
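Conceptually, the storage-phase check boils down to counting how many outputs have been stored for a given input. The following is a minimal illustrative sketch, not GenAI Monitor's actual internals; the names `_cache`, `should_execute`, and `store_output` are hypothetical:

```python
from collections import defaultdict

# Hypothetical store mapping an input hash to its list of unique outputs.
_cache: dict = defaultdict(list)

def should_execute(input_key: str, max_unique_instances: int) -> bool:
    """Return True while the unique-instance limit has not been reached."""
    return len(_cache[input_key]) < max_unique_instances

def store_output(input_key: str, output) -> None:
    """Persist a newly generated output for this input."""
    _cache[input_key].append(output)
```

Once `should_execute` returns `False`, the monitor stops invoking the function and serves stored outputs instead.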
### Retrieval Phase

When the maximum number of unique instances has been stored:

1. GenAI Monitor selects one of the previously stored outputs using round-robin selection
2. The selected output is returned without re-invoking the function
3. Each subsequent call with identical inputs receives the next stored output in sequence
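Round-robin selection amounts to a modulo over a per-input call counter. Again, this is a hypothetical sketch with illustrative names (`_retrieval_counter`, `retrieve_round_robin`), not part of the GenAI Monitor API:

```python
from collections import defaultdict
from itertools import count

# Per-input-key call counter (illustrative).
_retrieval_counter: dict = defaultdict(count)

def retrieve_round_robin(input_key: str, stored_outputs: list):
    # The counter advances on every call, so identical inputs receive
    # the stored outputs in sequence: 0, 1, 2, 0, 1, 2, ...
    index = next(_retrieval_counter[input_key]) % len(stored_outputs)
    return stored_outputs[index]
```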
## Configuring Multiple Unique Instances
GenAI Monitor provides a registration API that supports the `max_unique_instances` parameter:
```python
from copy import deepcopy
import io

import torch
from torch import nn

from genai_monitor.registration.api import register_class
from genai_monitor.utils.data_hashing import Jsonable
from genai_monitor.utils.model_hashing import default_model_hashing_function
import genai_monitor.auto


class DummyPytorchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return self.fc(x)


def model_output_to_bytes(model_output: torch.Tensor) -> bytes:
    """Serialize a model output so GenAI Monitor can persist it."""
    buffer = io.BytesIO()
    torch.save(model_output, buffer)
    return buffer.getvalue()


def bytes_to_model_output(databytes: bytes) -> torch.Tensor:
    """Deserialize a stored output back into a tensor."""
    buffer = io.BytesIO(databytes)
    return torch.load(buffer, weights_only=True)


def parse_func_arguments(**kwargs) -> Jsonable:
    """Convert tensor arguments to lists so the inputs can be hashed."""
    parsed_arguments = deepcopy(kwargs)
    for key, value in parsed_arguments.items():
        if isinstance(value, torch.Tensor):
            parsed_arguments[key] = value.tolist()
    return parsed_arguments


register_class(
    cls=DummyPytorchModel,
    inference_methods=["forward"],
    model_output_to_bytes=model_output_to_bytes,
    bytes_to_model_output=bytes_to_model_output,
    parse_inference_method_arguments=parse_func_arguments,
    model_hashing_function=default_model_hashing_function,
    max_unique_instances=3,
)

model_instance = DummyPytorchModel()
x = torch.randn(1, 10)

# The first three calls execute the model, each generating and storing a unique output.
y1 = model_instance(x)
y2 = model_instance(x)
y3 = model_instance(x)

# Since `max_unique_instances`=3, subsequent calls are retrieved from GenAI Monitor.
y4 = model_instance(x)  # same as y1
y5 = model_instance(x)  # same as y2
y6 = model_instance(x)  # same as y3
```
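You can verify the round-robin cycling directly, since `torch.equal` performs an exact element-wise comparison:

```python
# Retrieved outputs cycle through the stored ones in order.
assert torch.equal(y1, y4)
assert torch.equal(y2, y5)
assert torch.equal(y3, y6)
```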