
JSON Enforcement

deepeval metrics allow you to use any custom LLM for evaluation, from LangChain and LlamaIndex modules to Hugging Face's Transformers models. Most of these metrics use the custom LLM to generate verdicts, statements, and other intermediate responses that produce the final metric score for each test case. These generated responses are often prompted to be in JSON format.

danger

However, the responses do not always come out as complete JSON objects, causing the deepeval metric to raise the error: “ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model,” which prevents the metric from completing the evaluation.

With GPT-4 or GPT-4o, this issue almost never occurs. For smaller and less powerful LLMs, however, prompt engineering alone is often insufficient to produce valid JSON output: frequently it's just a missing "}" or a misspelled key. Regardless of what's causing the invalid JSON, it's important to find a workaround, since this error halts the entire evaluation.

This guide will show you various methods of confining your LLM's output by leveraging Pydantic models to validate it.

Hugging Face models

1. Install lm-format-enforcer

Begin by installing the lm-format-enforcer package via pip:

pip install lm-format-enforcer

The LM-format-enforcer integrates a character-level parser with a tokenizer prefix tree. Unlike other libraries that enforce exact output formats, this approach allows LLMs to generate tokens that sequentially satisfy output format constraints, thus improving output quality.
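To make the mechanism concrete, the parser is built from a JSON schema, which Pydantic generates directly from any BaseModel. Below is a minimal sketch (the Statements model is a hypothetical example, not something defined by deepeval):

from typing import List
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser

# Hypothetical schema, similar in spirit to the intermediate
# outputs deepeval metrics prompt for.
class Statements(BaseModel):
    statements: List[str]

# The parser walks the schema character by character and only
# allows tokens that keep the generated output valid JSON.
parser = JsonSchemaParser(Statements.model_json_schema())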

2. Build your custom LLM

Create your custom LLM using the DeepEvalBaseLLM base class. We will be creating a custom Mistral7B LLM using Hugging Face's transformers library for evaluation.

from deepeval.models import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Mistral 7B"

    def generate(self, prompt: str) -> str:
        ...

    async def a_generate(self, prompt: str) -> str:
        ...

3. Populate the generate method

This process involves defining an additional parameter, pydantic_model, which takes a BaseModel class from Pydantic. Utilize the JsonSchemaParser class and the build_transformers_prefix_allowed_tokens_fn function to ensure that the model's outputs strictly adhere to the defined JSON schema.

from pydantic import BaseModel
from transformers import pipeline
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

...

    def generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        hf_pipeline = pipeline(
            'text-generation',
            model=self.load_model(),
            tokenizer=self.tokenizer,
            device_map='auto'
        )
        # Build a parser from the pydantic model's JSON schema and turn it
        # into a prefix function that constrains token generation.
        parser = JsonSchemaParser(pydantic_model.model_json_schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            hf_pipeline.tokenizer, parser
        )
        output_dict = hf_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        return output_dict[0]['generated_text'][len(prompt):]

    async def a_generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        return self.generate(prompt, pydantic_model)
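As a quick sanity check before plugging the model into a metric, you can call generate directly with a small schema. The snippet below is only a sketch: the Statements model is a hypothetical placeholder, and it assumes the Mistral-7B-Instruct checkpoint from the Hugging Face Hub:

from typing import List
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

class Statements(BaseModel):
    statements: List[str]

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)

json_output = mistral_7b.generate(
    "Break the text into standalone statements and reply in JSON: ...",
    pydantic_model=Statements,
)
print(json_output)  # a JSON string conforming to the Statements schema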

LM-format-enforcer

The LM-format-enforcer is compatible with any Python-based language model, including widely used frameworks like Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, NVIDIA TensorRT-LLM, and ExLlamaV2. For detailed information about the package and advanced usage guidelines, visit the LM-format-enforcer GitHub page.

info

If the pydantic_model argument is not provided in the generate method of your custom LLM, you can still run metric evaluations. However, omitting this argument increases the likelihood of encountering invalid JSON errors during output generation.
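For reference, a schema-free generate inside the same class might look like the following sketch; it relies purely on prompting, so valid JSON is not guaranteed:

    def generate(self, prompt: str) -> str:
        # No schema enforcement: the prompt alone has to coax the
        # model into producing valid JSON.
        hf_pipeline = pipeline(
            'text-generation',
            model=self.load_model(),
            tokenizer=self.tokenizer,
            device_map='auto'
        )
        output_dict = hf_pipeline(prompt)
        return output_dict[0]['generated_text'][len(prompt):]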

OpenAI, Anthropic, Cohere, Gemini, and LiteLLM models

1. Install instructor

Begin by installing the instructor package via pip:

pip install -U instructor

Instructor is a user-friendly Python library built on top of Pydantic. It enables straightforward confinement of your LLM's output by encapsulating your LLM client within an Instructor method.
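For example, here is a minimal sketch of the same pattern with an OpenAI client; the UserInfo schema and the model name are purely illustrative:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Wrapping the client lets every call accept a response_model,
# and the response is parsed and validated against it.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user.name, user.age)  # John Doe 30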

2. Build your custom LLM

Create your custom LLM using the DeepEvalBaseLLM base class. We will be creating a custom Gemini 1.5 LLM using the Google AI Python SDK.

import google.generativeai as genai
from deepeval.models import DeepEvalBaseLLM

class GeminiFlash(DeepEvalBaseLLM):
    def __init__(
        self,
        model_name
    ):
        self.model_name = model_name

    def load_model(self):
        return genai.GenerativeModel(model_name=self.model_name)

    def get_model_name(self):
        return self.model_name

    def generate(self, prompt: str) -> str:
        ...

    async def a_generate(self, prompt: str) -> str:
        ...

3. Populate the generate method

This process involves defining an additional parameter, pydantic_model, which takes a BaseModel class from Pydantic. The instructor client lets you create a structured response by defining a response_model parameter, which accepts a pydantic_model that inherits from BaseModel.

import instructor
from pydantic import BaseModel
import google.generativeai as genai

...

    def generate(self, prompt: str, pydantic_model: BaseModel) -> BaseModel:
        # Wrap the Gemini client with instructor so that the response is
        # parsed and validated against the supplied pydantic model.
        client = instructor.from_gemini(
            client=genai.GenerativeModel(
                model_name=self.model_name,  # e.g. "models/gemini-1.5-flash-latest"
            ),
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=pydantic_model,
        )
        return resp

    async def a_generate(self, prompt: str, pydantic_model: BaseModel) -> BaseModel:
        return self.generate(prompt, pydantic_model)
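To try it out, instantiate the model and call generate with a small schema of your own. This is only a sketch: the Statements model is a hypothetical placeholder, and the API key setup uses the SDK's standard genai.configure call:

from typing import List
import google.generativeai as genai
from pydantic import BaseModel

class Statements(BaseModel):
    statements: List[str]

genai.configure(api_key="YOUR_GOOGLE_API_KEY")
gemini_flash = GeminiFlash(model_name="models/gemini-1.5-flash-latest")

statements = gemini_flash.generate(
    "Break the text into standalone statements: ...",
    pydantic_model=Statements,
)
print(statements.statements)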

Instructor

Instructor simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4-Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with models not covered here, please consult the Instructor documentation.

How All of This Fits into Improving Evaluations

deepeval metrics will automatically look for the pydantic_model argument in your custom LLM's generate method. If it is present, deepeval will pass the appropriate Pydantic model for each generation task. If the pydantic_model parameter is not defined, the evaluation will still run, but there is a higher chance of it failing to complete due to invalid JSON output from the LLM.
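For example, once your custom LLM is defined, you simply pass it to a metric through the model parameter. The sketch below uses AnswerRelevancyMetric with the Mistral7B model from earlier, but the same applies to any deepeval metric that accepts an evaluation model:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# mistral_7b is the custom LLM built earlier in this guide
metric = AnswerRelevancyMetric(model=mistral_7b, threshold=0.7)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)
metric.measure(test_case)
print(metric.score, metric.reason)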

caution

The pydantic_model argument should always be a class that inherits from BaseModel!

Regardless, before running evaluations, you should test your generate method to ensure the Pydantic models are configured correctly and prevent issues from surfacing mid-evaluation. You should also be aware that there is a tradeoff in evaluation accuracy when using JSON confinement.
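A minimal test, assuming the Mistral7B model from earlier and a hypothetical Verdict schema, could look like this:

import json
from pydantic import BaseModel

class Verdict(BaseModel):
    verdict: str
    reason: str

# Generate once with a simple prompt and validate the raw output
# against the schema before kicking off a full evaluation.
raw_output = mistral_7b.generate(
    "Is the sky blue? Reply in JSON with a 'verdict' and a 'reason'.",
    pydantic_model=Verdict,
)
print(Verdict(**json.loads(raw_output)))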