JSON Enforcement
deepeval metrics allow you to use any custom LLM for evaluation, from LangChain and LlamaIndex modules to Hugging Face Transformers models. Most of these metrics use the custom LLM to generate verdicts, statements, and other intermediate responses that are combined into the final metric score for each test case. These generated responses are usually prompted to be in JSON format.
However, the responses do not always come out as complete JSON objects, causing the deepeval metric to raise the error: “ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model,” which prevents the metric from completing the evaluation.
With GPT-4 or GPT-4o, this issue almost never occurs. For smaller and less powerful LLMs, however, prompt engineering alone is often insufficient to produce a valid JSON output; frequently it is just a missing “}” or a misspelled key. Regardless of what is causing the invalid JSON, it is important to find a workaround, since a single malformed response stops the entire evaluation.
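For example, a response that drops a single closing brace is enough to break standard JSON parsing, which is what surfaces as the error above (an illustrative snippet; the verdicts payload is made up):

import json

# A typical failure mode: the model forgot the outer closing brace.
bad_output = '{"verdicts": [{"verdict": "yes", "reason": "The answer is relevant."}]'

try:
    json.loads(bad_output)
except json.JSONDecodeError as e:
    print(f"Invalid JSON: {e}")  # this is what halts the evaluation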
This guide will show you several methods for confining your LLM's output by leveraging pydantic models to validate it.
Hugging Face models
1. Install lm-format-enforcer
Begin by installing the lm-format-enforcer package via pip:
pip install lm-format-enforcer
The lm-format-enforcer combines a character-level parser with a tokenizer prefix tree. Unlike libraries that force the model into an exact output template, it only allows tokens that can still lead to an output satisfying the format constraints, which helps preserve output quality.
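To make this concrete, here is a minimal sketch of how a pydantic schema is turned into a parser that the enforcer uses to constrain generation (the Verdict schema is an illustrative assumption, not one of deepeval's internal models):

from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser

# Hypothetical schema for illustration: during generation, the enforcer only
# permits tokens that can still be completed into JSON matching this schema.
class Verdict(BaseModel):
    verdict: str
    reason: str

parser = JsonSchemaParser(Verdict.model_json_schema())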
2. Build your custom LLM
Create your custom LLM by inheriting from the DeepEvalBaseLLM base class. We will be creating a custom Mistral 7B LLM using Hugging Face's transformers library for evaluation.
from deepeval.models import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Mistral 7B"

    def generate(self, prompt: str) -> str:
        ...

    async def a_generate(self, prompt: str) -> str:
        ...
3. Populate the generate method
This process involves defining an additional parameter, pydantic_model, which takes a BaseModel class from Pydantic. Utilize the JsonSchemaParser class and the build_transformers_prefix_allowed_tokens_fn function to ensure that the model's outputs strictly adhere to the defined JSON schema.
from pydantic import BaseModel
from transformers import pipeline
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

class Mistral7B(DeepEvalBaseLLM):
    ...  # __init__, load_model, and get_model_name as defined above

    def generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        hf_pipeline = pipeline(
            'text-generation',
            model=self.load_model(),
            tokenizer=self.tokenizer,
            device_map='auto'
        )
        # Build a character-level parser from the supplied pydantic model and
        # use it to constrain which tokens the pipeline may generate.
        parser = JsonSchemaParser(pydantic_model.model_json_schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            hf_pipeline.tokenizer,
            parser
        )
        output_dict = hf_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        return output_dict[0]['generated_text'][len(prompt):]

    async def a_generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        return self.generate(prompt, pydantic_model)
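With the generate method populated, the custom LLM can be instantiated and called like any other model. Below is a minimal usage sketch; the model checkpoint and the Statements schema are illustrative assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer
from pydantic import BaseModel

class Statements(BaseModel):  # hypothetical schema for illustration
    statements: list[str]

# Assumed checkpoint for illustration; any instruct-tuned causal LM works.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b.generate(
    "Generate a JSON with a list of statements about the Eiffel Tower.",
    pydantic_model=Statements,
))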
LM-format-enforcer
The LM-format-enforcer is compatible with any Python-based language model, including widely used frameworks like Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, NVIDIA TensorRT-LLM, and ExLlamaV2. For detailed information about the package and advanced usage guidelines, visit the LM-format-enforcer GitHub page.
If the pydantic_model argument is not provided in the generate method of your custom LLM, you can still run metric evaluations. However, omitting this argument increases the likelihood of encountering invalid JSON errors during output generation.
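Either way, a custom model like the Mistral7B above is passed to a deepeval metric through its model parameter. The snippet below is a sketch assuming AnswerRelevancyMetric and a toy test case:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The metric calls mistral_7b.generate(...) internally, supplying its own
# pydantic models when the pydantic_model parameter is available.
metric = AnswerRelevancyMetric(model=mistral_7b)
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)
metric.measure(test_case)
print(metric.score, metric.reason)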
OpenAI, Anthropic, Cohere, Gemini, and LiteLLM models
1. Install instructor
Begin by installing the instructor package via pip:
pip install -U instructor
Instructor is a user-friendly Python library built on top of Pydantic. It lets you confine your LLM's output simply by wrapping your existing LLM client with an Instructor method.
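For instance, with an OpenAI client, wrapping it in Instructor and passing a response_model is all that is needed to get validated, structured output (a minimal sketch; the Verdict schema and model name are illustrative assumptions):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Verdict(BaseModel):  # hypothetical schema for illustration
    verdict: str
    reason: str

# Instructor patches the client so responses are parsed and validated
# against the pydantic model supplied via response_model.
client = instructor.from_openai(OpenAI())
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is the sky blue? Give a verdict and reason."}],
    response_model=Verdict,
)
print(resp.verdict, resp.reason)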
2. Build your custom LLM
Create your custom LLM by inheriting from the DeepEvalBaseLLM base class. We will be creating a custom Gemini 1.5 LLM using the Google AI Python SDK.
from deepeval.models import DeepEvalBaseLLM

class GeminiFlash(DeepEvalBaseLLM):
    def __init__(self, model_name):
        self.model_name = model_name

    def load_model(self):
        ...

    def get_model_name(self):
        return self.model_name

    def generate(self, prompt: str) -> str:
        ...

    async def a_generate(self, prompt: str) -> str:
        ...
3. Populate the generate method
This process involves defining an additional parameter, pydantic_model, which takes a BaseModel class from Pydantic. The instructor client lets you create a structured response by defining a response_model parameter, which accepts a pydantic model that inherits from BaseModel.
import instructor
from pydantic import BaseModel
import google.generativeai as genai

class GeminiFlash(DeepEvalBaseLLM):
    ...  # __init__, load_model, and get_model_name as defined above

    def generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        # Wrap the Gemini client with instructor so the response is parsed
        # and validated against the supplied pydantic model.
        client = instructor.from_gemini(
            client=genai.GenerativeModel(
                model_name="models/gemini-1.5-flash-latest",  # model defaults to "gemini-pro"
            ),
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=pydantic_model,
        )
        return resp

    async def a_generate(self, prompt: str, pydantic_model: BaseModel) -> str:
        return self.generate(prompt, pydantic_model)
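Once populated, the custom Gemini LLM can be used directly. The sketch below assumes a GOOGLE_API_KEY environment variable and reuses an illustrative Verdict schema:

import os
import google.generativeai as genai
from pydantic import BaseModel

class Verdict(BaseModel):  # hypothetical schema for illustration
    verdict: str
    reason: str

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env var

gemini_flash = GeminiFlash(model_name="models/gemini-1.5-flash-latest")
print(gemini_flash.generate(
    "Is Paris the capital of France? Give a verdict and reason.",
    pydantic_model=Verdict,
))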
Instructor
Instructor simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4-Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with models not covered here, please consult the Instructor documentation.
How All of This Fits into Improving Evaluations
deepeval metrics will automatically look for the pydantic_model argument in custom LLMs. If it is supplied, the metric will use the associated pydantic model for the task. If the pydantic_model field is not provided, the evaluation will still run, but there is a higher chance of it failing to complete due to invalid JSON output from the LLM.
The pydantic_model field should always be a class that inherits from BaseModel!
Regardless, before running evaluations, you should test your generate function to make sure the pydantic models are configured correctly and to catch issues before they arise during the evaluation process. You should also be aware that there is a tradeoff in evaluation accuracy when using JSON-pydantic confinement.
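A quick sanity check before kicking off a full evaluation might look like the following sketch, reusing the mistral_7b model from earlier and an illustrative Verdict schema:

from pydantic import BaseModel, ValidationError

class Verdict(BaseModel):  # hypothetical schema for illustration
    verdict: str
    reason: str

output = mistral_7b.generate(
    "Is 'Paris is in France' correct? Give a verdict and reason in JSON.",
    pydantic_model=Verdict,
)

# The lm-format-enforcer path returns a JSON string, so validate it against
# the schema; a ValidationError here means the confinement is misconfigured.
try:
    print(Verdict.model_validate_json(output))
except ValidationError as e:
    print("pydantic confinement is misconfigured:", e)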