Generating Synthetic Test Data for Retrieval-Augmented Generation Systems: A Comprehensive Guide

azhar
4 min read · Sep 9, 2024


In the realm of machine learning and data science, retrieval-augmented generation (RAG) systems have gained significant traction. These systems enhance the output of language models by incorporating information retrieved from large datasets or context-specific documents. However, evaluating the performance of RAG systems effectively often poses a challenge, particularly when it comes to generating test data. In this detailed guide, we will explore how to create synthetic test data for RAG systems using an open-source library known as RAGAS. We’ll delve into the entire process from installation to implementation, with practical code snippets to facilitate understanding.

Before we proceed, let’s stay connected! Please consider following me on Medium, and don’t forget to connect with me on LinkedIn for a regular dose of data science and deep learning insights. 🚀📊🤖

What is Synthetic Test Data Generation and Why is it Important?

Synthetic test data generation involves creating artificial data that mimics the characteristics of real-world data. This is particularly useful for evaluating machine learning models, including RAG systems, when real data is scarce, costly to obtain, or subject to privacy constraints.

The key benefits of synthetic data generation include:

  1. Cost-Effectiveness: Reduces the expenses associated with data collection.
  2. Scalability: Allows for the creation of vast amounts of data in a short timeframe.
  3. Customization: Enables the creation of data tailored to specific scenarios or requirements.
  4. Privacy Compliance: Avoids complications related to sensitive data.

RAGAS: The Tool for Synthetic Data Generation

RAGAS is an open-source library that streamlines the generation of synthetic test data aimed specifically at evaluating retrieval-augmented generation systems. The library supports the creation of high-quality question-and-context pairs that make the testing and evaluation process far more systematic.

Getting Started with RAGAS

To begin using RAGAS, we need to set up our Python environment. For this guide, we will use Google Colab, allowing us to leverage cloud resources without worrying about local dependencies.

Step 1: Installation

First, we will start by installing the required libraries. Open Google Colab and run the following command to install RAGAS, LangChain, and other necessary packages:

pip install ragas langchain-openai sentence-transformers xmltodict python-dotenv

This command installs the latest versions of each package from the Python Package Index (PyPI). Be aware that the RAGAS API has changed across releases; if the imports below fail on a newer version, pinning an older release (for example, pip install "ragas<0.2") should match the code in this guide.

Step 2: Importing Required Libraries

Once the installation is complete, we can import the necessary libraries. These libraries will facilitate various functionalities, including document loading, model invocation, and data handling.

Here’s a snippet for importing the required libraries:

import os
from google.colab import userdata  # Colab's secret storage
import pandas as pd
from langchain_community.document_loaders import PubMedLoader  # fetches PubMed abstracts
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import ChatOpenAI
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context  # question types

Step 3: Setting Up Environment Variables

When working with APIs, it is crucial to handle API keys and credentials securely. In this guide, we use Colab’s userdata (the Secrets panel) to retrieve our OpenAI API key. Make sure the secret is stored in Colab before running:

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
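If you run this notebook outside Colab, `google.colab.userdata` is unavailable. A minimal sketch of a fallback, using a hypothetical `get_openai_key` helper that reads a plain environment variable (optionally populated by python-dotenv) and fails fast if it is missing:

```python
import os

def get_openai_key() -> str:
    """Hypothetical helper: fetch the OpenAI key from the environment.

    Outside Colab, export OPENAI_API_KEY in your shell or load it from a
    .env file with python-dotenv before this runs.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it or add it to a .env file"
        )
    return key
```

Failing fast here saves you from a confusing authentication error deep inside the generation step.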

Step 4: Declaring Language Models

We will use OpenAI chat models for data generation and evaluation. You can substitute other models; here we define two: one for generating data (the data generation model) and a stronger one for critiquing the generated samples (the critic model).

data_generation_model = ChatOpenAI(model='gpt-4o-mini')
critic_model = ChatOpenAI(model='gpt-4o')

Step 5: Loading Documents

For our test case, we will use biomedical literature from PubMed. We define parameters such as the query string and the maximum number of documents to load.

loader = PubMedLoader("cancer", load_max_docs=5)
documents = loader.load()

Step 6: Setting Up the Embedding Model

Embeddings are critical for understanding semantic meanings and improving retrieval processes. We’ll use embeddings from HuggingFace as follows:

model_name = "BAAI/bge-small-en" 
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Step 7: Creating a Test Set Generator Instance

RAGAS provides a TestsetGenerator class that we’ll utilize to create an instance for generating synthetic test data. We will pass in our models and the embeddings we previously defined:

generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings,
)

Step 8: Defining Distributions for Diverse Data Generation

Diversity in generated data is crucial for robust evaluation. RAGAS allows us to specify distributions of question types: simple questions, multi-context questions, and reasoning questions. For our example, let’s set the following distribution:

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

Note: Ensure that the sum of the values equals 1.
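A quick way to enforce that constraint is a small validation helper before handing the weights to RAGAS. This is a minimal sketch: the string keys below are used only for illustration, whereas the real dictionary is keyed by the imported `simple`, `multi_context`, and `reasoning` evolution objects.

```python
def validate_distribution(weights: dict) -> dict:
    """Raise if the question-type weights do not sum to 1 (within tolerance)."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights

# Illustrative string keys; in the guide the keys are evolution objects.
validate_distribution({"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1})
```

Catching a bad distribution up front is cheaper than discovering it after an API-billed generation run.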

Step 9: Generating the Synthetic Test Set

Now we will generate our synthetic test set using the generate_with_langchain_docs method. Here, we specify how many test cases we want to produce (in this case, 5):

testset = generator.generate_with_langchain_docs(documents, 5, distributions)

Step 10: Converting the Test Set to a DataFrame

Once the test set is generated, we can convert it into a Pandas DataFrame for easier manipulation and examination:

test_df = testset.to_pandas() 
print(test_df)
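Once the DataFrame exists, it is worth persisting it and checking the mix of question types you actually got. The sketch below uses toy rows as a stand-in for `testset.to_pandas()` output; the column names (`question`, `contexts`, `ground_truth`, `evolution_type`) reflect what RAGAS testsets typically contain, but treat them as assumptions for your installed version.

```python
import pandas as pd

# Toy stand-in for testset.to_pandas(); column names are assumptions
# based on typical RAGAS output.
test_df = pd.DataFrame({
    "question": ["What gene is discussed?", "How do the two studies differ?"],
    "contexts": [["abstract A"], ["abstract A", "abstract B"]],
    "ground_truth": ["TP53", "Sample size and endpoints"],
    "evolution_type": ["simple", "multi_context"],
})

test_df.to_csv("synthetic_testset.csv", index=False)  # persist for later evaluation runs
print(test_df["evolution_type"].value_counts())       # check the generated mix
```

Saving the test set to disk means later evaluation runs can reuse it without paying for generation again.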

Conclusion

In summary, generating synthetic test data for retrieval-augmented generation systems enhances the ability to evaluate models rigorously. With RAGAS, we can easily produce a variety of synthetic test cases tailored to specific domains, such as biomedicine. The key advantages of this process include improved efficiency in evaluation, reduced manual effort in creating test cases, and enhanced model reliability due to comprehensive testing across diverse scenarios.

The synthetic data generation process not only serves the immediate purpose of testing RAG systems but also offers a framework that can be applied across other domains. As machine learning practitioners continue to strive for more robust models, approaches like synthetic data generation will only grow in relevance.

For feedback, discussions, or community support, consider joining Discord servers dedicated to AI, where you can learn from others’ experiences. The AI landscape is vast, and continuous learning is key to maximizing its potential.

Feel free to experiment with the code provided in this guide, adjusting queries and parameters to suit your specific needs and applications. Happy coding!
