EleutherAI’s lm-evaluation-harness: Architecture and Configuration

A comprehensive guide to configuration, task architecture, and model integration

Category: Large Language Models
Author: Earl Potters
Published: March 21, 2025

Figure: EleutherAI’s lm-evaluation-harness architecture, showing the relationship between models, tasks, and evaluation metrics.

EleutherAI’s lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models. Used by organizations including NVIDIA, Cohere, BigScience, and MosaicML, it serves as the backend for Hugging Face’s Open LLM Leaderboard and has been cited in hundreds of research papers.

This post explores the framework’s architecture, configuration system, and integration patterns to help you understand how to use, extend, and contribute to this powerful evaluation ecosystem.

What is lm-evaluation-harness?

The Language Model Evaluation Harness is a unified framework for testing generative language models on a wide variety of benchmarks. It ensures reproducibility by using publicly available prompts and supports customized evaluations.

Key features include:

  • Over 60 standard academic benchmarks with hundreds of subtasks
  • Support for models via transformers (including quantization via GPTQ), GPT-NeoX, and Megatron-DeepSpeed
  • Fast inference with vLLM
  • Support for commercial APIs (OpenAI, TextSynth)
  • Evaluation on adapter models (like LoRA) through PEFT
  • Support for local models and benchmarks
  • Customizable prompts and metrics

Installation Options

Basic Installation

Basic installation from source:

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Or install directly from PyPI:

pip install lm-eval

Development Installation

For development and contributing:

pip install -e ".[dev]"

Optional Dependencies

The framework provides several optional dependency groups:

# For GPTQ quantization support
pip install "lm-eval[gptq]"

# For vLLM acceleration
pip install "lm-eval[vllm]"

# For multiple optional dependencies
pip install "lm-eval[gptq,vllm]"

Environment Variables

Some functionality requires specific environment variables:

  • OPENAI_API_KEY - For evaluating OpenAI models
  • ANTHROPIC_API_KEY - For evaluating Anthropic models
  • HF_TOKEN - For accessing gated Hugging Face models or pushing results to the Hub
  • LOGLEVEL - Set to “DEBUG” for detailed logging during evaluation
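These can also be set from Python before the evaluation is launched, for example in a notebook. A minimal sketch (the values are placeholders):

import os

# Placeholder credentials; substitute your own.
os.environ["OPENAI_API_KEY"] = "sk-..."   # evaluating OpenAI models
os.environ["HF_TOKEN"] = "hf_..."         # gated Hugging Face models / pushing to the Hub
os.environ["LOGLEVEL"] = "DEBUG"          # detailed logging during evaluation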

Command Line Usage

The harness can be run as a command-line tool, providing a flexible interface for model evaluation:

python -m lm_eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --num_fewshot 5

Or using the installed entry point:

lm-eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --num_fewshot 5

Common CLI Arguments

  • --model: Specifies the model type to evaluate (e.g., “hf”, “openai”, “vllm”)
  • --model_args: Parameters for model initialization (e.g., “pretrained=gpt2,dtype=float32”)
  • --tasks: Comma-separated list of tasks or task groups (e.g., “mmlu,hellaswag”)
  • --num_fewshot: Number of few-shot examples to include (default: 0)
  • --batch_size: Batch size for evaluation (use “auto” for automatic selection)
  • --device: Device to place the model on (e.g., “cuda:0”, “cpu”)
  • --output_path: Path to save evaluation results
  • --log_samples: Save per-document outputs and inputs

For more detailed information on CLI arguments, see the interface documentation which covers additional options like:

  • --cache_requests: Can be “true”, “refresh”, or “delete” to use, regenerate, or remove the cache

  • --check_integrity: Tests each selected task to confirm integrity

  • --write_out: Prints prompt and gold target string for the first document of each task (for diagnostics)

  • --show_config: Prints the full TaskConfig contents for reproducibility

  • --include_path: Accepts a path to a folder with custom YAML task configurations

  • --system_instruction: Specifies a system instruction string to prepend to the prompt

  • --apply_chat_template: Controls whether to apply a chat template to prompts

  • --fewshot_as_multiturn: Treats few-shot examples as a multi-turn conversation

  • --predict_only: Generates model outputs without computing metrics

  • --seed: Sets random seeds for reproducibility

Python API Usage

You can also use the framework programmatically:

from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

# Instantiate the model wrapper directly, then pass it to the evaluator.
model = HFLM(pretrained="gpt2")
results = simple_evaluate(model=model, tasks=["hellaswag"], num_fewshot=5)

Or, even more simply, let the harness construct the model from a name and argument string:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0
)

For more advanced usage, the evaluate() function offers the core evaluation functionality, but without some of the special handling and simplification provided by simple_evaluate(). This allows you to:

  • Use custom task implementations
  • Specify task configurations via dictionaries
  • Provide a TaskManager with custom included paths
  • Integrate with your own model training loops
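As a sketch of the TaskManager route (the path and task name are placeholders), you can point a TaskManager at a folder of custom YAML configs and pass it into the evaluation call:

from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM
from lm_eval.tasks import TaskManager

# "/path/to/custom/tasks" and "my_custom_task" are placeholders.
task_manager = TaskManager(include_path="/path/to/custom/tasks")

results = simple_evaluate(
    model=HFLM(pretrained="gpt2"),
    tasks=["my_custom_task"],
    num_fewshot=0,
    task_manager=task_manager,
)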

Model Configuration

The LM Evaluation Harness supports various model types through a unified interface. Each model type has its own configuration options.

Hugging Face Models

For standard transformers models:

lm-eval --model hf --model_args pretrained=gpt2

Additional options include:

  • dtype: Set precision (e.g., “float16”, “bfloat16”)
  • trust_remote_code: Allow custom model code (set to “true”)
  • parallelize: Shard the model across multiple GPUs using the Accelerate library
  • device_map: Control device placement (“auto”, “balanced”, etc.)
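These options map onto the HFLM constructor when the model is loaded programmatically; a minimal sketch (argument names mirror the model_args keys above):

from lm_eval.models.huggingface import HFLM

# Roughly equivalent to: --model hf --model_args pretrained=gpt2,dtype=float16,trust_remote_code=true
model = HFLM(
    pretrained="gpt2",
    dtype="float16",
    trust_remote_code=True,
    device="cuda:0",
)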

API-Based Models

For commercial API models:

# OpenAI
lm-eval --model openai-completions --model_args model=gpt-3.5-turbo-instruct

# Anthropic
lm-eval --model anthropic --model_args model=claude-2

API models typically require authentication via environment variables.

Accelerated Inference

For faster evaluation using vLLM:

lm-eval --model vllm --model_args pretrained=meta-llama/Llama-2-7b-hf

Local Server Models

For models hosted on a local server:

lm-eval --model local-completions --model_args base_url=http://localhost:8000/v1/completions
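The same endpoint can be targeted from the Python API as well. A hedged sketch, assuming the usual model_args keys for this backend (model, base_url, num_concurrent) and a placeholder model name:

from lm_eval import simple_evaluate

# "my-local-model" is a placeholder for whatever the server is serving.
results = simple_evaluate(
    model="local-completions",
    model_args="model=my-local-model,base_url=http://localhost:8000/v1/completions,num_concurrent=1",
    tasks=["hellaswag"],
)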

Task Configuration

Tasks in the harness are configured through YAML files, providing a declarative way to define evaluation setups.

Understanding Task YAML Structure

A basic task configuration includes:

task: task_name
dataset_path: huggingface/dataset_name
dataset_name: subset_name
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true

Key fields include:

  • task: Unique identifier for the task
  • dataset_path: Path to the dataset on HuggingFace Hub
  • doc_to_text: Template for input text (using Jinja2)
  • doc_to_target: Template for target output
  • metric_list: Metrics for evaluation

Multiple Choice Tasks

For multiple choice tasks, additional configuration is needed:

output_type: multiple_choice
doc_to_text: "{{question}}\nAnswer:"
doc_to_target: 2  # Index of correct answer
doc_to_choice: "{{[choice1, choice2, choice3, choice4]}}"
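Conceptually, a multiple_choice task is scored by asking the model for the log-likelihood of each choice given the prompt and picking the highest-scoring one. The snippet below is a standalone illustration of that idea, not the harness’s internal code, and the scores are made up:

# Standalone illustration of multiple-choice scoring (scores are invented).
context = "Question: What color is a clear daytime sky?\nAnswer:"
choices = ["red", "blue", "green", "yellow"]
gold_index = 1  # the doc_to_target index

# In the harness these would come from LM.loglikelihood over (context, " " + choice) pairs.
loglikelihoods = [-7.2, -1.3, -6.8, -8.0]

pred_index = max(range(len(choices)), key=lambda i: loglikelihoods[i])
accuracy = 1.0 if pred_index == gold_index else 0.0
print(choices[pred_index], accuracy)  # blue 1.0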

Using Filters

Filters allow post-processing of model outputs:

filter_list:
  - name: "extract-answer"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\d+)"
      - function: "take_first"

Using Local Datasets

To load a local dataset for evaluation, you can specify data files in the dataset_kwargs field:

dataset_path: json
dataset_name: null
dataset_kwargs:
  data_files: /path/to/my/json

Or with files already split into separate directories:

dataset_path: arrow
dataset_kwargs:
  data_files:
    train: /path/to/arrow/train/data-00000-of-00001.arrow
    validation: /path/to/arrow/validation/data-00000-of-00001.arrow

Advanced Features

Chat Templates

For evaluating chat models with the appropriate formatting:

lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template

This applies the model’s chat template to the prompt, essential for instruction-tuned models.

For models with multiple chat templates:

lm-eval --apply_chat_template chatml

The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. When apply_chat_template is set to True, the target delimiter is now set to an empty string instead of using the configured delimiter.

This prevents interference between chat template formatting and the default delimiter system, which is particularly important for multiple choice tasks where the template itself handles spacing.

Few-Shot as Multi-Turn Conversations

Format few-shot examples as a conversation history:

lm-eval --num_fewshot 3 --apply_chat_template --fewshot_as_multiturn
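When driving the evaluation from Python, recent releases expose the same switches as keyword arguments on simple_evaluate; a sketch assuming that interface:

from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

results = simple_evaluate(
    model=HFLM(pretrained="mistralai/Mistral-7B-Instruct-v0.2"),
    tasks=["mmlu"],
    num_fewshot=3,
    apply_chat_template=True,    # format prompts with the model's chat template
    fewshot_as_multiturn=True,   # render few-shot examples as earlier conversation turns
)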

Task Groups and Benchmarks

Run multiple related tasks as a benchmark:

lm-eval --model hf --model_args pretrained=gpt2 --tasks mmlu

This runs all MMLU subtasks and provides both individual and aggregate metrics.

For creating your own group configurations, you can create a group YAML config with a group key which denotes the name of the group and a task key which lists the tasks to include. A good example is in lm_eval/tasks/mmlu/default/_mmlu.yaml.

Decontamination

Check for training data contamination:

lm-eval --model hf --model_args pretrained=gpt2 --tasks sciq

When decontamination is enabled in a task’s configuration, the harness checks for n-gram overlaps between that task’s test documents and the training data.

The decontamination procedure tests model generalization by detecting whether test set data exists in the training set (contamination). OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document, using N values between 8 and 13 depending on dataset.
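The core idea is simple enough to sketch outside the harness. The toy check below (not the project’s implementation) flags a test document as contaminated if any of its 13-grams also appears in the training corpus:

# Toy contamination check: flag a test document if any of its N-grams
# (N=13 here) also appears in the training corpus.
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_doc, train_ngrams, n=13):
    return bool(ngrams(test_doc, n) & train_ngrams)

train_docs = ["...training documents go here..."]
train_ngrams = set().union(*(ngrams(doc) for doc in train_docs))
print(is_contaminated("...a held-out test document...", train_ngrams))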

Caching Results

Cache evaluated results to speed up repeated runs:

lm-eval --use_cache /path/to/cache --cache_requests true

Creating Custom Tasks

Task File Structure

To create a new task:

  1. Create a YAML file in lm_eval/tasks/your_task_name.yaml
  2. Configure dataset parameters, prompt templates, and metrics
  3. Register the task with a unique name

For complex preprocessing, you can add Python functions:

process_docs: !function utils.process_docs

With a corresponding Python file:

# utils.py
def process_docs(dataset):
    def _process_doc(doc):
        # Example preprocessing (field names are illustrative; use your dataset's columns):
        doc["question"] = doc["question"].strip()
        return doc
    return dataset.map(_process_doc)

Writing Prompt Templates

Prompts are defined with doc_to_text, doc_to_target, and, optionally, doc_to_choice. doc_to_text defines the input string the model is given, while doc_to_target and doc_to_choice are used to generate the target text.

doc_to_target can be either a text string (the target itself) or an integer giving the index of the correct label. When it is an index, doc_to_choice must also be set to the list of possible choice strings.

For simple cases, you can enter the feature name directly:

doc_to_text: startphrase
doc_to_target: label

The evaluation harness supports the Jinja2 templating language for writing prompts. For example:

doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"

Here {{passage}} will be replaced by doc["passage"] and {{question}} by doc["question"] when the prompt template is rendered.
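To preview how a template will render, you can apply it to a sample document with the jinja2 package directly; the harness handles this rendering internally, so this is only a standalone illustration:

from jinja2 import Template

doc = {"passage": "The sky appears blue because of Rayleigh scattering.",
       "question": "Why does the sky appear blue"}

template = Template("{{passage}}\nQuestion: {{question}}?\nAnswer:")
print(template.render(**doc))
# The sky appears blue because of Rayleigh scattering.
# Question: Why does the sky appear blue?
# Answer: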

Importing Prompts from Promptsource

You can load prompts from Promptsource by using the use_prompt argument:

use_prompt: "promptsource:GPT-3 Style"

If you would like to run evaluation on all prompt templates:

use_prompt: "promptsource:*"

Creating Task Filters

Filters allow you to post-process model outputs before scoring them. A full list of supported filter operations can be found in lm_eval/filters/__init__.py. Contributions of new filter types are welcome!

Multiple filter pipelines can run on the same model outputs generated in one run on a task. This enables scenarios like:

  1. Post-processing output text by truncating or extracting answers
  2. Ensembling over multiple “takes” on a document

For example, the file lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml emulates the setup from Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al.), generating multiple chain-of-thought outputs, extracting numeric answers, and taking a majority vote.
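The heart of that pipeline, extracting a numeric answer from each sampled chain of thought and then majority-voting, can be sketched in a few lines of plain Python (illustrative only, not the harness’s filter code):

import re
from collections import Counter

# Invented model outputs; the regex mirrors the filter example above.
samples = [
    "She sold 10 + 32 = 42 cookies. The answer is 42",
    "Adding the two batches gives 42. The answer is 42",
    "Miscounting one batch gives 40. The answer is 40",
]

answers = [m.group(1) for s in samples if (m := re.search(r"The answer is (\d+)", s))]
majority_answer, votes = Counter(answers).most_common(1)[0]
print(majority_answer, votes)  # 42 2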

Best Practices and Common Pitfalls

  1. Tokenization Alignment
    • Verify model logits align with target token positions
    • Prevent off-by-one errors in likelihood calculation
    • Use reference implementations from HFLM as guides
  2. Template Safety
    • Escape special characters in Jinja templates
    • Validate few-shot example field consistency
    • Implement template versioning through tokenizer_name
  3. Performance Considerations
    • Implement request reordering for large evaluations
    • Utilize batch processing where supported
    • Profile memory usage during generation tasks
  4. Evaluation Validity
    • Separate few-shot and test splits
    • Audit metric implementations for task appropriateness
    • Verify chat template application through debug output
  5. Resource Management
    • Use --batch_size auto to automatically determine optimal batch size
    • For API models, set appropriate num_concurrent and timeout values
    • Consider using --limit for debugging to evaluate only a subset of documents

Adding New Models to the Framework

To add a new model type, subclass the lm_eval.api.model.LM class, which enforces a common interface:

from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class MyCustomLM(LM):
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Implementation for calculating conditional log probabilities
        ...

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Implementation for calculating full-text log probabilities
        ...

    def generate_until(self, requests: list[Instance]) -> list[str]:
        # Implementation for free-form text generation
        ...

These methods support three types of requests:

  • generate_until: Generates text from the model until reaching stopping criteria
  • loglikelihood: Calculates log probability of a target string given an input
  • loglikelihood_rolling: Calculates log probability of an entire input string

To make your model usable via CLI, use the lm_eval.api.registry.register_model decorator:

from lm_eval.api.registry import register_model

@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    # Implementation

For adding chat templates, implement three additional methods:

from typing import Dict, List, Union


class MyCustomLM(LM):
    @property
    def tokenizer_name(self) -> str:
        """Return the name of the model's tokenizer and/or chat template."""

    def chat_template(self, chat_template: Union[bool, str] = False) -> str:
        """Get the appropriate chat template string."""

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        """Process a chat history into a string for the model."""

Practical Examples

Evaluating a Local Hugging Face Model

lm-eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \
  --tasks mmlu,hellaswag \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/mistral-7b.json \
  --apply_chat_template

Evaluating a Quantized Model

lm-eval --model hf \
  --model_args pretrained=TheBloke/Llama-2-13B-GPTQ,autogptq=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1

Evaluating an API Model

# Set OPENAI_API_KEY environment variable first
lm-eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,bbh \
  --num_fewshot 5 \
  --batch_size 10

Self-Consistency Evaluation

lm-eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-70b-hf \
  --tasks gsm8k-cot-self-consistency \
  --num_fewshot 8 \
  --batch_size 4 \
  --gen_kwargs temperature=0.7,top_p=0.95

Working with Vision-Language Models

The framework also supports multimodal evaluation with the HFMultimodalLM class for models like LLaVA and IDEFICS:

from lm_eval.models.hf_vlms import HFMultimodalLM

# Initialize the model
model = HFMultimodalLM(
    pretrained="llava-hf/llava-1.5-7b-hf",
    device_map="auto"
)

# Generate responses for multimodal inputs
results = model.generate_until(...)

Contributing to lm-evaluation-harness

EleutherAI welcomes contributions to improve the framework. The project follows these priorities for addressing concerns about prompting and evaluation details:

  1. Use procedures with widespread agreement among LLM trainers
  2. Follow clear and unambiguous official implementations
  3. Use procedures with widespread agreement among LLM evaluators
  4. Choose from common implementations when there’s no universal agreement, preferring those found in LLM training papers

They maintain an active Discord server with the #lm-thunderdome channel dedicated to developing this project and #release-discussion for support.

Important resources include:

  • Documentation pages in the docs directory
  • GitHub Milestones for tracking progress toward version releases
  • Project Board for tracking work items and feature requests
  • Discord discussions in the #lm-thunderdome channel

Contributing a New Task

To contribute a new task:

  1. Fork the repository
  2. Create a YAML configuration file
  3. Verify against reference implementations
  4. Add documentation and test results
  5. Submit a pull request

For first-time contributors, the team maintains a list of good first issues, which can be found on the project board or by filtering GitHub Issues.

Contributing a New Model Type

To add support for a new model type:

  1. Implement a subclass of lm_eval.api.model.LM
  2. Register your model with @register_model
  3. Implement the required interface methods
  4. Add documentation and tests
  5. Submit a pull request

Code style guidelines:

  • LM Evaluation Harness uses ruff for linting via pre-commit
  • Install dev tools via pip install lm_eval[dev] or pip install -e ".[dev]"
  • Run pre-commit install to ensure linters and checks will run upon committing

Improved Documentation with MkDocs

I’ve recently contributed to the lm-evaluation-harness project by adding MkDocs support to enhance the documentation experience. This improvement provides a more navigable and user-friendly documentation interface with automatic navigation, search functionality, and better organization of the existing documentation.

Pull Request for adding MkDocs to EleutherAI’s lm-evaluation-harness

You can see a preview of the MkDocs implementation at my fork’s documentation site. The pull request is currently open and will hopefully be merged into the main repository soon, making the documentation more accessible to new users and contributors.

The MkDocs integration preserves all the existing documentation while providing:

  • Modern, responsive documentation UI
  • Automatic navigation sidebar
  • Full-text search capabilities
  • Improved readability on mobile devices
  • Better organization of the existing documentation files

Conclusion

EleutherAI’s evaluation framework provides a standardized way to assess language model capabilities across a wide range of tasks. By separating the evaluation logic from model implementation, it enables fair comparison between different models and architectures. The declarative configuration system makes it easy to add new tasks and benchmarks, contributing to the growing ecosystem of LLM evaluation.

Whether you’re developing a new model or researching evaluation methodologies, understanding these evaluation methods is crucial for rigorous assessment of language model capabilities.

References

  1. EleutherAI lm-evaluation-harness GitHub Repository
  2. Official Task Guide
  3. New Task Guide
  4. Weights & Biases: Evaluating LLMs with EleutherAI
  5. Mozilla AI: LM Buddy Evaluation Concepts
  6. Red Hat: Evaluating Large Language Models
  7. API Guide Documentation
  8. Interface Documentation
  9. Model Guide Documentation