EleutherAI’s lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models. Used by organizations including NVIDIA, Cohere, BigScience, and MosaicML, it serves as the backend for Hugging Face’s Open LLM Leaderboard and has been cited in hundreds of research papers.
This post explores the framework’s architecture, configuration system, and integration patterns to help you understand how to use, extend, and contribute to this powerful evaluation ecosystem.
What is lm-evaluation-harness?
The Language Model Evaluation Harness is a unified framework for testing generative language models on a wide variety of benchmarks. It ensures reproducibility by using publicly available prompts and supports customized evaluations.
Key features include:
- Over 60 standard academic benchmarks with hundreds of subtasks
- Support for models via transformers (including quantization via GPTQ), GPT-NeoX, and Megatron-DeepSpeed
- Fast inference with vLLM
- Support for commercial APIs (OpenAI, TextSynth)
- Evaluation on adapter models (like LoRA) through PEFT
- Support for local models and benchmarks
- Customizable prompts and metrics
Installation Options
Basic Installation
Basic installation from source:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Or install directly from PyPI:
pip install lm-eval
Development Installation
For development and contributing:
pip install -e ".[dev]"
Optional Dependencies
The framework provides several optional dependency groups:
# For GPTQ quantization support
pip install "lm-eval[gptq]"
# For vLLM acceleration
pip install "lm-eval[vllm]"
# For multiple optional dependencies
pip install "lm-eval[gptq,vllm]"
Environment Variables
Some functionality requires specific environment variables:
- OPENAI_API_KEY: For evaluating OpenAI models
- ANTHROPIC_API_KEY: For evaluating Anthropic models
- HF_TOKEN: For accessing gated Hugging Face models or pushing results to the Hub
- LOGLEVEL: Set to “DEBUG” for detailed logging during evaluation
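For example, a typical shell setup before launching an evaluation might look like this (placeholder values shown):
# Placeholder values -- replace with your own credentials
export OPENAI_API_KEY="sk-..."
export HF_TOKEN="hf_..."
export LOGLEVEL=DEBUG  # verbose logging while troubleshooting a run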
Command Line Usage
The harness can be run as a command-line tool, providing a flexible interface for model evaluation:
python -m lm_eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --num_fewshot 5
Or using the installed entry point:
lm-eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --num_fewshot 5
Common CLI Arguments
- --model: Specifies the model type to evaluate (e.g., “hf”, “openai”, “vllm”)
- --model_args: Parameters for model initialization (e.g., “pretrained=gpt2,dtype=float32”)
- --tasks: Comma-separated list of tasks or task groups (e.g., “mmlu,hellaswag”)
- --num_fewshot: Number of few-shot examples to include (default: 0)
- --batch_size: Batch size for evaluation (use “auto” for automatic selection)
- --device: Device to place the model on (e.g., “cuda:0”, “cpu”)
- --output_path: Path to save evaluation results
- --log_samples: Save per-document outputs and inputs
For more detailed information on CLI arguments, see the interface documentation which covers additional options like:
- --cache_requests: Can be “true”, “refresh”, or “delete” to use, regenerate, or remove the cache
- --check_integrity: Tests each selected task to confirm integrity
- --write_out: Prints the prompt and gold target string for the first document of each task (for diagnostics)
- --show_config: Prints the full TaskConfig contents for reproducibility
- --include_path: Accepts a path to a folder with custom YAML task configurations
- --system_instruction: Specifies a system instruction string to prepend to the prompt
- --apply_chat_template: Controls whether to apply a chat template to prompts
- --fewshot_as_multiturn: Treats few-shot examples as a multi-turn conversation
- --predict_only: Generates model outputs without computing metrics
- --seed: Sets random seeds for reproducibility
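As an illustrative combination of these flags (the task name and directory are hypothetical), a run that loads custom task YAMLs and saves per-sample outputs might look like:
lm-eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks my_custom_task \
  --include_path ./my_task_configs \
  --write_out \
  --log_samples \
  --output_path results/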
Python API Usage
You can also use the framework programmatically:
from lm_eval import evaluator, tasks
from lm_eval.models import get_model

model = get_model("hf", pretrained="gpt2")
results = evaluator.evaluate(model, tasks=["hellaswag"], num_fewshot=5)
For even simpler usage:
import lm_eval
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0,
)
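The returned results object is a nested dictionary; one quick way to inspect the aggregated scores (the exact metric keys vary by task and harness version) is:
import json

# Per-task aggregated metrics live under the "results" key
print(json.dumps(results["results"], indent=2, default=str))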
For more advanced usage, the evaluate() function offers the core evaluation functionality, but without some of the special handling and simplification provided by simple_evaluate(). This allows you to:
- Use custom task implementations
- Specify task configurations via dictionaries
- Provide a TaskManager with custom included paths
- Integrate with your own model training loops
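A minimal sketch of passing a TaskManager with a custom include path (shown here via simple_evaluate(), which also accepts one, assuming the 0.4.x API; the directory ./my_task_configs and the task my_custom_task are hypothetical):
import lm_eval
from lm_eval.tasks import TaskManager

# Hypothetical folder containing custom task YAML files
task_manager = TaskManager(include_path="./my_task_configs")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["my_custom_task"],
    num_fewshot=0,
    task_manager=task_manager,
)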
Model Configuration
The LM Evaluation Harness supports various model types through a unified interface. Each model type has its own configuration options.
Hugging Face Models
For standard transformers models:
lm-eval --model hf --model_args pretrained=gpt2
Additional options include:
- dtype: Set precision (e.g., “float16”, “bfloat16”)
- trust_remote_code: Allow custom model code (set to “true”)
- use_accelerate: Use the Accelerate library for distributed inference
- device_map: Control device placement (“auto”, “balanced”, etc.)
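Several of these options can be combined in a single run; the values below are illustrative:
lm-eval --model hf \
  --model_args pretrained=gpt2,dtype=bfloat16,trust_remote_code=true,device_map=auto \
  --tasks hellaswag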
API-Based Models
For commercial API models:
# OpenAI
lm-eval --model openai-completions --model_args model=gpt-3.5-turbo-instruct
# Anthropic
lm-eval --model anthropic --model_args model=claude-2
API models typically require authentication via environment variables.
Accelerated Inference
For faster evaluation using vLLM:
lm-eval --model vllm --model_args pretrained=meta-llama/Llama-2-7b-hf
Local Server Models
For models hosted on a local server:
lm-eval --model local-completions --model_args base_url=http://localhost:8000/v1/completions
Task Configuration
Tasks in the harness are configured through YAML files, providing a declarative way to define evaluation setups.
Understanding Task YAML Structure
A basic task configuration includes:
task: task_name
dataset_path: huggingface/dataset_name
dataset_name: subset_name
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
Key fields include:
- task: Unique identifier for the task
- dataset_path: Path to the dataset on the Hugging Face Hub
- doc_to_text: Template for input text (using Jinja2)
- doc_to_target: Template for target output
- metric_list: Metrics for evaluation
Multiple Choice Tasks
For multiple choice tasks, additional configuration is needed:
output_type: multiple_choice
doc_to_text: "{{question}}\nAnswer:"
doc_to_target: 2 # Index of correct answer
doc_to_choice: "{{[choice1, choice2, choice3, choice4]}}"
Using Filters
Filters allow post-processing of model outputs:
filter_list:
- name: "extract-answer"
filter:
- function: "regex"
regex_pattern: "The answer is (\\d+)"
- function: "take_first"
Using Local Datasets
To load a local dataset for evaluation, you can specify data files in the dataset_kwargs field:
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
Or with files already split into separate directories:
dataset_path: arrow
dataset_kwargs:
data_files:
train: /path/to/arrow/train/data-00000-of-00001.arrow
validation: /path/to/arrow/validation/data-00000-of-00001.arrow
Advanced Features
Chat Templates
For evaluating chat models with the appropriate formatting:
lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template
This applies the model’s chat template to the prompt, essential for instruction-tuned models.
For models with multiple chat templates:
lm-eval --apply_chat_template chatml
The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. When apply_chat_template is set to True, the target delimiter is now set to an empty string instead of using the configured delimiter.
This prevents interference between chat template formatting and the default delimiter system, which is particularly important for multiple choice tasks where the template itself handles spacing.
Few-Shot as Multi-Turn Conversations
Format few-shot examples as a conversation history:
lm-eval --num_fewshot 3 --apply_chat_template --fewshot_as_multiturn
Task Groups and Benchmarks
Run multiple related tasks as a benchmark:
lm-eval --model hf --model_args pretrained=gpt2 --tasks mmlu
This runs all MMLU subtasks and provides both individual and aggregate metrics.
For creating your own group configurations, you can create a group YAML config with a group key, which denotes the name of the group, and a task key, which lists the tasks to include. A good example is lm_eval/tasks/mmlu/default/_mmlu.yaml.
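A minimal sketch of such a group config (the group name and member tasks here are hypothetical):
group: my_suite
task:
  - hellaswag
  - sciq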
Decontamination
Check for training data contamination:
lm-eval --model hf --model_args pretrained=gpt2 --tasks sciq
When enabled on a task, this checks for n-gram overlaps with training data.
The decontamination procedure tests model generalization by detecting whether test set data exists in the training set (contamination). OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document, using N values between 8 and 13 depending on dataset.
Caching Results
Cache evaluated results to speed up repeated runs:
lm-eval --use_cache /path/to/cache --cache_requests true
Creating Custom Tasks
Task File Structure
To create a new task:
- Create a YAML file in lm_eval/tasks/your_task_name.yaml
- Configure dataset parameters, prompt templates, and metrics
- Register the task with a unique name
For complex preprocessing, you can add Python functions:
process_docs: !function utils.process_docs
With a corresponding Python file:
# utils.py
def process_docs(dataset):
    def _process_doc(doc):
        # Preprocess a single document; the "question" field here is illustrative
        doc["question"] = doc["question"].strip()
        return doc

    return dataset.map(_process_doc)
Writing Prompt Templates
When creating prompts, you will use doc_to_text, doc_to_target, and (optionally) doc_to_choice. doc_to_text defines the input string a model will be given, while doc_to_target and doc_to_choice are used to generate the target text.
doc_to_target can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, doc_to_choice must also be set with the appropriate list of possible choice strings.
For simple cases, you can enter the feature name directly:
doc_to_text: startphrase
doc_to_target: label
The evaluation harness supports the Jinja 2 templating language for writing prompts. For example:
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
Such that {{passage}} will be replaced by doc["passage"] and {{question}} by doc["question"] when rendering the prompt template.
Importing Prompts from Promptsource
You can load prompts from Promptsource by using the use_prompt argument:
use_prompt: "promptsource:GPT-3 Style"
If you would like to run evaluation on all prompt templates:
use_prompt: "promptsource:*"
Creating Task Filters
Filters allow you to post-process model outputs before scoring them. A full list of supported filter operations can be found in lm_eval/filters/__init__.py. Contributions of new filter types are welcome!
Multiple filter pipelines can run on the same model outputs generated in one run on a task. This enables scenarios like:
- Post-processing output text by truncating or extracting answers
- Ensembling over multiple “takes” on a document
For example, the file lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml emulates the setup used by Self-Consistency Improves Chain of Thought Prompting, which generates multiple chain-of-thought outputs, extracts numeric answers, and uses majority voting.
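A simplified sketch of what such a multi-pipeline filter configuration can look like (the pipeline names and regex pattern are illustrative, not copied from the actual file):
filter_list:
  - name: "take-first"       # score only the first sampled answer
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "majority-vote"    # ensemble over all sampled answers
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "majority_vote"
      - function: "take_first"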
Best Practices and Common Pitfalls
- Tokenization Alignment
  - Verify model logits align with target token positions
  - Prevent off-by-one errors in likelihood calculation
  - Use reference implementations from HFLM as guides
- Template Safety
  - Escape special characters in Jinja templates
  - Validate few-shot example field consistency
  - Implement template versioning through tokenizer_name
- Performance Considerations
  - Implement request reordering for large evaluations
  - Utilize batch processing where supported
  - Profile memory usage during generation tasks
- Evaluation Validity
  - Separate few-shot and test splits
  - Audit metric implementations for task appropriateness
  - Verify chat template application through debug output
- Resource Management
  - Use --batch_size auto to automatically determine the optimal batch size
  - For API models, set appropriate num_concurrent and timeout values
  - Consider using --limit for debugging to evaluate only a subset of documents (see the example after this list)
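For example, a quick smoke-test run during development might restrict the evaluation to a handful of documents:
lm-eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks hellaswag \
  --limit 10 \
  --batch_size auto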
Adding New Models to the Framework
When implementing a new model type, all models must subclass the lm_eval.api.model.LM class, which enforces a common interface:
class MyCustomLM(LM):
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Implementation for calculating conditional log probabilities
        ...

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Implementation for calculating full-text log probabilities
        ...

    def generate_until(self, requests: list[Instance]) -> list[str]:
        # Implementation for free-form text generation
        ...
These methods support three types of requests:
- generate_until: Generates text from the model until reaching stopping criteria
- loglikelihood: Calculates the log probability of a target string given an input
- loglikelihood_rolling: Calculates the log probability of an entire input string
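As a rough sketch of how a backend might consume these requests (assuming each generate_until Instance stores a (context, generation_kwargs) tuple in its args attribute; _generate_text is a hypothetical helper on your class):
from lm_eval.api.instance import Instance

def generate_until(self, requests: list[Instance]) -> list[str]:
    outputs = []
    for request in requests:
        # Each request bundles the prompt string and the task's generation settings
        context, gen_kwargs = request.args
        stop_sequences = gen_kwargs.get("until", [])  # stop strings requested by the task
        outputs.append(self._generate_text(context, stop=stop_sequences))  # hypothetical helper
    return outputs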
To make your model usable via the CLI, use the lm_eval.api.registry.register_model decorator:
from lm_eval.api.registry import register_model
@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    # Implementation
    ...
For adding chat templates, implement three additional methods:
class MyCustomLM(LM):
@property
def tokenizer_name(self) -> str:
"""Return the name of the model's tokenizer and/or chat template."""
def chat_template(self, chat_template: Union[bool, str] = False) -> str:
"""Get the appropriate chat template string."""
def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
"""Process a chat history into a string for the model."""
Practical Examples
Evaluating a Local Hugging Face Model
lm-eval --model hf \
--model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \
--tasks mmlu,hellaswag \
--num_fewshot 5 \
--batch_size auto \
--output_path results/mistral-7b.json \
--apply_chat_template
Evaluating a Quantized Model
lm-eval --model hf \
--model_args pretrained=TheBloke/Llama-2-13B-GPTQ,gptq=true \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size 1
Evaluating an API Model
# Set OPENAI_API_KEY environment variable first
lm-eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu,bbh \
--num_fewshot 5 \
--batch_size 10
Self-Consistency Evaluation
lm-eval --model hf \
--model_args pretrained=meta-llama/Llama-2-70b-hf \
--tasks gsm8k-cot-self-consistency \
--num_fewshot 8 \
--batch_size 4 \
--gen_kwargs temperature=0.7,top_p=0.95
Working with Vision-Language Models
The framework also supports multimodal evaluation with the HFMultimodalLM class for models like Llava and Idefics:
from lm_eval.models.hf_vlms import HFMultimodalLM
# Initialize the model
model = HFMultimodalLM(
    pretrained="llava-hf/llava-1.5-7b-hf",
    device_map="auto",
)

# Generate responses for multimodal inputs
results = model.generate_until(...)
Contributing to lm-evaluation-harness
EleutherAI welcomes contributions to improve the framework. The project follows these priorities for addressing concerns about prompting and evaluation details:
- Use procedures with widespread agreement among LLM trainers
- Follow clear and unambiguous official implementations
- Use procedures with widespread agreement among LLM evaluators
- Choose from common implementations when there’s no universal agreement, preferring those found in LLM training papers
They maintain an active Discord server with the #lm-thunderdome channel dedicated to developing this project and #release-discussion for support.
Important resources include:
- Documentation pages in the docs directory
- GitHub Milestones for tracking progress toward version releases
- The Project Board for tracking work items and feature requests
- Discord discussions in the #lm-thunderdome channel
Contributing a New Task
To contribute a new task:
- Fork the repository
- Create a YAML configuration file
- Verify against reference implementations
- Add documentation and test results
- Submit a pull request
For first-time contributors, the team maintains a list of good first issues, which can be found on the project board or by filtering GitHub Issues.
Contributing a New Model Type
To add support for a new model type:
- Implement a subclass of lm_eval.api.model.LM
- Register your model with @register_model
- Implement the required interface methods
- Add documentation and tests
- Submit a pull request
Code style guidelines:
- LM Evaluation Harness uses ruff for linting via pre-commit
- Install dev tools via pip install lm_eval[dev] or pip install -e ".[dev]"
- Run pre-commit install to ensure linters and checks will run upon committing
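A typical contributor workflow using the standard pre-commit commands:
pip install -e ".[dev]"
pre-commit install          # register the git hook
pre-commit run --all-files  # run all linters once over the repository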
Improved Documentation with MkDocs
I’ve recently contributed to the lm-evaluation-harness project by adding MkDocs support to enhance the documentation experience. This improvement provides a more navigable and user-friendly documentation interface with automatic navigation, search functionality, and better organization of the existing documentation.
You can see a preview of the MkDocs implementation at my fork’s documentation site. The pull request is currently open and will hopefully be merged into the main repository soon, making the documentation more accessible to new users and contributors.
The MkDocs integration preserves all the existing documentation while providing:
- Modern, responsive documentation UI
- Automatic navigation sidebar
- Full-text search capabilities
- Improved readability on mobile devices
- Better organization of the existing documentation files
Conclusion
EleutherAI’s evaluation framework provides a standardized way to assess language model capabilities across a wide range of tasks. By separating the evaluation logic from model implementation, it enables fair comparison between different models and architectures. The declarative configuration system makes it easy to add new tasks and benchmarks, contributing to the growing ecosystem of LLM evaluation.
Whether you’re developing a new model or researching evaluation methodologies, understanding these evaluation methods is crucial for rigorous assessment of language model capabilities.