LM Evaluation Harness

A framework developed by EleutherAI for evaluating language models.

Overview

The LM Evaluation Harness provides a standardized framework for evaluating language models, whether they run locally (for example, through Hugging Face transformers) or are served over an API. The tool allows researchers and developers to:

  • Evaluate model performance on a wide range of tasks
  • Compare different models using consistent metrics
  • Extend the framework with custom tasks and models

Installation

# Basic installation
pip install lm-eval

# With additional dependencies
pip install "lm-eval[gptq,vllm]"

# For development (run from a clone of the repository)
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"
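
After installation, you can check that the command-line entry point is on your PATH and see which tasks ship with your version. The --tasks list option below is assumed to be available in your release; run lm-eval --help to confirm.

# List all tasks registered in the installed version
lm-eval --tasks list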

Quick Start

# Basic usage example: evaluate a Hugging Face model on two tasks
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0
)
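
simple_evaluate returns a dictionary of evaluation output. The snippet below is a minimal sketch of inspecting it, assuming per-task metrics are stored under the "results" key as in recent releases:

# Print the metric dictionary reported for each task
for task_name, metrics in results["results"].items():
    print(task_name, metrics)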

Command Line Usage

lm-eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --num_fewshot 0
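
For larger runs you will typically also control batching, device placement, and where results are written. The flags below (--batch_size, --device, --output_path) are assumed to match your installed CLI version; check lm-eval --help to confirm.

# Illustrative invocation with batching, device selection, and an output path
lm-eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks hellaswag,mmlu \
    --num_fewshot 0 \
    --batch_size 8 \
    --device cuda:0 \
    --output_path ./results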

Features

  • Support for evaluating text-only and multimodal models
  • Flexible API for integrating custom models and tasks (see the sketch after this list)
  • Comprehensive benchmarking capabilities
  • Caching mechanisms for faster evaluation
  • Extensible framework for adding new tasks and evaluation metrics
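
As referenced in the list above, custom backends plug in through the model API. The sketch below assumes the LM base class and register_model decorator live at lm_eval.api.model and lm_eval.api.registry, and that subclasses implement loglikelihood, loglikelihood_rolling, and generate_until, as described in the Model Guide; exact names and signatures may differ between versions.

# Minimal sketch of a custom model backend (names are illustrative, not a spec)
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model

@register_model("my-dummy-model")  # hypothetical registry name
class DummyLM(LM):
    """A stub backend that returns fixed outputs for every request."""

    def loglikelihood(self, requests):
        # One (log-probability, is-greedy) pair per request
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # One rolling log-likelihood per request
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        # One generated string per request
        return ["" for _ in requests]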

Documentation Guide

Welcome to the docs for the LM Evaluation Harness! Here's what you'll find in our documentation:

  • Interface - Learn about the public interface of the library, including how to run evaluations from the command line or from within an external library.
  • Model Guide - Learn how to add a new library, API, or model type to the framework, with explanations of different evaluation approaches.
  • API Guide - Extended guide on how to extend the library to new model classes served over an API.
  • New Task Guide - A crash course on adding new tasks to the library.
  • Task Configuration Guide - Advanced documentation on pushing the limits of task configuration that the Eval Harness supports.