# LM Evaluation Harness

A framework for evaluating language models, developed by EleutherAI.

## Overview

The LM Evaluation Harness provides a standardized framework for evaluating language models, whether they are run locally or served over an API. It allows researchers and developers to:
- Evaluate model performance on a wide range of tasks
- Compare different models using consistent metrics
- Extend the framework with custom tasks and models

## Installation

```bash
# Basic installation
pip install lm-eval

# With additional dependencies
pip install "lm-eval[gptq,vllm]"

# For development (run from a local clone of the repository)
pip install -e ".[dev]"
```

## Quick Start

```python
# Basic usage example
import lm_eval

# "hf" selects the Hugging Face transformers backend;
# model_args passes its arguments, such as the checkpoint to load.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0,
)
```

## Command Line Usage

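The same evaluation can be launched from the shell. The sketch below assumes the `lm_eval` console entry point and flag names from recent releases of the harness; check `lm_eval --help` for the options available in your installed version.

```bash
# Zero-shot evaluation of GPT-2 on HellaSwag and MMLU
lm_eval \
    --model hf \
    --model_args pretrained=gpt2 \
    --tasks hellaswag,mmlu \
    --num_fewshot 0 \
    --batch_size 8 \
    --output_path results/gpt2
```

In recent versions, `lm_eval --tasks list` prints the names of all registered tasks.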

## Features

- Support for evaluating text-only and multimodal models
- Flexible API for integrating custom models and tasks
- Comprehensive benchmarking capabilities
- Caching mechanisms for faster evaluation (see the example after this list)
- Extensible framework for adding new tasks and evaluation metrics
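
As one illustration of the caching support mentioned above, recent releases of the CLI expose a `--use_cache` flag that stores model responses in a local database so that interrupted or repeated runs can skip requests that were already answered; the flag name and exact behavior shown here are assumptions based on recent versions and may differ in yours.

```bash
# Cache model responses under cache/gpt2 so that re-running the same command
# reuses previously computed results instead of querying the model again
lm_eval \
    --model hf \
    --model_args pretrained=gpt2 \
    --tasks hellaswag \
    --use_cache cache/gpt2
```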

## Documentation Guide

Welcome to the LM Evaluation Harness documentation! Here's what you'll find:
- Interface - Learn about the public interface of the library, including how to run evaluations from the command line or from within another library.
- Model Guide - Learn how to add a new library, API, or model type to the framework, with explanations of different evaluation approaches.
- API Guide - A deeper guide to extending the library to new model classes served over an API.
- New Task Guide - A crash course on adding new tasks to the library.
- Task Configuration Guide - Advanced documentation on pushing the limits of task configuration that the Eval Harness supports.