CounterGen
ABSTRACT
CounterGen is a framework for auditing and reducing bias in NLP models, whether generative (e.g. GPT-J, T5, GPT-3) or classification models (e.g. BERT). It does so by generating counterfactual datasets, evaluating NLP models, and performing direct model editing (à la ROME) to reduce bias. CounterGen is easy to use, even for people who cannot code; it applies to any text data, including niche use cases; and it proposes concrete ways to debias biased models.
PUBLICATION DATE
October 2022
AUTHORS
Fabien Roger, Siméon Campos


To evaluate the bias of a model, we compare its outputs on inputs about members of a protected category with its outputs on inputs that are not. CounterGen lets you generate variations of your data by changing attributes that should not affect the model's output. This makes it possible to measure bias on the actual text your model processes in deployment, which is not possible with traditional benchmarks.

This is packaged inside a lightweight Python module, “countergen”. It ships with default datasets, augmentation methods, models, and metrics drawn from the NLP literature.
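As an illustration, here is a minimal self-contained sketch of the counterfactual-evaluation idea. It is written against Hugging Face transformers rather than the countergen API (the swap list, the example sentence, and the choice of GPT-2 are assumptions made for this example; see the documentation for the actual interface): it builds a counterfactual variation of a text by swapping gendered terms, then compares the average log-likelihood the model assigns to each version.

```python
# Conceptual sketch only, not the countergen API.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical swap list; a real augmenter is far more thorough.
SWAPS = {"he": "she", "his": "her", "him": "her", "man": "woman"}

def make_variation(text: str) -> str:
    """Build a counterfactual variation by swapping gendered terms."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAPS.get(word.lower())
        if repl is None:
            return word
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b\w+\b", swap, text)

def avg_log_likelihood(model, tokenizer, text: str) -> float:
    """Average per-token log-likelihood the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return -loss.item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

original = "He is a brilliant engineer and his team respects him."
variation = make_variation(original)
gap = avg_log_likelihood(model, tokenizer, original) - avg_log_likelihood(
    model, tokenizer, variation
)
print(f"Log-likelihood gap between the two variations: {gap:.3f}")
```

A large gap means the model scores the two variations very differently, which is the kind of signal the evaluation metrics aggregate over a whole dataset.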

To make it more accessible to the wider community, we built a website that lets people without coding experience evaluate GPT-like models directly: they can write or upload data, augment it on the platform, and run the evaluation on models available through an API.

Our project also leverages augmentation methods to reduce model bias through direct model editing, building on INLP and RLACE, two techniques developed to make embeddings more similar across protected categories. We measure the neuron activations of a model on the augmented data and work out how to change the network's internal computations so that it behaves the same way regardless of the protected category an input belongs to. This is available through a companion Python module, “countergenedit”.
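To make the editing step concrete, here is a minimal conceptual sketch of projection-based editing in the spirit of INLP and RLACE; it is not the countergenedit implementation, and the layer index, the counterfactual pairs, and the use of a single mean-difference direction are illustrative assumptions. It estimates the direction along which activations differ between paired inputs and registers a hook that projects this direction out at inference time.

```python
# Conceptual sketch of projection-based editing, not the countergenedit API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # illustrative choice of an intermediate block

def mean_activation(text: str) -> torch.Tensor:
    """Mean hidden state of `layer` over the tokens of `text`."""
    captured = {}
    def grab(_module, _inputs, output):
        captured["h"] = output[0].detach()  # GPT-2 blocks return a tuple
    handle = layer.register_forward_hook(grab)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["h"].mean(dim=1).squeeze(0)

# Counterfactual pairs differing only in the protected attribute (illustrative).
pairs = [("He is a nurse.", "She is a nurse."),
         ("His work was praised.", "Her work was praised.")]
diffs = torch.stack([mean_activation(a) - mean_activation(b) for a, b in pairs])
direction = diffs.mean(dim=0)
direction = direction / direction.norm()

def project_out(_module, _inputs, output):
    """Remove the protected-attribute direction from the block's output."""
    hidden = output[0]
    hidden = hidden - (hidden @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:]

edit_handle = layer.register_forward_hook(project_out)  # the model is now edited
```

Once the hook is registered, the same evaluation as above can be re-run on the edited model to check whether the gap between counterfactual variations shrinks (and edit_handle.remove() undoes the edit).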

The tool is designed to address two of the three types of bias that the NIST AI Risk Management Framework distinguishes: systemic and computational bias. We decrease systemic bias by offering several versions of the tool, so that even those with very little computing power or no coding experience can use it. We decrease computational bias by making it much easier for everyone to debias their models and datasets by generating more diverse data that fits their needs.

We designed the tool so that its defaults use as little compute as possible, keeping it financially and environmentally sustainable over time. It is usable by anyone through Google Colab, a free and easy-to-use service, starting from our demonstration notebook.

Both the Python modules and the online tool are free and open source, and can easily be adopted by an organization wishing to give its members access to bias evaluation tools. Detailed documentation, as well as the first results we obtained with these tools, is available.