Category Encoders is a scikit-learn-contrib library providing a comprehensive set of encoders for categorical variables. I authored this project but am no longer the day-to-day maintainer. It has grown into one of the most widely used categorical encoding libraries in the Python ecosystem.
Features
- 15+ Encoding Methods: OneHot, Target, Binary, Hashing, Leave-One-Out, and more
- Scikit-learn Compatible: Full pipeline and transformer API support
- Handles Missing Values: Built-in strategies for missing data
- Feature Engineering: Advanced encodings for high-cardinality features
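To make the high-cardinality point concrete, here is a minimal pure-Python sketch of the general idea behind binary encoding: each category gets an integer code, and the code is written out as binary digits, so n categories need roughly log2(n) columns instead of the n columns one-hot encoding would produce. The function name `binary_encode` and the choice to start codes at 1 are assumptions made for illustration; this is not the library's implementation, and its exact column layout may differ.

```python
def binary_encode(values):
    """Map each distinct category to an integer code, then to its binary digits."""
    # Assign codes in order of first appearance, starting at 1
    # (starting at 1 rather than 0 is an illustrative assumption).
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1

    # The largest code equals the number of categories, so its bit length
    # is the number of output columns needed.
    width = max(1, len(codes).bit_length())

    # Zero-pad each code to the same width and split it into 0/1 columns.
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


# Five distinct categories fit in three binary columns.
rows = binary_encode(["red", "green", "blue", "green", "yellow", "pink"])
for row in rows:
    print(row)
```

With one-hot encoding the same data would need five columns; the gap widens quickly as the number of categories grows.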
Installation
pip install category-encoders
Quick Start
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# X is a pandas DataFrame of features and y is the target (both assumed to exist)
# Target encoding (supervised, so y is required when fitting)
encoder = ce.TargetEncoder(cols=['category_column'])
X_encoded = encoder.fit_transform(X, y)
# Binary encoding for high-cardinality columns
encoder = ce.BinaryEncoder(cols=['high_card_column'])
X_encoded = encoder.fit_transform(X)
# Use in sklearn pipelines
pipe = Pipeline([
    ('encoder', ce.OrdinalEncoder()),
    ('classifier', RandomForestClassifier())
])
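For readers unfamiliar with target encoding, the sketch below shows the underlying idea in plain Python: each category is replaced by a smoothed mean of the target for that category, and unseen categories fall back to the global mean. It uses simple additive smoothing with a weight `m`, which is a swapped-in simplification and not necessarily the formula `TargetEncoder` applies. The function names and the parameter `m` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=1.0):
    """Learn a category -> smoothed target mean mapping (illustrative only)."""
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # Additive smoothing: pull each category mean toward the prior,
    # more strongly when the category has few observations.
    mapping = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return mapping, prior


def transform_target_encoding(categories, mapping, prior):
    """Replace each category with its learned value; unseen ones get the prior."""
    return [mapping.get(c, prior) for c in categories]


cats = ["a", "a", "b", "b", "b", "c"]
y = [1, 0, 1, 1, 1, 0]
mapping, prior = fit_target_encoding(cats, y, m=1.0)
encoded = transform_target_encoding(["a", "c", "unseen"], mapping, prior)
print(mapping)
print(encoded)
```

Because the mapping is learned from the target, it should be fitted on training data only and then applied to validation or test data, which is the same reason the library's encoder takes `y` in `fit_transform`.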
Academic Citation
Category Encoders was published in the Journal of Open Source Software (JOSS) and has been cited in numerous academic papers.