scikit-learn: machine learning in Python

scikit-learn: machine learning in Python — scikit-learn 0.12.1 documentation

scikit-learn: machine learning in Python


Easy-to-use and general-purpose machine learning in Python

scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit scientific Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems, accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

Supervised learning
Support vector machines, linear models, naive Bayes, Gaussian processes...

Unsupervised learning
Clustering, Gaussian mixture models, manifold learning, matrix factorization, covariance...

And much more
Model selection, datasets, feature extraction... See below.

License: Open source, commercially usable: BSD license (3 clause)

Documentation for scikit-learn version 0.12.1. For other versions and printable format, see Documentation resources.
User Guide¶

    1. Installing scikit-learn
        1.1. Installing an official release
        1.2. Third party distributions of scikit-learn
        1.3. Bleeding Edge
        1.4. Testing
    2. Tutorials: From the bottom up with scikit-learn
        2.1. An Introduction to machine learning with scikit-learn
        2.2. A tutorial on statistical-learning for scientific data processing
    3. Supervised learning
        3.1. Generalized Linear Models
        3.2. Support Vector Machines
        3.3. Stochastic Gradient Descent
        3.4. Nearest Neighbors
        3.5. Gaussian Processes
        3.6. Partial Least Squares
        3.7. Naive Bayes
        3.8. Decision Trees
        3.9. Ensemble methods
        3.10. Multiclass and multilabel algorithms
        3.11. Feature selection
        3.12. Semi-Supervised
        3.13. Linear and Quadratic Discriminant Analysis
    4. Unsupervised learning
        4.1. Gaussian mixture models
        4.2. Manifold learning
        4.3. Clustering
        4.4. Decomposing signals in components (matrix factorization problems)
        4.5. Covariance estimation
        4.6. Novelty and Outlier Detection
        4.7. Hidden Markov Models
    5. Model Selection
        5.1. Cross-Validation: evaluating estimator performance
        5.2. Grid Search: setting estimator parameters
        5.3. Pipeline: chaining estimators
    6. Dataset transformations
        6.1. Preprocessing data
        6.2. Feature extraction
        6.3. Kernel Approximation
    7. Dataset loading utilities
        7.1. General dataset API
        7.2. Toy datasets
        7.3. Sample images
        7.4. Sample generators
        7.5. Datasets in svmlight / libsvm format
        7.6. The Olivetti faces dataset
        7.7. The 20 newsgroups text dataset
        7.8. Downloading datasets from the repository
        7.9. The Labeled Faces in the Wild face recognition dataset
    8. Reference
        8.1. sklearn.cluster: Clustering
        8.2. sklearn.covariance: Covariance Estimators
        8.3. sklearn.cross_validation: Cross Validation
        8.4. sklearn.datasets: Datasets
        8.5. sklearn.decomposition: Matrix Decomposition
        8.6. sklearn.ensemble: Ensemble Methods
        8.7. sklearn.feature_extraction: Feature Extraction
        8.8. sklearn.feature_selection: Feature Selection
        8.9. sklearn.gaussian_process: Gaussian Processes
        8.10. sklearn.grid_search: Grid Search
        8.11. sklearn.hmm: Hidden Markov Models
        8.12. sklearn.kernel_approximation Kernel Approximation
        8.13. sklearn.semi_supervised Semi-Supervised Learning
        8.14. sklearn.lda: Linear Discriminant Analysis
        8.15. sklearn.linear_model: Generalized Linear Models
        8.16. sklearn.manifold: Manifold Learning
        8.17. sklearn.metrics: Metrics
        8.18. sklearn.mixture: Gaussian Mixture Models
        8.19. sklearn.multiclass: Multiclass and multilabel classification
        8.20. sklearn.naive_bayes: Naive Bayes
        8.21. sklearn.neighbors: Nearest Neighbors
        8.22. sklearn.pls: Partial Least Squares
        8.23. sklearn.pipeline: Pipeline
        8.24. sklearn.preprocessing: Preprocessing and Normalization
        8.25. sklearn.qda: Quadratic Discriminant Analysis
        8.26. sklearn.svm: Support Vector Machines
        8.27. sklearn.tree: Decision Trees
        8.28. sklearn.utils: Utilities

Example Gallery¶

        General examples
        Examples based on real world datasets
        Covariance estimation
        Dataset examples
        Ensemble methods
        Tutorial exercises
        Gaussian Process for Machine Learning
        Generalized Linear Models
        Manifold learning
        Gaussian Mixture Models
        Nearest Neighbors
        Semi Supervised Classification
        Support Vector Machines
        Decision Trees


        Submitting a bug report
        Retrieving the latest code
        Contributing code
        Other ways to contribute
        Coding guidelines
        APIs of scikit-learn objects
    How to optimize for speed
        Python, Cython or C/C++?
        Profiling Python code
        Memory usage profiling
        Performance tips for the Cython developer
        Profiling compiled extensions
        Multi-core parallelism using joblib.Parallel
        A sample algorithmic trick: warm restarts for cross validation
    Utilities for Developers
        Validation Tools
        Efficient Linear Algebra & Array Operations
        Efficient Routines for Sparse Matrices
        Graph Routines
        Testing Functions
        Helper Functions
        Hash Functions
        Warnings and Exceptions
    Developers’ Tips for Debugging
        Memory errors: debugging Cython with valgrind
    About us
        Citing scikit-learn