{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "GRjDgHBpRYyx" }, "source": [ "# **Tutorial 3: Introduction to classification using Python**\n", "\n", "**Goal:** The goal of this tutorial is to get familiar with the libraries needed to train classifiers using Python.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "L40UwWCYXsRo" }, "source": [ "## **Tools**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: scikit-learn in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (1.5.0)\n", "Requirement already satisfied: pandas in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (2.2.0)\n", "Requirement already satisfied: numpy>=1.19.5 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from scikit-learn) (1.26.4)\n", "Requirement already satisfied: scipy>=1.6.0 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from scikit-learn) (1.12.0)\n", "Requirement already satisfied: joblib>=1.2.0 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from scikit-learn) (1.3.2)\n", "Requirement already satisfied: threadpoolctl>=3.1.0 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from scikit-learn) (3.2.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from pandas) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from pandas) (2023.4)\n", "Requirement already satisfied: tzdata>=2022.7 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from pandas) (2023.4)\n", "Requirement already satisfied: six>=1.5 in c:\\users\\lucas\\python-envs\\nb-env\\lib\\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "[notice] A new release of pip is 
available: 24.0 -> 24.2\n", "[notice] To update, run: python.exe -m pip install --upgrade pip\n" ] } ], "source": [ "!pip install scikit-learn pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-09-27T20:39:54.274835Z", "start_time": "2020-09-27T20:39:54.138860Z" }, "colab": { "base_uri": "https://localhost:8080/" }, "id": "IWMguWhnXsT3", "outputId": "7efd1650-d7fc-4120-a76e-4e9e4e75adb5" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from sklearn.dummy import DummyClassifier\n", "from sklearn.svm import SVC # Support Vector Machine classifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.naive_bayes import GaussianNB # Naive bayes\n", "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "markdown", "metadata": { "id": "Q6yZ15zdxrXZ" }, "source": [ "# **Classification using Python**" ] }, { "cell_type": "markdown", "metadata": { "id": "2kPqGmpAXsRp" }, "source": [ "## **Scikit-learn**" ] }, { "cell_type": "markdown", "metadata": { "id": "EKA76ef1XsRp" }, "source": [ "There are many libraries for data analysis. In this tutorial we will use **scikit-learn** (http://scikit-learn.org), which ships with many machine learning models already implemented." ] }, { "cell_type": "markdown", "metadata": { "id": "S5yMQ83MxrXZ" }, "source": [ "## **Example: Iris Dataset**" ] }, { "cell_type": "markdown", "metadata": { "id": "PTr_cPjXxrXZ" }, "source": [ "We will use the **iris** dataset available in sklearn, which contains 150 **instances** (rows) from 3 different **classes** of flowers. The **load_iris** function loads the dataset." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xF5ji6V_xrXa", "outputId": "9f33bfbb-c1f1-4fd2-c8cc-88a8c30a16f6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X:\n", " [[5.1 3.5 1.4 0.2]\n", " [4.9 3. 1.4 0.2]\n", " [4.7 3.2 1.3 0.2]\n", " [4.6 3.1 1.5 0.2]\n", " [5. 3.6 1.4 0.2]\n", " [5.4 3.9 1.7 0.4]\n", " [4.6 3.4 1.4 0.3]\n", " [5. 3.4 1.5 0.2]\n", " [4.4 2.9 1.4 0.2]\n", " [4.9 3.1 1.5 0.1]]\n", "y:\n", " [0 0 0 0 0 0 0 0 0 0]\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "\n", "X = iris.data ## data: the features of each flower.\n", "y = iris.target ## class of each instance above.\n", "\n", "print(\"X:\\n\", X[:10]) # show the first 10 rows, i.e. the features of 10 flowers.\n", "print(\"y:\\n\", y[:10]) # show the classes of those first 10 instances of X" ] }, { "cell_type": "markdown", "metadata": { "id": "WRDwoS1yxrXb" }, "source": [ "To see which classes there are:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "mQhnYp9LxrXc", "outputId": "ed7511c1-35b9-40a1-8ad4-6db1e052319a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n" ] } ], "source": [ "print(iris.target) # show the class of every instance in X" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype='\n", "
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa
\n", "" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " species \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Build a DataFrame from the data in X\n", "iris_df = pd.DataFrame(X, columns=iris.feature_names)\n", "\n", "# Add a column with the species of each flower\n", "iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)\n", "\n", "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "zGdUX0QHbzdA" }, "source": [ "There are 50 instances of each class" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cYfdlGVGbU_x", "outputId": "e855edfc-46bc-450f-92ed-c61905de14d9" }, "outputs": [ { "data": { "text/plain": [ "species\n", "setosa 50\n", "versicolor 50\n", "virginica 50\n", "Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df['species'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is always important to **check how many instances each class has**, to know whether the dataset is balanced. Training on a dataset without accounting for class imbalance **can lead to a model that does not generalize well**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall from the EDA lectures that it is always useful to **get a sense of how the data are distributed**. To do that, we will draw the data in a scatter plot." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10, 5))\n", "plt.title('Sepal Length vs Sepal Width')\n", "plt.scatter(iris_df['sepal length (cm)'], iris_df['sepal width (cm)'], c=iris.target, cmap='viridis')\n", "plt.xlabel('Sepal Length')\n", "plt.ylabel('Sepal Width')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10, 5))\n", "plt.title('Petal Length vs Petal Width')\n", "plt.scatter(iris_df['petal length (cm)'], iris_df['petal width (cm)'], c=iris.target, cmap='viridis')\n", "plt.xlabel('Petal Length')\n", "plt.ylabel('Petal Width')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Preface: Splitting the data into training, validation and test sets**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is extremely important to keep the training data separate from the test data.\n", "\n", "- **Training data**: We use these data to **train the ML model**. The split below keeps **30%** of the data (45 instances) for training.\n", "- **Validation data**: We use these data to **tune the hyperparameters** of our model. Here this is about **35%** of the data (52 instances).\n", "- **Test data**: We use these data to **evaluate the final version of our model**. Here this is about **35%** of the data (53 instances).\n", "\n", "These proportions follow from `test_size=0.7` in the first split; a 70% / 15% / 15% split, obtained with `test_size=0.3`, is the more common choice.\n", "\n", "**IMPORTANT: How and why do we use the validation set?**\n", "\n", "The validation set is used to tune the hyperparameters of our model, or to make any other adjustment we need. The point is that **the test set then really does simulate data the model has never seen**.\n", "\n", "**IMPORTANT 2: Why use stratify?**\n", "\n", "The **stratify** parameter of **train_test_split** makes the split **preserve the proportion of each class**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note that the type of the function's output is: \n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# First split the training data from the validation/test data (test_size=0.7 keeps 30% for training)\n", "X_train, X_val_and_test, y_train, y_val_and_test = train_test_split(X, y, test_size=0.7, random_state=0, stratify=y)\n", "\n", "# Then split the validation and test data: 0.5 x 0.7 = 0.35 each\n", "X_val, X_test, y_val, y_test = train_test_split(X_val_and_test, y_val_and_test, test_size=0.5, random_state=0, stratify=y_val_and_test)\n", "\n", "print(\"Note that the type of the function's output is: \", type(X_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details on what **train_test_split** does, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Chapter 1: Data preprocessing**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many data preprocessing techniques, such as **scaling, outlier removal, handling missing values, etc**.\n", "\n", "In this tutorial we focus on **scaling the data**, a technique that some machine learning algorithms need in order to work properly.\n", "\n", "**NOTE**: Very important! We **fit the scaler on the training data only** and then **transform all the data** with that same scaler, so that no information from the validation or test sets leaks into preprocessing." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Standardization**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standardization **transforms the data so that each feature has mean 0 and standard deviation 1**. \n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "std_scaler = StandardScaler()\n", "\n", "X_train_std_scaled = std_scaler.fit_transform(X_train)\n", "\n", "X_val_std_scaled = std_scaler.transform(X_val)\n", "X_test_std_scaled = std_scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def plot_before_after_scaling(X_before, X_after):\n", " plt.figure(figsize=(12, 3))\n", "\n", " plt.subplot(1, 2, 1)\n", " plt.title('Before Scaling')\n", " plt.hist(X_before[:, 0], bins=20)\n", " \n", " plt.subplot(1, 2, 2)\n", " plt.title('After Scaling')\n", " plt.hist(X_after[:, 0], bins=20)\n", " \n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_before_after_scaling(X_train, X_train_std_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Min-Max Scaling**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another scaling technique is Min-Max Scaling, which rescales the values to the range [0, 1].\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "min_max_scaler = MinMaxScaler()\n", "\n", "X_train_min_max_scaled = min_max_scaler.fit_transform(X_train)\n", "\n", "X_val_min_max_scaled = min_max_scaler.transform(X_val)\n", "X_test_min_max_scaled = min_max_scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_before_after_scaling(X_train, X_train_min_max_scaled)" ] }, { "cell_type": "markdown", "metadata": { "id": "L4UssHGwxrXe" }, "source": [ "## **Chapter 2: Training a classifier**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. We define our model" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DecisionTreeClassifier(max_depth=10, random_state=0)
" ], "text/plain": [ "DecisionTreeClassifier(max_depth=10, random_state=0)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=0)\n", "clf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. We train our model with the **fit** method. Training adjusts the model's parameters to the data so that it can make more accurate predictions." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "wFHhKgbSxrXf", "outputId": "9d0fdb39-e863-4bb2-d7a3-9bd1a74bf6ba" }, "outputs": [ { "data": { "text/html": [ "
DecisionTreeClassifier(max_depth=10, random_state=0)
" ], "text/plain": [ "DecisionTreeClassifier(max_depth=10, random_state=0)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train) ## Train using X (features), y (class)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.1. We predict with our model using the **predict** method. We would like to use the model to predict the class of a new data point.\n", "\n", "**NOTE**: The predict method takes a list of data points, and each data point is a list of features, so we pass it a list of lists (a 2D array). " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "epm_ERtlxrXf", "outputId": "3b6b96be-3cf9-4685-a400-54325cafd1a5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicted label: [0]\n", "Predicted species: setosa\n" ] } ], "source": [ "una_flor_afuera_de_mi_casa = np.array([[5.0, 3.6, 1.3, 0.25]])\n", "\n", "flor_predicha = clf.predict(una_flor_afuera_de_mi_casa)\n", "\n", "print(\"Predicted label:\", flor_predicha)\n", "# iris.target_names maps each numeric label to its species name\n", "print(\"Predicted species:\", iris.target_names[flor_predicha[0]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.2 We evaluate our model on the **validation** set with the **accuracy**, **precision**, **recall** and **f1 score** metrics. These metrics tell us how well the model's predictions match the true labels on data it was not trained on." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Predict the classes of the validation data\n", "y_val_pred = clf.predict(X_val)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.9615384615384616\n", "Precision (micro): 0.9615384615384616\n", "Precision (macro): 0.9618736383442266\n", "Recall (micro): 0.9615384615384616\n", "Recall (macro): 0.9618736383442266\n", "F1 (micro): 0.9615384615384616\n", "F1 (macro): 0.9618736383442266\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n", "\n", "# Evaluate how well the model did\n", "accuracy = accuracy_score(y_val, y_val_pred)\n", "precision_micro = precision_score(y_val, y_val_pred, average='micro')\n", "precision_macro = precision_score(y_val, y_val_pred, average='macro')\n", "recall_micro = recall_score(y_val, y_val_pred, average='micro')\n", "recall_macro = recall_score(y_val, y_val_pred, average='macro')\n", "f1_micro = f1_score(y_val, y_val_pred, average='micro')\n", "f1_macro = f1_score(y_val, y_val_pred, average='macro')\n", "\n", "print(\"Accuracy:\", accuracy)\n", "print(\"Precision (micro):\", precision_micro)\n", "print(\"Precision (macro):\", precision_macro)\n", "print(\"Recall (micro):\", recall_micro)\n", "print(\"Recall (macro):\", recall_macro)\n", "print(\"F1 (micro):\", f1_micro)\n", "print(\"F1 (macro):\", f1_macro)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use sklearn's **classification_report** function to get a quicker summary of the metrics." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 17\n", " 1 0.94 0.94 0.94 17\n", " 2 0.94 0.94 0.94 18\n", "\n", " accuracy 0.96 52\n", " macro avg 0.96 0.96 0.96 52\n", "weighted avg 0.96 0.96 0.96 52\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(y_val, y_val_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides monitoring the metrics on validation and test, it is also important to **monitor the metrics on training**. If the training metrics are very good but the validation or test metrics are not, **the model may be overfitting**." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 15\n", " 1 1.00 1.00 1.00 15\n", " 2 1.00 1.00 1.00 15\n", "\n", " accuracy 1.00 45\n", " macro avg 1.00 1.00 1.00 45\n", "weighted avg 1.00 1.00 1.00 45\n", "\n" ] } ], "source": [ "y_train_pred = clf.predict(X_train)\n", "\n", "print(classification_report(y_train, y_train_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. We try different hyperparameters and train the model again." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 17\n", " 1 0.94 0.94 0.94 17\n", " 2 0.94 0.94 0.94 18\n", "\n", " accuracy 0.96 52\n", " macro avg 0.96 0.96 0.96 52\n", "weighted avg 0.96 0.96 0.96 52\n", "\n" ] } ], "source": [ "clf_2 = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)\n", "\n", "clf_2.fit(X_train, y_train)\n", "\n", "y_val_pred_2 = clf_2.predict(X_val)\n", "\n", "print(classification_report(y_val, y_val_pred_2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Chapter 3: Evaluating a classifier**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we are confident that the hyperparameters of our model are well tuned, it is time for the final evaluation of the model on the test set.\n", "\n", "In general, if after the final evaluation we are not satisfied with the model's performance, the best options are to try another model, or to go back to the data and see whether more data or more features can be obtained." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1ELByhfHxrXi", "outputId": "cb7b5ea1-1f6a-4239-89a7-216eff931d6e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metrics for classifier 1:\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.88 0.78 0.82 18\n", " 2 0.79 0.88 0.83 17\n", "\n", " accuracy 0.89 53\n", " macro avg 0.89 0.89 0.89 53\n", "weighted avg 0.89 0.89 0.89 53\n", "\n", "\n", "Metrics for classifier 2:\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.88 0.78 0.82 18\n", " 2 0.79 0.88 0.83 17\n", "\n", " accuracy 0.89 53\n", " macro avg 0.89 0.89 0.89 53\n", "weighted avg 0.89 0.89 0.89 53\n", "\n" ] } ], "source": [ "y_test_pred = clf.predict(X_test)\n", "y_test_pred_2 = clf_2.predict(X_test)\n", "\n", "print(\"Metrics for classifier 1:\")\n", "print(classification_report(y_test, y_test_pred))\n", "print()\n", "print(\"Metrics for classifier 2:\")\n", "print(classification_report(y_test, y_test_pred_2))" ] }, { "cell_type": "markdown", "metadata": { "id": "Lj3vDXqEAKCv" }, "source": [ "## **Chapter 4: Confusion matrix**" ] }, { "cell_type": "markdown", "metadata": { "id": "w85ktDOjUwpz" }, "source": [ "The confusion matrix shows how many elements of each true class the classifier assigns to each of the possible classes." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 501 }, "id": "2F1adK5YANn_", "outputId": "2bbfe56f-7e26-4ec9-8744-a2cebcfb3ada" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "cm = confusion_matrix(y_test, y_test_pred) # compute the confusion matrix values\n", "\n", "fig, ax = plt.subplots()\n", "\n", "ax = sns.heatmap(cm, annot=True, cmap=\"Blues\") # draw the matrix as a heatmap for visualization\n", "\n", "ax.set_title('Confusion Matrix \\n')\n", "ax.set_xlabel('Predicted label')\n", "ax.set_ylabel('True label')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "zYJlYIYzW1Ct" }, "source": [ "## **Chapter 5: Cross Validation**" ] }, { "cell_type": "markdown", "metadata": { "id": "uGH-Xm4HW_tR" }, "source": [ "For a more robust evaluation of the model's performance, one option is cross validation. \n", "\n", "Unlike the previous steps, in cross validation we use **all the data**: the data are split into folds, and each fold is evaluated by a model trained only on the remaining folds, so every evaluation is still made on data that model has not seen.\n", "\n", "In sklearn this is done with the `cross_validate` function:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "lQY2HZfuW0rB" }, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']\n", "\n", "cv_results = cross_validate(clf, X, y, cv=10, scoring=scoring)" ] }, { "cell_type": "markdown", "metadata": { "id": "2VwmvYoOZHyZ" }, "source": [ "`cross_validate` takes the following parameters:\n", "\n", "- estimator: classifier to evaluate\n", "- X: input data\n", "- y: labels\n", "- cv: number of folds the dataset is split into\n", "- scoring: metrics to compute\n", "- return_train_score: if `True`, also computes the metrics on the training folds\n", "- return_estimator: if `True`, returns the classifiers trained in each iteration\n", "\n", "and returns:\n", "\n", "- 
fit_time: time taken by training\n", "- score_time: time taken by evaluation\n", "- test_score: evaluation metrics for each fold (one `test_<metric>` entry per metric when several are given, as here)\n", "- train_score: training metrics for each fold (if `return_train_score` is `True`)\n", "\n", "The function returns a dictionary with the results obtained for each metric, along with the time taken by training and evaluation." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
   fit_time  score_time  test_precision_macro  test_recall_macro  test_accuracy  test_f1_macro
0  0.001002    0.002999              1.000000           1.000000       1.000000       1.000000
1  0.001007    0.002003              0.944444           0.933333       0.933333       0.932660
2  0.000000    0.002999              1.000000           1.000000       1.000000       1.000000
3  0.000000    0.003007              0.944444           0.933333       0.933333       0.932660
4  0.001003    0.001997              0.944444           0.933333       0.933333       0.932660
5  0.001000    0.002000              0.866667           0.866667       0.866667       0.866667
6  0.001000    0.002004              0.944444           0.933333       0.933333       0.932660
7  0.000000    0.003000              1.000000           1.000000       1.000000       1.000000
8  0.000000    0.003001              1.000000           1.000000       1.000000       1.000000
9  0.000000    0.002999              1.000000           1.000000       1.000000       1.000000
\n", "
" ], "text/plain": [ " fit_time score_time test_precision_macro test_recall_macro \\\n", "0 0.001002 0.002999 1.000000 1.000000 \n", "1 0.001007 0.002003 0.944444 0.933333 \n", "2 0.000000 0.002999 1.000000 1.000000 \n", "3 0.000000 0.003007 0.944444 0.933333 \n", "4 0.001003 0.001997 0.944444 0.933333 \n", "5 0.001000 0.002000 0.866667 0.866667 \n", "6 0.001000 0.002004 0.944444 0.933333 \n", "7 0.000000 0.003000 1.000000 1.000000 \n", "8 0.000000 0.003001 1.000000 1.000000 \n", "9 0.000000 0.002999 1.000000 1.000000 \n", "\n", " test_accuracy test_f1_macro \n", "0 1.000000 1.000000 \n", "1 0.933333 0.932660 \n", "2 1.000000 1.000000 \n", "3 0.933333 0.932660 \n", "4 0.933333 0.932660 \n", "5 0.866667 0.866667 \n", "6 0.933333 0.932660 \n", "7 1.000000 1.000000 \n", "8 1.000000 1.000000 \n", "9 1.000000 1.000000 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_dict(cv_results)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3oKHCkefY9Sv", "outputId": "60166d0a-76e9-41fc-b87d-09c9d3f81504" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Precision: 0.9644444444444445\n", "Mean Recall: 0.96\n", "Mean F1-score: 0.9597306397306398\n", "Mean Accuracy: 0.96\n" ] } ], "source": [ "print('Mean Precision:', np.mean(cv_results['test_precision_macro']))\n", "print('Mean Recall: ', np.mean(cv_results['test_recall_macro']))\n", "print('Mean F1-score: ', np.mean(cv_results['test_f1_macro']))\n", "print('Mean Accuracy: ', np.mean(cv_results['test_accuracy']))" ] }, { "cell_type": "markdown", "metadata": { "id": "hiDmY7aQiW9b" }, "source": [ "## **Chapter 6: Other classifiers**" ] }, { "cell_type": "markdown", "metadata": { "id": "6KQP9CpJibAi" }, "source": [ "Sklearn includes several classification models that can be trained and evaluated in the same way as the decision tree" 
] }, { "cell_type": "markdown", "metadata": { "id": "Dqx2y0mKjA1-" }, "source": [ "### **Naive Bayes**" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aJdmvxPhjI0Z", "outputId": "80674734-5c78-4867-b341-da0b1f90883f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.94 0.94 0.94 18\n", " 2 0.94 0.94 0.94 17\n", "\n", " accuracy 0.96 53\n", " macro avg 0.96 0.96 0.96 53\n", "weighted avg 0.96 0.96 0.96 53\n", "\n" ] } ], "source": [ "nb_clf = GaussianNB()\n", "\n", "nb_clf.fit(X_train, y_train)\n", "\n", "y_pred = nb_clf.predict(X_test)\n", "\n", "nb_acc = accuracy_score(y_test, y_pred)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": { "id": "7-dho-y6lcOk" }, "source": [ "### **Random Forest**" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NCbXbOMxlb0R", "outputId": "7e5793da-3f66-42c4-f523-848d84798453" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.89 0.94 0.92 18\n", " 2 0.94 0.88 0.91 17\n", "\n", " accuracy 0.94 53\n", " macro avg 0.94 0.94 0.94 53\n", "weighted avg 0.94 0.94 0.94 53\n", "\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rd_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)\n", "\n", "rd_clf.fit(X_train, y_train)\n", "\n", "y_pred = rd_clf.predict(X_test)\n", "\n", "rd_acc = accuracy_score(y_test, y_pred)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Recall that for distance-based classifiers it is important to **scale the data**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **K-Nearest Neighbors**" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YVZjoNeRkXxf", "outputId": "4e821922-2185-4464-fabf-12239bf37869" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.94 0.94 0.94 18\n", " 2 0.94 0.94 0.94 17\n", "\n", " accuracy 0.96 53\n", " macro avg 0.96 0.96 0.96 53\n", "weighted avg 0.96 0.96 0.96 53\n", "\n" ] } ], "source": [ "kn_clf = KNeighborsClassifier(n_neighbors=5)\n", "\n", "# Use the data scaled with the StandardScaler\n", "kn_clf.fit(X_train_std_scaled, y_train)\n", "\n", "y_pred = kn_clf.predict(X_test_std_scaled)\n", "\n", "kn_acc = accuracy_score(y_test, y_pred)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": { "id": "HycRF4kWkvUg" }, "source": [ "### **Support Vector Machine**" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bpqLn_6ik7Tx", "outputId": "7f0249e8-2624-4a79-ee89-b74342ae0a19" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 18\n", " 1 0.94 0.94 0.94 18\n", " 2 0.94 0.94 0.94 17\n", "\n", " accuracy 0.96 53\n", " macro avg 0.96 0.96 0.96 53\n", "weighted avg 0.96 0.96 0.96 53\n", "\n" ] } ], "source": [ "sv_clf = SVC(C=1.0, kernel='rbf')\n", "\n", "sv_clf.fit(X_train_std_scaled, y_train)\n", "\n", "y_pred = sv_clf.predict(X_test_std_scaled)\n", "\n", "sv_acc = accuracy_score(y_test, y_pred)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": { "id": "mG2LFPXZo4Mm" }, "source": [ "### **Dummy**" ] }, { "cell_type": "markdown", "metadata": { "id": "_o9jhwLxtFxy" }, "source": [ "Sklearn also includes the 
`DummyClassifier` class, a classifier that makes trivial predictions and can be used as a baseline for the other classifiers. If a model scores about the same as `DummyClassifier`, we can say it has not managed to learn from the data." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BM3ktSULo3ED", "outputId": "6c0d8142-2840-44de-ef3c-2a797c1adbd6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.27 0.39 0.32 18\n", " 1 0.07 0.06 0.06 18\n", " 2 0.23 0.18 0.20 17\n", "\n", " accuracy 0.21 53\n", " macro avg 0.19 0.21 0.19 53\n", "weighted avg 0.19 0.21 0.19 53\n", "\n" ] } ], "source": [ "dm_clf = DummyClassifier(strategy='stratified')\n", "\n", "dm_clf.fit(X_train, y_train)\n", "\n", "y_pred = dm_clf.predict(X_test)\n", "\n", "dm_acc = accuracy_score(y_test, y_pred)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Chapter 7: Analysis of results**" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 450 }, "id": "QhcFR1pPpGtL", "outputId": "1f9bc3b1-a3d0-4dba-d98e-02abd0c662a8" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dt_acc = accuracy_score(y_test, clf.predict(X_test))\n", "\n", "fig, ax = plt.subplots(figsize=(7.5, 5))\n", "\n", "accuracies = [dt_acc, nb_acc, kn_acc, sv_acc, rd_acc]\n", "classifires = ['Decision Tree', 'Naive Bayes', 'K-Nearest', 'SVM', 'Random Forest']\n", "\n", "ax.bar(classifires, accuracies)\n", "ax.axhline(dm_acc, color='r', linestyle='--', label='dummy')\n", "\n", "ax.legend()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Epílogo: Feature Extraction**\n", "\n", "##### [Documentación de sklearn](https://scikit-learn.org/stable/modules/feature_extraction.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Dict to Vector**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Algo común es tener datos en una lista con diccionarios, y querer convertirlos a una matriz para poder entrenar un modelo. Para esto, sklearn tiene la clase **DictVectorizer**." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "people_data = [\n", "    {'name': 'John', 'age': 25, 'income': 50000},\n", "    {'name': 'Linda', 'income': 80000},\n", "    {'name': 'Peter', 'age': 45, 'income': 70000},\n", "    {'name': 'Anna', 'age': 35},\n", "]" ] },
{ "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature names: ['age' 'income' 'name=Anna' 'name=John' 'name=Linda' 'name=Peter']\n", "Data:\n", " [[2.5e+01 5.0e+04 0.0e+00 1.0e+00 0.0e+00 0.0e+00]\n", " [0.0e+00 8.0e+04 0.0e+00 0.0e+00 1.0e+00 0.0e+00]\n", " [4.5e+01 7.0e+04 0.0e+00 0.0e+00 0.0e+00 1.0e+00]\n", " [3.5e+01 0.0e+00 1.0e+00 0.0e+00 0.0e+00 0.0e+00]]\n" ] } ], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "\n", "dv = DictVectorizer()\n", "\n", "people_data_encoded = dv.fit_transform(people_data)\n", "print(\"Feature names:\", dv.get_feature_names_out())\n", "print(\"Data:\\n\", people_data_encoded.toarray())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**IMPORTANT:** `fit_transform` returns a sparse matrix, so it is common to call `toarray()` afterwards to get a dense array that is easier to work with." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **Text to Vector**" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Another common type of data is text. To convert text into vectors we will look at **CountVectorizer**, one of the simplest vectorizers to understand.\n", "\n", "**CountVectorizer** converts a collection of texts into a matrix where each row is a text and each column is a word. The entry in row i, column j is the number of times word j appears in text i.\n", "\n", "**E.g.**: Given the texts \"hello world\" and \"world world\", the resulting matrix would be:\n", "\n", "```\n", "[[1, 1],\n", " [0, 2]]\n", "```\n", "\n", "Here the first column corresponds to the word \"hello\" and the second to the word \"world\"." ] },
{ "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "example_text = [\n", "    'The flowers are beautiful this time of year.',\n", "    'The weather is nice and sunny.',\n", "    'The flowers are blooming.',\n", "    'The sun is shining and the weather is sweet.',\n", "]" ] },
{ "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature names: ['and' 'are' 'beautiful' 'blooming' 'flowers' 'is' 'nice' 'of' 'shining'\n", " 'sun' 'sunny' 'sweet' 'the' 'this' 'time' 'weather' 'year']\n", "Data:\n", " [[0 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1]\n", " [1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0]\n", " [0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0]\n", " [1 0 0 0 0 2 0 0 1 1 0 1 2 0 0 1 0]]\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "count_vectorizer = CountVectorizer()\n", "converted_text = count_vectorizer.fit_transform(example_text)\n", "print(\"Feature names:\", count_vectorizer.get_feature_names_out())\n", "print(\"Data:\\n\", converted_text.toarray())" ] },
{ "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>and</th><th>are</th><th>beautiful</th><th>blooming</th><th>flowers</th><th>is</th><th>nice</th><th>of</th><th>shining</th><th>sun</th><th>sunny</th><th>sweet</th><th>the</th><th>this</th><th>time</th><th>weather</th><th>year</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>\n", "    <tr><th>1</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n", "    <tr><th>2</th><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>3</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>2</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   and  are  beautiful  blooming  flowers  is  nice  of  shining  sun  sunny  \\\n", "0    0    1          1         0        1   0     0   1        0    0      0  \n", "1    1    0          0         0        0   1     1   0        0    0      1  \n", "2    0    1          0         1        1   0     0   0        0    0      0  \n", "3    1    0          0         0        0   2     0   0        1    1      0  \n", "\n", "   sweet  the  this  time  weather  year  \n", "0      0    1     1     1        0     1  \n", "1      0    1     0     0        1     0  \n", "2      0    1     0     0        0     0  \n", "3      1    2     0     0        1     0  " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(converted_text.toarray(), columns=count_vectorizer.get_feature_names_out())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **Spin-Off: Pipelines, the Prophecy**" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Pipelines are a way to chain several preprocessing steps and a model into a single object. This ensures we never forget to preprocess the test data the same way as the training data.\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that once we have a pipeline, **we do not need to scale the data manually** before training or before making predictions, since the pipeline does it when we call the **fit** and **predict** methods."
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       1.00      1.00      1.00        15\n", "           1       0.94      1.00      0.97        15\n", "           2       1.00      0.93      0.97        15\n", "\n", "    accuracy                           0.98        45\n", "   macro avg       0.98      0.98      0.98        45\n", "weighted avg       0.98      0.98      0.98        45\n", "\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "\n", "X = iris.data\n", "y = iris.target\n", "\n", "# We use only train and test splits to keep the example simple\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)\n", "\n", "# Create a pipeline that first scales the data and then trains a logistic regression classifier\n", "clf_pipeline = Pipeline([\n", "    ('scaler', StandardScaler()),\n", "    ('classifier', LogisticRegression())\n", "])\n", "\n", "\n", "# Note that with the pipeline we do not need to scale the data manually,\n", "# since the pipeline does it when running fit, predict, etc.\n", "clf_pipeline.fit(X_train, y_train)\n", "\n", "y_pred = clf_pipeline.predict(X_test)\n", "\n", "print(classification_report(y_test, y_pred))" ] } ], "metadata": { "colab": { "provenance": [] }, "hide_input": false, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": {
"delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }