{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "cSJxtOS0BMTh" }, "source": [ "

\n", "

Recitation 1 - Web Content Mining\n", "

Teaching assistant: Bastián Bas A.\n", "

IN5526 - Web Intelligence\n", "

Spring 2024\n", "\n", "---" ] },
{ "cell_type": "markdown", "metadata": { "id": "zGISxLKtpnx4" }, "source": [ "# Data Selection: Web Scraper" ] },
{ "cell_type": "markdown", "metadata": { "id": "NVNvO7BSqjiH" }, "source": [ "## Installation and library imports" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qiOpk6bMqnxn", "outputId": "3eeaa3e1-4c9e-42d9-8923-450219a0fd4e" }, "outputs": [], "source": [
"# Installation\n",
"!pip install selenium\n",
"!apt-get update\n",
"!apt install chromium-chromedriver\n",
"import sys\n",
"sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')\n",
"!python -m spacy download es_core_news_md\n",
"!pip install nltk\n",
"\n",
"# Libraries\n",
"from selenium import webdriver\n",
"from selenium.common.exceptions import TimeoutException, StaleElementReferenceException, NoSuchElementException\n",
"from selenium.webdriver.common.by import By\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"import time\n",
"import random\n",
"import urllib\n",
"import re" ] },
{ "cell_type": "markdown", "metadata": { "id": "oi5v0xehr9Bq" }, "source": [ "## Collecting article links" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "id": "rjmEC59zse1s" }, "outputs": [], "source": [
"# Categories\n",
"categories = [\n",
"    {\"category\": \"acompanamientos\", \"url\": 'https://www.recetasnestle.cl/categorias/acompanamientos'},\n",
"    {\"category\": \"almuerzo\", \"url\": 'https://www.recetasnestle.cl/categorias/almuerzo'},\n",
"    
{\"category\": \"cena\", \"url\": 'https://www.recetasnestle.cl/categorias/cena'},\n", " {\"category\": \"postres\", \"url\": 'https://www.recetasnestle.cl/categorias/postres'},\n", " {\"category\": \"reposteria\", \"url\": 'https://www.recetasnestle.cl/categorias/reposteria'},\n", " {\"category\": \"ensaladas\", \"url\": 'https://www.recetasnestle.cl/categorias/ensaladas'},\n", " {\"category\": \"entradas\", \"url\": 'https://www.recetasnestle.cl/categorias/entradas'},\n", "]\n", "\n", "# Configuracion del Webdriver\n", "chrome_options = webdriver.ChromeOptions()\n", "chrome_options.add_argument('--headless')\n", "chrome_options.add_argument('--no-sandbox')\n", "chrome_options.add_argument('--disable-dev-shm-usage')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "id": "WsXq1XUpsoYA" }, "outputs": [], "source": [ "# Funcion Links\n", "# Scrapea para obtener links de articulos de noticias a partir\n", "# de una URL base que dirige hacia un listado de articulos\n", "def links(url):\n", " global chrome_options\n", "\n", " # Creacion y configuracion de Webdriver\n", " driver = webdriver.Chrome(options=chrome_options)\n", "\n", " try:\n", " # Inicializamos las variables\n", " data = []\n", " page = 1\n", "\n", " while True:\n", "\n", " print(f\"\\r\\tLinks recopilados: {len(data)} / 1000\", end=\"\")\n", " # Completamos la URL base y accedemos\n", " driver.get(f\"{url}?p={page}\")\n", "\n", " # Obtenemos la lista de elementos de articulos\n", " recipeElements = driver.find_element(By.CLASS_NAME, 'page-contents').find_elements(By.CLASS_NAME ,'recipeCard')\n", " for element in recipeElements:\n", " # Extraemos el link\n", " link = element.find_element(By.TAG_NAME, 'a')\n", " link = link.get_attribute('href')\n", " data.append(link)\n", "\n", " # Si llegamos a 1000 links, terminamos\n", " if len(data) == 1000:\n", " break\n", "\n", " # Si llegamos a 1000 links o ya no hay más recetas, terminamos\n", " if len(data) == 1000 or len(recipeElements) == 0:\n", 
" print(f\"\\r\\tLinks recopilados: {len(data)} / 1000\")\n", " print(f\"\\tSe han encontrado {len(data)} links\")\n", " break\n", "\n", " page += 1\n", "\n", " driver.quit()\n", "\n", " except TimeoutException:\n", " print(\"Se excedió el tiempo de busqueda de un link\")\n", "\n", " return data" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "u88YRMIGs_lP" }, "outputs": [], "source": [ "def scraper(links, data, category):\n", " global chrome_options\n", " # Creacion y configuracion de Webdriver\n", " driver = webdriver.Chrome(options=chrome_options)\n", " progress = 0\n", " errors = 0\n", " for link in links:\n", " print(f\"\\r\\tLinks scrapeados: {progress} / {len(links)}\", end=\"\")\n", " try:\n", " # Accedemos a la receta\n", " driver.get(link)\n", " recipes = {}\n", " try:\n", " # Obtenemos el título\n", " title = driver.find_element(By.CLASS_NAME, 'recipeDetail__intro').find_element(By.TAG_NAME, 'h1').get_attribute('innerHTML')\n", "\n", " # Obtenemos el tiempo de preparación en minutos\n", " preparationTime = driver.find_element(By.CLASS_NAME, 'recipeDetail__infoItem--time').get_attribute('textContent')\n", " preparationTime = int(re.findall(\"\\d+\", preparationTime)[0])\n", "\n", " # Obtenemos las porciones\n", " servings = driver.find_element(By.CLASS_NAME, 'recipeDetail__infoItem--serving').find_element(By.TAG_NAME, 'span').get_attribute(\"textContent\")\n", " servings = int(servings)\n", "\n", " # Obtenemos los ingredientes\n", " ingredientsList = []\n", " ingredientElements = driver.find_element(By.CLASS_NAME, 'recipeDetail__ingredients').find_elements(By.TAG_NAME, \"li\")\n", " for element in ingredientElements:\n", " ingredient = element.get_attribute(\"textContent\")\n", " ingredientsList.append(ingredient)\n", "\n", " # Obtenemos los pasos a seguir\n", " stepsList = []\n", " stepElements = driver.find_elements(By.CLASS_NAME, \"recipeDetail__stepItem\")\n", " for element in stepElements:\n", " step = 
element.find_element(By.TAG_NAME, \"label\").get_attribute(\"textContent\")\n", " stepsList.append(step)\n", "\n", " # Asignamos las variables extraidas a un diccionario\n", " recipes['title'] = title\n", " recipes['category'] = category\n", " recipes['preparationTime'] = preparationTime\n", " recipes['servings'] = servings\n", " recipes['ingredients'] = ingredientsList\n", " recipes['steps'] = stepsList\n", " data.append(recipes)\n", "\n", " progress += 1\n", "\n", " if len(data) % 200 == 0:\n", " pass\n", "\n", " except StaleElementReferenceException:\n", " errors += 1\n", " pass\n", " except NoSuchElementException:\n", " errors += 1\n", " pass\n", " except TimeoutException:\n", " pass\n", "\n", " print(f\"\\r\\tLinks scrapeados: {progress} / {len(links)} (Errores: {errors})\")\n", " print(f\"\\tSe han scrapeado {len(data)} recetas distintas\")\n", " driver.quit()\n", " return data" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "id": "NsNPa1w8taMJ" }, "outputs": [], "source": [ "# Funcion collect\n", "# Realiza el scrapeo de links y contenido de articulos de forma secuencial\n", "# a partir de las categorías base\n", "\n", "def collect(categories):\n", " data = list()\n", " for categoryDict in categories:\n", " category, url = categoryDict[\"category\"], categoryDict[\"url\"]\n", "\n", " print(f\"Iniciando Scraper de la categoría `{category}`\")\n", " print('Scrapeando links de recetas...')\n", " collected_links = links(url)\n", " print(f\"Listo. 
La cantidad de links disponibles para scrapear es: {len(collected_links)}\")\n", "\n", " nToSelect = min(200, len(collected_links))\n", " print(f\"Seleccionando muestra aleatoria de {nToSelect} links...\")\n", " selected_links = random.sample(collected_links, nToSelect)\n", "\n", " print('Scrapeando contenido de links...')\n", " data = scraper(selected_links, data, category)\n", " print(f\"Scrapeo de la categoria `{category}` finalizado\\n\")\n", " return data" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "86HA2ZxytvRQ", "outputId": "906e7a87-8ec5-4509-825f-55acd4381d81" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Iniciando Scraper de la categoría `acompanamientos`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 60 / 1000\n", "\tSe han encontrado 60 links\n", "Listo. La cantidad de links disponibles para scrapear es: 60\n", "Seleccionando muestra aleatoria de 60 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 60 / 60 (Errores: 0)\n", "\tSe han scrapeado 60 recetas distintas\n", "Scrapeo de la categoria `acompanamientos` finalizado\n", "\n", "Iniciando Scraper de la categoría `almuerzo`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 576 / 1000\n", "\tSe han encontrado 576 links\n", "Listo. La cantidad de links disponibles para scrapear es: 576\n", "Seleccionando muestra aleatoria de 200 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 200 / 200 (Errores: 0)\n", "\tSe han scrapeado 260 recetas distintas\n", "Scrapeo de la categoria `almuerzo` finalizado\n", "\n", "Iniciando Scraper de la categoría `cena`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 408 / 1000\n", "\tSe han encontrado 408 links\n", "Listo. 
La cantidad de links disponibles para scrapear es: 408\n", "Seleccionando muestra aleatoria de 200 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 200 / 200 (Errores: 0)\n", "\tSe han scrapeado 460 recetas distintas\n", "Scrapeo de la categoria `cena` finalizado\n", "\n", "Iniciando Scraper de la categoría `postres`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 242 / 1000\n", "\tSe han encontrado 242 links\n", "Listo. La cantidad de links disponibles para scrapear es: 242\n", "Seleccionando muestra aleatoria de 200 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 200 / 200 (Errores: 0)\n", "\tSe han scrapeado 660 recetas distintas\n", "Scrapeo de la categoria `postres` finalizado\n", "\n", "Iniciando Scraper de la categoría `reposteria`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 259 / 1000\n", "\tSe han encontrado 259 links\n", "Listo. La cantidad de links disponibles para scrapear es: 259\n", "Seleccionando muestra aleatoria de 200 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 199 / 200 (Errores: 1)\n", "\tSe han scrapeado 859 recetas distintas\n", "Scrapeo de la categoria `reposteria` finalizado\n", "\n", "Iniciando Scraper de la categoría `ensaladas`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 49 / 1000\n", "\tSe han encontrado 49 links\n", "Listo. La cantidad de links disponibles para scrapear es: 49\n", "Seleccionando muestra aleatoria de 49 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 49 / 49 (Errores: 0)\n", "\tSe han scrapeado 908 recetas distintas\n", "Scrapeo de la categoria `ensaladas` finalizado\n", "\n", "Iniciando Scraper de la categoría `entradas`\n", "Scrapeando links de recetas...\n", "\tLinks recopilados: 43 / 1000\n", "\tSe han encontrado 43 links\n", "Listo. 
La cantidad de links disponibles para scrapear es: 43\n", "Seleccionando muestra aleatoria de 43 links...\n", "Scrapeando contenido de links...\n", "\tLinks scrapeados: 42 / 43 (Errores: 1)\n", "\tSe han scrapeado 950 recetas distintas\n", "Scrapeo de la categoria `entradas` finalizado\n", "\n" ] } ], "source": [ "# Scrape every category and store the results in a .json file\n", "data = collect(categories)\n", "db = pd.DataFrame(data)\n", "with open('nestle_recipes.json', 'w', encoding='utf-8') as file:\n", "    db.to_json(file, force_ascii=False)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Pgb2tHeeuEZU", "outputId": "0b996bc1-72da-4a20-c605-99cd25ad892d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title category \\\n", "0 \\n Puré de Betarraga\\n acompanamientos \n", "1 \\n Receta deliciosa de panes saborizados\\n acompanamientos \n", "2 \\n Ensalada de zapallo asado, pimiento, kale ... acompanamientos \n", "3 \\n Chupe verde de Espinacas, Champiñon y rico... acompanamientos \n", "4 \\n Crema de zapallo italiano y especias\\n acompanamientos \n", "\n", " preparationTime servings \\\n", "0 1 9 \n", "1 70 10 \n", "2 37 3 \n", "3 30 6 \n", "4 2 6 \n", "\n", " ingredients \\\n", "0 [\\n\\t\\t\\t 1 Caja de puré de papas MAGGI® de... \n", "1 [\\n\\t\\t\\t 3 Tazas de harina\\n\\n\\t\\t, \\n\\t\\t... \n", "2 [\\n\\t\\t\\t 200 gr de zapallo cortado en cubo... \n", "3 [\\n\\t\\t\\t ½ \\tCebolla mediana cortada en pl... \n", "4 [\\n\\t\\t\\t 3\\tZapallos italianos cortados en... \n", "\n", " steps \n", "0 [\\n \\n 1.1.- P... \n", "1 [\\n \\n 1.1.- Dej... \n", "2 [\\n \\n 1.En una ... \n", "3 [\\n \\n 1.Comienz... \n", "4 [\\n \\n 1.Comienz... " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "db", "summary": "{\n \"name\": \"db\",\n \"rows\": 950,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 805,\n \"samples\": [\n \"\\n Gnocchi de Garbanzos con salsa Tuco\\n\",\n \"\\n Corona de Chocolate\\n\",\n \"\\n Ensalada C\\u00e9sar de Camaron\\n\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"category\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"acompanamientos\",\n \"almuerzo\",\n \"ensaladas\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"preparationTime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 22,\n \"min\": 0,\n \"max\": 155,\n \"num_unique_values\": 78,\n \"samples\": [\n 31,\n 1,\n 43\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"servings\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1,\n \"max\": 60,\n \"num_unique_values\": 24,\n \"samples\": [\n 15,\n 14,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"ingredients\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"steps\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 30 } ], "source": [ "# Display the results\n", "db.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "J-7EkSLSAGdj", "outputId": "1f211a0c-7fdd-4dc4-fe55-fc270ddb6991" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 \\n Puré de Betarraga\\n \n", "1 \\n Receta deliciosa de panes 
saborizados\\n \n", "2 \\n Ensalada de zapallo asado, pimiento, kale ... \n", "3 \\n Chupe verde de Espinacas, Champiñon y rico... \n", "4 \\n Crema de zapallo italiano y especias\\n \n", "\n", " ingredients \\\n", "0 [\\n\\t\\t\\t 1 Caja de puré de papas MAGGI® de... \n", "1 [\\n\\t\\t\\t 3 Tazas de harina\\n\\n\\t\\t, \\n\\t\\t... \n", "2 [\\n\\t\\t\\t 200 gr de zapallo cortado en cubo... \n", "3 [\\n\\t\\t\\t ½ \\tCebolla mediana cortada en pl... \n", "4 [\\n\\t\\t\\t 3\\tZapallos italianos cortados en... \n", "\n", " steps category \\\n", "0 [\\n \\n 1.1.- P... acompanamientos \n", "1 [\\n \\n 1.1.- Dej... acompanamientos \n", "2 [\\n \\n 1.En una ... acompanamientos \n", "3 [\\n \\n 1.Comienz... acompanamientos \n", "4 [\\n \\n 1.Comienz... acompanamientos \n", "\n", " preparationTime \n", "0 1 \n", "1 70 \n", "2 37 \n", "3 30 \n", "4 2 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df", "summary": "{\n \"name\": \"df\",\n \"rows\": 950,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 805,\n \"samples\": [\n \"\\n Gnocchi de Garbanzos con salsa Tuco\\n\",\n \"\\n Corona de Chocolate\\n\",\n \"\\n Ensalada C\\u00e9sar de Camaron\\n\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"ingredients\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"steps\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"category\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"acompanamientos\",\n \"almuerzo\",\n \"ensaladas\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"preparationTime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 22,\n \"min\": 0,\n \"max\": 155,\n \"num_unique_values\": 78,\n \"samples\": [\n 31,\n 1,\n 43\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 67 } ], "source": [ "# Load the saved data\n", "df = pd.read_json('nestle_recipes.json')\n", "\n", "# Select the features and the target variables\n", "df = df[[\"title\", \"ingredients\", \"steps\", \"category\", \"preparationTime\"]].copy()\n", "\n", "# Display the features\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "OvbO_ANcAZVp" }, "source": [ "# Preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "Yoi8AmtUBBs2" }, "source": [ "We start by keeping only the categories we want. To do so, we check which categories exist in the dataset and manually choose the ones to work with."
] }, { "cell_type": "code", "execution_count": 68, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_qPv8ji9AqPo", "outputId": "adfbcfc8-bf83-4bcc-9644-0e64108da6cf" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['acompanamientos', 'almuerzo', 'cena', 'postres', 'reposteria',\n", " 'ensaladas', 'entradas'], dtype=object)" ] }, "metadata": {}, "execution_count": 68 } ], "source": [ "# Display the available categories\n", "df['category'].unique()" ] }, { "cell_type": "code", "source": [ "# Check how many recipes were collected for each category\n", "df.groupby(\"category\").count()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "-PTD24433nc_", "outputId": "70441d78-977c-4f23-981e-c52a856fd2c9" }, "execution_count": 69, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title ingredients steps preparationTime\n", "category \n", "acompanamientos 60 60 60 60\n", "almuerzo 200 200 200 200\n", "cena 200 200 200 200\n", "ensaladas 49 49 49 49\n", "entradas 42 42 42 42\n", "postres 200 200 200 200\n", "reposteria 199 199 199 199" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"category\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"acompanamientos\",\n \"almuerzo\",\n \"postres\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 42,\n \"max\": 200,\n \"num_unique_values\": 5,\n \"samples\": [\n 200,\n 199,\n 49\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"ingredients\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 42,\n \"max\": 200,\n \"num_unique_values\": 5,\n \"samples\": [\n 200,\n 199,\n 49\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"steps\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 42,\n \"max\": 200,\n \"num_unique_values\": 5,\n \"samples\": [\n 200,\n 199,\n 49\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"preparationTime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 42,\n \"max\": 200,\n \"num_unique_values\": 5,\n \"samples\": [\n 200,\n 199,\n 49\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 69 } ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mIqsyWOuBQOJ", "outputId": "03f951ad-42d0-46ea-e35a-3e9e8285b02a" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['acompanamientos', 'almuerzo', 'postres', 'reposteria',\n", " 'ensaladas', 'entradas'], dtype=object)" ] }, "metadata": {}, "execution_count": 70 } ], "source": [ "# Select the categories to use\n", "selected_categories = ['acompanamientos', 'almuerzo', 
'ensaladas', 'entradas', 'postres', 'reposteria']\n", "df = df[df['category'].isin(selected_categories)].copy()\n", "df['category'].unique()" ] }, { "cell_type": "markdown", "metadata": { "id": "dNkVOh14BfKe" }, "source": [ "Next, we clean the data." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Qk-nG4NBB62T", "outputId": "903c65a1-f357-40ab-d48a-33ec873ece1c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "title \\n Puré Picante al Pimentón\\n\n", "ingredients [\\n\\t\\t\\t 1 Estuche de puré de papas MAGGI®...\n", "steps [\\n \\n 1.1.- Pre...\n", "category acompanamientos\n", "preparationTime 16\n", "Name: 22, dtype: object\n", "'\\n Puré Picante al Pimentón\\n'\n", "['\\n\\t\\t\\t 1 Estuche de puré de papas MAGGI® (125 g)\\n\\n\\t\\t', '\\n\\t\\t\\t 1 Cucharadita de sal\\n\\n\\t\\t', '\\n\\t\\t\\t 2 Cucharadas de mantequilla\\n\\n\\t\\t', '\\n\\t\\t\\t 1 Taza de leche\\n\\n\\t\\t', '\\n\\t\\t\\t 2 Cucharaditas rasas de ají en pasta\\n\\n\\t\\t', '\\n\\t\\t\\t 1/4 Cucharadita de merkén a gusto\\n\\n\\t\\t', '\\n\\t\\t\\t 1/2 Pimentón rojo cortado en cubos pequeños y salteados\\n\\n\\t\\t']\n", "['\\n \\n 1.1.- Prepara el puré de papas MAGGI® según las indicaciones del envase en el agua caliente con la mantequilla y la sal, agrega la leche y luego el puré de papas MAGGI®. Una vez reposado e hidratado agrega el ají en pasta con el merkén.\\n\\n \\n ', '\\n \\n 2.2.- Luego, agrega de inmediato el pimentón salteado y revuelve con suaves movimientos hasta homogenizar los ingredientes. 
Una vez listo sirve de inmediato, acompañado de algún tipo de carne que más te guste.\\n\\n \\n ', '\\n \\n 3.3.- Ya puedes disfrutar de tu Puré Picante al Pimentón.\\n\\n \\n ']\n", "'acompanamientos'\n", "16\n" ] } ], "source": [ "# Inspect the values of one row\n", "example_row = df.iloc[22, :]\n", "print(example_row)\n", "for value in example_row:\n", "    print(repr(value))" ] }, { "cell_type": "code", "source": [ "# Join the elements of the ingredient and step lists into single strings\n", "df[\"ingredients\"] = df[\"ingredients\"].map(lambda x: \", \".join(x))\n", "df[\"steps\"] = df[\"steps\"].map(lambda x: \", \".join(x))" ], "metadata": { "id": "xQI6UMS_AYBh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "1a15370b-e97f-4cda-81f4-82e754bd7fff" }, "execution_count": 72, "outputs": [] }, { "cell_type": "code", "source": [ "# Concatenate the text features into a single field\n", "df[\"recipe_content\"] = df[\"title\"] + df[\"ingredients\"] + df[\"steps\"]\n", "df[\"recipe_content\"].iloc[0]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 255 }, "id": "o6gCAitIZtD-", "outputId": 
"bdfccfca-8710-432f-b05d-fbbfb5e7806f" }, "execution_count": 73, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'\\n Puré de Betarraga\\n\\n\\t\\t\\t 1 Caja de puré de papas MAGGI® de 250 g\\n\\n\\t\\t, \\n\\t\\t\\t 2 Betarragas cocidas\\n\\n\\t\\t, \\n\\t\\t\\t 1 Ramito de ciboulette cortado finamente\\n\\n\\t\\t, \\n\\t\\t\\t 60 g de maní tostado\\n\\n\\t\\t, \\n\\t\\t\\t 2 Huevos duros y rallados\\n\\n\\t\\t, \\n\\t\\t\\t 2 Pizcas de pimienta negra\\n\\n\\t\\t\\n \\n 1.1.- Prepara el puré de papas MAGGI® según las indicaciones del envase con el agua caliente, mantequilla, sal y leche. Deja reposando tapado y recuerda que no es necesario batir.\\n\\n \\n , \\n \\n 2.2.- Aparte muele las betarragas ya cocidas con la ayuda de una mini pimer o juguera, agrega esta molienda al puré de papas ya preparado y mezcla con suaves movimientos. 
Una vez listo, sirve en una fuente y agrega el superficie el huevo rallado con el maní tostado y finalmente con el ciboulette cortado finamente.\\n\\n \\n , \\n \\n 3.3.- Ya puedes disfrutar de tu Puré de betarraga.\\n\\n \\n '" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 73 } ] }, { "cell_type": "code", "source": [ "# Helper that strips HTML tags and entities, line breaks and redundant whitespace\n", "def clean_string(string):\n", "    if string is None:\n", "        return string\n", "\n", "    html_tags = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')\n", "    spaces = re.compile(r'\\s\\s+')\n", "\n", "    string = re.sub(html_tags, \" \", string)\n", "    string = re.sub(spaces, \" \", string)\n", "    string = string.replace(\"\\t\", \" \")\n", "    string = string.replace(\" ,\", \",\")\n", "    string = string.strip(\" \").strip(\"\\n\")\n", "    return string" ], "metadata": { "id": "ymp1T4FfCiI6" }, "execution_count": 74, "outputs": [] }, { "cell_type": "code", "execution_count": 75, "metadata": { "id": "fN677jGeCUSy" }, "outputs": [], "source": [ "# Clean the strings in the dataset\n", "df[\"recipe_content\"] = df[\"recipe_content\"].map(clean_string)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 109 }, "id": "NBe9lF2qJ5lB", "outputId": "265a3abb-90ed-4a7a-f999-fea1df5d1d50" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Puré Picante al Pimentón 1 Estuche de puré de papas MAGGI® (125 g), 1 Cucharadita de sal, 2 Cucharadas de mantequilla, 1 Taza de leche, 2 Cucharaditas rasas de ají en pasta, 1/4 Cucharadita de merkén a gusto, 1/2 Pimentón rojo cortado en cubos pequeños y salteados 1.1.- Prepara el puré de papas MAGGI® según las indicaciones del envase en el agua caliente con la mantequilla y la sal, agrega la leche y luego el puré de papas MAGGI®. 
Una vez reposado e hidratado agrega el ají en pasta con el merkén., 2.2.- Luego, agrega de inmediato el pimentón salteado y revuelve con suaves movimientos hasta homogenizar los ingredientes. Una vez listo sirve de inmediato, acompañado de algún tipo de carne que más te guste., 3.3.- Ya puedes disfrutar de tu Puré Picante al Pimentón.'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 76 } ], "source": [ "df[\"recipe_content\"].iloc[22]" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WN6FevzfEmMc", "outputId": "12216b12-7537-46d0-cf2c-2f1b887f6de3" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import nltk\n", "import string\n", "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "from nltk.corpus import stopwords\n", "from nltk.tokenize import word_tokenize\n", "from sklearn.feature_extraction.text import strip_accents_ascii\n", "\n", "# Strip accents\n", "df['recipe_content'] = df['recipe_content'].apply(lambda x: strip_accents_ascii(x) if x is not None else x)\n", "\n", "# Convert to lowercase\n", "df['recipe_content'] = df['recipe_content'].str.lower()\n", "\n", "# Remove punctuation\n", "regex = re.compile('[%s]' % re.escape(string.punctuation))\n", "df['recipe_content'] = df['recipe_content'].apply(lambda x: regex.sub('', x) if x is not None else x)\n", "\n", "# Collapse extra whitespace\n", "df['recipe_content'] = df['recipe_content'].apply(lambda x: ' '.join(x.split()) if x is not None else x)\n", "\n", "# Remove stopwords (accents were stripped above, so accented entries such as 'más' no longer match and stay in the text)\n", "spanish_stopwords = stopwords.words('spanish')\n", 
"df['recipe_content'] = df['recipe_content'].apply(lambda x: ' '.join([i for i in word_tokenize(x) if i not in spanish_stopwords]) if x is not None else x)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 91 }, "id": "0XLC6FtnVfz9", "outputId": "4660501e-5c87-4ca6-deaa-afdcbd2e5dec" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'pure picante pimenton 1 estuche pure papas maggi 125 g 1 cucharadita sal 2 cucharadas mantequilla 1 taza leche 2 cucharaditas rasas aji pasta 14 cucharadita merken gusto 12 pimenton rojo cortado cubos pequenos salteados 11 prepara pure papas maggi segun indicaciones envase agua caliente mantequilla sal agrega leche luego pure papas maggi vez reposado hidratado agrega aji pasta merken 22 luego agrega inmediato pimenton salteado revuelve suaves movimientos homogenizar ingredientes vez listo sirve inmediato acompanado algun tipo carne mas guste 33 puedes disfrutar pure picante pimenton'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 78 } ], "source": [ "df[\"recipe_content\"].iloc[22]" ] }, { "cell_type": "markdown", "metadata": { "id": "HNu1eNZwFzXJ" }, "source": [ "# Transformation" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 91 }, "id": "98a_tbNeXXdL", "outputId": "cbbe403f-f7f6-4a52-bd1f-258f9c7c30a0" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'pure picante pimenton 1 estuche pure papas maggi 125 g 1 cucharadita sal 2 cucharada mantequilla 1 taza leche 2 cucharadita rasa aji pasta 14 cucharaditar merken gusto 12 pimenton rojo cortado cubo pequeno salteado 11 preparar pure papas maggi segun indicación envase agua caliente mantequillar sal agregar leche luego pure papas maggi vez reposado hidratado agregar aji pasta merken 22 luego agregar inmediato 
pimenton salteado revolver suave movimiento homogenizar ingrediente vez listo servir inmediato acompanado algun tipo carne mas gustar 33 poder disfrutar pure picante pimenton'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 79 } ], "source": [ "# Lemmatization\n", "import spacy\n", "nlp = spacy.load('es_core_news_md')\n", "\n", "test_lemma = df['recipe_content'].iloc[22]\n", "test_lemma = ' '.join([token.lemma_ for token in nlp(test_lemma)])\n", "test_lemma\n" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 91 }, "id": "rb7SM-hPXaf3", "outputId": "08f0f0f7-2f2e-45fe-9967-81df877ef2b2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'pur picant pimenton 1 estuch pur pap maggi 125 g 1 cucharadit sal 2 cuchar mantequill 1 taz lech 2 cucharadit ras aji past 14 cucharadit merk gust 12 pimenton roj cort cub pequen salt 11 prep pur pap maggi segun indic envas agu calient mantequill sal agreg lech lueg pur pap maggi vez rep hidrat agreg aji past merk 22 lueg agreg inmediat pimenton salt revuelv suav movimient homogeniz ingredient vez list sirv inmediat acompan algun tip carn mas gust 33 pued disfrut pur picant pimenton'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 80 } ], "source": [ "# Stemming\n", "from nltk import SnowballStemmer\n", "spanishstemmer = SnowballStemmer('spanish')\n", "\n", "test_stemming = df['recipe_content'].iloc[22]\n", "test_stemming = ' '.join(spanishstemmer.stem(token) for token in word_tokenize(test_stemming))\n", "test_stemming" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "id": "yBvf31sJGVta" }, "outputs": [], "source": [ "# We will use lemmatization\n", "df['recipe_content'] = df['recipe_content'].apply(lambda x: ' '.join(tok.lemma_ for tok in nlp(x)) if x is not None else x)" ] }, { 
"cell_type": "code", "execution_count": 82, "metadata": { "id": "yc5lyfEFHlSF" }, "outputs": [], "source": [ "# Build a bag of words\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "count_vectorizer = CountVectorizer()\n", "bow = count_vectorizer.fit_transform(df['recipe_content'])\n", "\n", "# Compute the TF-IDF weights\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "tfidf = TfidfTransformer()\n", "X = tfidf.fit_transform(bow).toarray()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "id": "ONJE_Dxrr1hc" }, "outputs": [], "source": [ "# Alternative: TfidfVectorizer performs both steps at once\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidfvectorizer = TfidfVectorizer()\n", "X = tfidfvectorizer.fit_transform(df['recipe_content'])" ] }, { "cell_type": "markdown", "metadata": { "id": "lSdBSmGEF3mj" }, "source": [ "# Data Mining" ] }, { "cell_type": "code", "source": [ "from imblearn.over_sampling import RandomOverSampler\n", "y_cat = df[\"category\"]\n", "y_prepTime = df[\"preparationTime\"]\n", "\n", "ros = RandomOverSampler(random_state=8)\n", "X_cat_resampled, y_cat_resampled = ros.fit_resample(X, y_cat)\n", "X_prepTime_resampled, y_prepTime_resampled = ros.fit_resample(X, y_prepTime)\n", "\n", "print(X.shape[0])\n", "print(X_cat_resampled.shape[0])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_j3-jsbWv7pi", "outputId": "21293a58-a5af-4563-bfbd-956c3a079288" }, "execution_count": 84, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "750\n", "1200\n" ] } ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "id": "b2sZeGn5GWRo" }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X_prepTime_resampled, y_prepTime_resampled, train_size=0.8, test_size=0.2, random_state=8)\n", "lr = LinearRegression()\n", 
"lr.fit(X_train, y_train)\n", "pred_lr = lr.predict(X_test)" ] }, { "cell_type": "code", "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X_cat_resampled, y_cat_resampled, train_size=0.8, test_size=0.2, stratify=y_cat_resampled, random_state=8)\n", "rfc = RandomForestClassifier(random_state=8)\n", "rfc.fit(X_train, y_train)\n", "pred_rfc = rfc.predict(X_test)" ], "metadata": { "id": "Ww3pFcsWdVS2" }, "execution_count": 86, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cygTDBfmF528" }, "source": [ "# Evaluation" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FdNX1cREGW4c", "outputId": "df57bb04-d340-4291-dade-2e382174cd56" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Regression Report LR\n", "0.9960472815490728\n" ] } ], "source": [ "# Note: this R^2 is computed on the full resampled set, which includes the training rows\n", "print(\"Regression Report LR\")\n", "print(lr.score(X_prepTime_resampled, y_prepTime_resampled))" ] }, { "cell_type": "code", "source": [ "from sklearn.metrics import classification_report\n", "print(\"Classification Report RFC\")\n", "print(classification_report(y_test, pred_rfc))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bnfJ09b6gKH_", "outputId": "afb51eef-eeaf-4a3e-8adc-527c6600ccde" }, "execution_count": 88, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Classification Report RFC\n", " precision recall f1-score support\n", "\n", "acompanamientos 0.91 0.75 0.82 40\n", " almuerzo 0.96 0.68 0.79 40\n", " ensaladas 0.76 0.93 0.83 40\n", " entradas 0.79 0.93 0.85 40\n", " postres 0.77 0.90 0.83 40\n", " reposteria 0.83 0.75 0.79 40\n", "\n", " accuracy 0.82 240\n", " macro avg 0.84 0.82 0.82 240\n", " weighted avg 0.84 0.82 0.82 240\n", "\n" ] } ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Z_YG9lagJhAu", 
"outputId": "303539c2-b9ce-429c-eeab-f33f09a141c2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[30, 0, 7, 3, 0, 0],\n", " [ 1, 27, 4, 6, 0, 2],\n", " [ 2, 0, 37, 1, 0, 0],\n", " [ 0, 1, 1, 37, 1, 0],\n", " [ 0, 0, 0, 0, 36, 4],\n", " [ 0, 0, 0, 0, 10, 30]])" ] }, "metadata": {}, "execution_count": 89 } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "# Confusion matrix\n", "confusion_matrix(y_test, pred_rfc)" ] }, { "cell_type": "markdown", "metadata": { "id": "qiQYb9K8cwHv" }, "source": [ "# Do you have any suggestions to improve this material?\n", "Write to me at bstnbas3@gmail.com and tell me what you would like to see in an auxiliar session like this one :)\n", "\n", "" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }