Commit cd0ead6c authored by marioromera

Analyze text extracting tf, idf and tf-idf

parent 957cf614
Although the code works, many techniques still remain to be explored and optimized. Even so, the results are not that enriching, for the following reasons:
-It requires downloading a huge number of files
-It takes quite a lot of energy to process that much data
\ No newline at end of file
@@ -5,8 +5,6 @@ from os import walk
from datetime import timedelta, date
import io
from pdf_to_txt import convert_pdf_to_txt
# Returns an array containing all dates in between start and end
def daterange(start, end):
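
The body of `daterange` is not shown in this hunk. A minimal sketch of what the comment describes could look like the following, assuming `start` and `end` are `datetime.date` objects and that both endpoints are included (the inclusive end is an assumption, not something the diff confirms):

```python
from datetime import timedelta, date


def daterange(start, end):
    """Return a list of every date from start to end (both inclusive, by assumption)."""
    n_days = (end - start).days
    # range(n_days + 1) is empty when end is before start, so the result is [] in that case
    return [start + timedelta(days=offset) for offset in range(n_days + 1)]


# Hypothetical example dates, chosen to cross a month boundary
days = daterange(date(2020, 1, 30), date(2020, 2, 2))
for day in days:
    print(day.isoformat())
```
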
@@ -6,7 +6,7 @@ from textacy import preprocessing
import textacy.vsm
# Folder from where files will be loaded
-pdfs_folder_path = "./downloaded_files/texts_to_debug"
+pdfs_folder_path = "./downloaded_files/texts"
spacy_stopwords =
@@ -53,8 +53,8 @@ def get_tokenize_docs(documents):
# Input word to search for, in the future can be a list
-searched_word = "concurso".lower()
-print("Finding needle in the haystack")
+searched_word = "google".lower()
+print(f"Finding {searched_word} needle in the haystack")
# Creates a matrix of all the terms in the documents: first creates the vectorizer, then fills it with the vocabulary
vectorizer = textacy.vsm.Vectorizer(apply_idf=False, norm=None)
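
The commit title mentions tf, idf and tf-idf, but the rest of the computation is outside the visible hunk. The sketch below illustrates the three quantities for a single searched word using only the standard library. It is a stand-in, not the textacy implementation: the helper `tf_idf_for_term`, the sample documents, and the plain `log(N / df)` idf formula are all assumptions made for illustration, and textacy's own idf weighting may differ.

```python
import math
from collections import Counter


def tf_idf_for_term(tokenized_docs, term):
    """Return a (tf, idf, tf-idf) tuple per document for one term.

    tokenized_docs: list of token lists, one per document (hypothetical input shape).
    """
    n_docs = len(tokenized_docs)
    # Raw term frequency: how many times the term occurs in each document
    tfs = [Counter(doc)[term] for doc in tokenized_docs]
    # Document frequency: number of documents that contain the term at least once
    df = sum(1 for tf in tfs if tf > 0)
    # Plain idf = log(N / df); fall back to 0.0 when the term never occurs
    idf = math.log(n_docs / df) if df else 0.0
    return [(tf, idf, tf * idf) for tf in tfs]


# Hypothetical, already tokenized and lowercased documents
docs = [
    ["google", "wins", "concurso"],
    ["concurso", "publico"],
    ["google", "google", "contract"],
    ["budget", "report"],
]
searched_word = "google".lower()
scores = tf_idf_for_term(docs, searched_word)
for i, (tf, idf, tfidf) in enumerate(scores):
    print(f"doc {i}: tf={tf} idf={idf:.3f} tf-idf={tfidf:.3f}")
```

A word that appears in every document gets an idf of 0 under this formula, so its tf-idf is 0 everywhere. That is the intended effect: terms common to the whole corpus do not help single out a document.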