Commit e233e7f9 authored by marioromera's avatar marioromera

Add documentation

parent d7fc477e
Pipeline #706 failed with stages
@@ -7,6 +7,11 @@ To make it work:
1. install with `pip install -r requirements.txt`
2. run `python get_files_boe.py` (inside that file you can change the date range)
3. run `python text_analyzer.py` (the keywords to search for are hard-coded there; change them if you like)
4. or run both at once: `python get_files_boe.py && python text_analyzer.py`
There are a huge number of BOE documents per day, so I suggest you try a small date range, or cancel get_files_boe.py (Ctrl + C)
and then run the analyzer.
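The Ctrl + C suggestion above works because interrupting the downloader still leaves the already-fetched files on disk for the analyzer. A minimal sketch of that pattern (a hypothetical helper, not the repo's actual code):

```python
# Sketch: stop the download loop cleanly on Ctrl+C and keep partial results,
# so the analyzer can still run on whatever was fetched.
def download_all(dates, fetch):
    fetched = []
    try:
        for d in dates:
            fetched.append(fetch(d))
    except KeyboardInterrupt:
        pass  # interrupted: keep the partial corpus
    return fetched
```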
Improvements/Next steps:
- Download all files
- Create a server with all documents indexed, and an API to query against
@@ -6,7 +6,7 @@ from datetime import timedelta, date
import io
# Input dates to download all files published in between
-start_date = date(2020, 5, 25)
+start_date = date(2020, 1, 1)
end_date = date(2020, 5, 28)
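Widening `start_date` to January 1st multiplies the number of days (and downloads) considerably. A common stdlib-only way to iterate the range, as this script presumably does (the helper name is an assumption, not necessarily the repo's):

```python
from datetime import date, timedelta

def daterange(start_date, end_date):
    """Yield every date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(days=n)

start_date = date(2020, 1, 1)
end_date = date(2020, 5, 28)
days = list(daterange(start_date, end_date))  # one download pass per day
```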
@@ -6,7 +6,7 @@ from textacy import preprocessing
import textacy.vsm
import csv
-# Input words to search for
+# Input word to search for; in the future this can be a list
searched_words = ["google", "amazon", "netflix", "facebook", "microsoft",
                  "apple", "vodafone", "alibaba", "tesla", "twitter",
                  "telefónica", "indra",
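Since the script later lowercases each searched word, matching is case-insensitive. A small sketch of that idea over tokenized text (using only a subset of the list above; `count_keywords` is a hypothetical helper, not the repo's code):

```python
from collections import Counter

def count_keywords(tokens, keywords):
    """Count case-insensitive occurrences of each keyword in a token list."""
    counts = Counter(t.lower() for t in tokens)
    return {w: counts[w.lower()] for w in keywords}
```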
@@ -66,7 +66,7 @@ doc_term_matrix = vectorizer.fit_transform(get_tokenize_docs(corpus))
with open('results.csv', 'w', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile)
-    writer.writerow(["word", "tf", "idf", "pseudo tfidf"])
+    writer.writerow(["word", "tf", "idf", "tfidf"])
for searched_word in searched_words:
searched_word = searched_word.lower()
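The textacy vectorizer computes the actual matrix; as a rough illustration of the three columns written to results.csv, here is a pure-Python sketch of tf, idf and their product over tokenized docs (a simplified formula, not necessarily textacy's exact weighting):

```python
import csv
import io
import math

def tfidf_rows(docs, keywords):
    """One [word, tf, idf, tfidf] row per keyword, over tokenized docs."""
    n_docs = len(docs)
    rows = []
    for w in keywords:
        w = w.lower()
        tf = sum(doc.count(w) for doc in docs)       # raw count over the corpus
        df = sum(1 for doc in docs if w in doc)      # docs containing the word
        idf = math.log(n_docs / df) if df else 0.0   # plain log ratio
        rows.append([w, tf, idf, tf * idf])
    return rows

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["word", "tf", "idf", "tfidf"])
writer.writerows(tfidf_rows([["google", "boe"], ["boe"]], ["google"]))
```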