COVIPEDIA
A recommendation system for navigating COVID-19 research articles with NLP and Unsupervised ML Topic Modeling
The goal of this project is to build a recommendation system that helps scientists and researchers navigate the current surge of papers about COVID-19, find what is relevant to their work, and uncover hidden semantic relationships. Using the COVID-19 Open Research Dataset, I took the abstracts of the subset of articles from January 2020 to May 2021 (about 260,000 articles) as the text for this project. With an LDA model, I assigned each document its dominant topic and its relevance to that topic, then grouped the articles by topic for the recommendation system, so researchers can look up articles by the topic related to their work. Lastly, I deployed a Streamlit app on Heroku, using a smaller dataset, that recommends the top 20 related articles for the selected topic.
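The grouping and recommendation step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes each document already has a topic distribution in the form of `(topic_id, probability)` pairs (the shape Gensim's `LdaModel.get_document_topics` returns), and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


def group_by_topic(doc_topics):
    """Map each topic_id to its documents as (doc_id, relevance), most relevant first."""
    groups = defaultdict(list)
    for doc_id, topic_probs in doc_topics.items():
        topic_id, prob = dominant_topic(topic_probs)
        groups[topic_id].append((doc_id, prob))
    for members in groups.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)


def recommend(groups, topic_id, top_n=20):
    """Return up to top_n document ids for the selected topic."""
    return [doc_id for doc_id, _ in groups.get(topic_id, [])[:top_n]]


# Made-up topic distributions for three hypothetical articles.
doc_topics = {
    "paper_a": [(0, 0.7), (1, 0.3)],
    "paper_b": [(0, 0.2), (1, 0.8)],
    "paper_c": [(0, 0.9), (1, 0.1)],
}
groups = group_by_topic(doc_topics)
print(recommend(groups, topic_id=0, top_n=2))  # ['paper_c', 'paper_a']
```

Sorting each group by relevance once, up front, keeps the lookup for a selected topic down to a simple slice.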
Tools
- Python (NumPy, Pandas)
- langdetect, regex
- spaCy, scispaCy ("en_core_sci_lg" model for biomedical, scientific, and clinical vocabulary)
- NLTK
- Gensim - LDA
- WordCloud
- Scikit-learn
- pyLDAvis
- Streamlit, Heroku
Techniques/Algorithms
- Text Preprocessing
- Data Transformation
- Topic Modeling
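As a rough illustration of the first two techniques, the sketch below cleans abstracts and transforms them into bag-of-words vectors, the `(token_id, count)` format that Gensim's LDA expects as input. It is a simplified stand-in under stated assumptions: a regex tokenizer and a tiny hypothetical stopword list replace the langdetect filtering and spaCy/scispaCy processing listed under Tools, and the sample abstracts are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "with", "on"}


def preprocess(text):
    """Lowercase, tokenize with a regex, and drop stopwords."""
    tokens = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Invented abstracts, for demonstration only.
abstracts = [
    "Transmission of the virus in hospitals.",
    "Vaccine efficacy and the transmission of variants.",
]
tokenized = [preprocess(a) for a in abstracts]
vocab = build_vocabulary(tokenized)
corpus = [to_bow(t, vocab) for t in tokenized]
print(corpus)
```

In a Gensim-based pipeline, `corpora.Dictionary` and its `doc2bow` method would typically take the place of `build_vocabulary` and `to_bow`.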
Application Usage
The model was built into a web application with a smaller dataset (due to the size limit on GitHub) for demo purposes.
To Learn More, Check Out My:
Note: The app can take a while to load... please be patient :)