Exploration and Discovery of the Covid-19 Literature through Semantic Visualization
Jingxuan Tu, Marc Verhagen, Kyeongmin Rim, Kelley Lynch, Peter Anick, Nikhil Krishnaswamy, James Pustejovsky (Brandeis)
Keith Suderman, Nancy Ide (Vassar)
Brent Cochran (Tufts)
We are developing semantic visualization techniques to enhance exploration and enable discovery over large datasets of complex networks of relations. Semantic visualization does this by exploiting the semantics of the relations in the data. It involves (i) applying NLP to extract named entities, relations and knowledge graphs from the original data; (ii) indexing the output and creating representations for all relevant entities and relations that can be visualized in many different ways, such as word clouds, heat maps and graphs; and (iii) applying parameter reduction operations to the extracted relations, creating “relation containers,” or functional entities, that can be visualized using the same methods. This allows the visualization of multiple relations and partial pathways, and exploration across multiple dimensions. Our hope is that this will enable the discovery of novel inferences over relations in complex data that would otherwise go unnoticed. We have applied this approach to the analysis of the recently released CORD-19 dataset.
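As a rough illustration of step (iii), the sketch below shows one possible reading of parameter reduction: the relation and its object are fixed, the subject is left open, and the resulting "relation container" is then cross-tabulated against a second relation to produce heat-map-ready counts. The triples, entity names (`chem_A`, `gene_X`, `disease_1`) and function names are hypothetical placeholders chosen for illustration; they are not taken from CORD-19 or from the SemViz implementation.

```python
from collections import Counter, defaultdict

# Toy (subject, relation, object) triples. All names are made-up
# placeholders, not real biomedical findings.
TRIPLES = [
    ("chem_A", "inhibits", "gene_X"),
    ("chem_B", "inhibits", "gene_X"),
    ("chem_C", "inhibits", "gene_Y"),
    ("chem_A", "associated_with", "disease_1"),
    ("chem_B", "associated_with", "disease_1"),
    ("chem_B", "associated_with", "disease_2"),
    ("chem_C", "associated_with", "disease_2"),
]


def relation_containers(triples, relation):
    """Assumed form of parameter reduction: bind the relation and its
    object, leaving the subject open. For example, the triple
    ("chem_A", "inhibits", "gene_X") puts chem_A into the container
    "inhibits(gene_X)"."""
    containers = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            containers[f"{rel}({obj})"].add(subj)
    return dict(containers)


def cross_tabulate(containers, triples, relation):
    """For each container, count how many of its members stand in
    `relation` to each object. The nested dict can feed a heat map."""
    counts = defaultdict(Counter)
    for name, members in containers.items():
        for subj, rel, obj in triples:
            if rel == relation and subj in members:
                counts[name][obj] += 1
    return {name: dict(c) for name, c in counts.items()}


containers = relation_containers(TRIPLES, "inhibits")
heatmap = cross_tabulate(containers, TRIPLES, "associated_with")
for name in sorted(heatmap):
    print(name, heatmap[name])
```

Each row of the resulting table is a class of chemicals (for instance, everything that inhibits `gene_X`) rather than a single chemical, which is what lets a heat map relate one relation to another.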
Check out the SemViz tutorial on how to navigate the INDRA protein-protein dataset, prepared by Prof. Brent Cochran of Tufts University.
If you’re asked to log in to view the visualization dashboards linked below, use the username semviz (all lower case).
- Semantic Visualization of Heng Ji’s Blender Lab Covid-19 Knowledge Graphs from University of Illinois.
- This dashboard shows the semantic visualization of the relations in the CORD-19 dataset that Ji’s group encoded as knowledge graphs over chemical-gene, chemical-disease, and gene-disease pairs. In addition, parameter reduction has been performed on some relations in order to produce heat maps encoding relations between relations. For example, by encoding a known chemical as a member of a specific gene inhibitor class, this class can be cross-correlated with diseases known to interact with just the chemical. This provides a capability for generating inferences over groups or classes of gene or chemical types and their behaviors. Information on navigating the Blender dashboard
- Semantic Visualization of Harvard Protein-protein-causal-assertions (PPCA) dataset.
- This dashboard demonstrates several visualizations over data extracted from 32,000 of the 52,000 articles in the CORD-19 dataset by multiple machine reading systems, including REACH (University of Arizona) and Sparser (Smart Information Flow Technologies). The extracted events were assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA), developed at Harvard Medical School. Information on navigating the Harvard dashboard
- LAPPSGrid Covid-QA System
- The Language Applications (LAPPS) Grid AskMe document retrieval service provides sophisticated, customizable search capabilities over 14,000 full-text articles related to COVID-19 research. The data currently include all articles from the CORD-19 dataset, after eliminating duplicates, empty article files, etc. Searches over PubMed, PMC, bioRxiv and medRxiv will be available soon. Updates will be made weekly or whenever new data warrant it.
This research is supported in part by DARPA grants FA8750-18-2-0016 and W911NF-15-C-0238; DTRA grant DTRA-16-1-0002; NSF EAGER grant 1811402; and Andrew W. Mellon Foundation grants G-1901-06505 (with University of Tübingen and Charles University) and G-1810-06248 (with WGBH Media Library and Archive).