DisGeNET
Dataset Description
DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. It integrates data from expert-curated repositories, GWAS catalogues, animal models, and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. TDC uses the curated subset drawn from UniProt, CGI, ClinGen, Genomics England, CTD (human subset), PsyGeNET, and Orphanet. TDC maps each disease ID to its disease definition through MedGen and each GeneID to its UniProt amino acid sequence.
Task Description
Regression. Given the disease description and the amino acid sequence of the gene, predict their association.
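To make the regression framing concrete, the sketch below scores a trivial mean-value baseline with mean absolute error on a few made-up records. The record layout (disease description, amino acid sequence, score), the toy values, and the choice of MAE as the metric are all illustrative assumptions; they are not the TDC schema or an official evaluation protocol.

```python
from statistics import mean

# Hypothetical toy records: (disease description, amino acid sequence, association score).
# All values are invented for illustration and are not taken from DisGeNET.
train = [
    ("toy disease A", "MKTAYIAK", 0.30),
    ("toy disease B", "MSLNQVRT", 0.50),
    ("toy disease C", "MPEELGHK", 0.70),
]
test = [
    ("toy disease D", "MAGWTRLS", 0.40),
    ("toy disease E", "MQDNPLKV", 0.80),
]


def mean_baseline(records):
    """Return a predictor that ignores its inputs and outputs the mean training score."""
    avg = mean(score for _, _, score in records)
    return lambda disease, gene_seq: avg


def mean_absolute_error(predict, records):
    """Average absolute gap between predicted and true association scores."""
    return mean(abs(predict(d, s) - y) for d, s, y in records)


predict = mean_baseline(train)
mae = mean_absolute_error(predict, test)
print(f"baseline MAE: {mae:.3f}")
```

Any real model would replace `mean_baseline` with something that actually reads the disease text and the sequence; a baseline like this is only useful as a floor to compare against.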
Dataset Statistics
52,476 gene-disease pairs, 7,399 genes, 7,095 diseases
Available Splits
Usage Example
from tdc_ml.multi_pred import GDA

data = GDA(name='DisGeNET')

# Access the data
df = data.get_data()
print(df.head())

# Get train/val/test splits
split = data.get_split()
print(split)
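For readers who want to see what a train/val/test partition of gene-disease pairs involves without downloading the dataset, here is a minimal standard-library sketch of a seeded random split. The helper `random_split`, the 70/10/20 fractions, the seed, and the `train`/`valid`/`test` key names are assumptions made for illustration; this card does not state which splits or defaults the loader provides.

```python
import random


def random_split(records, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle records with a fixed seed and cut them into three partitions.

    Hypothetical helper for illustration; not part of the library shown above.
    """
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("frac must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to the test set
    }


# Placeholder gene-disease pairs standing in for rows of the real table.
pairs = [(f"gene_{i}", f"disease_{i}") for i in range(20)]
toy_split = random_split(pairs)
print({name: len(part) for name, part in toy_split.items()})
```

A purely random split like this lets the same gene or disease appear in both training and test data, so it tends to give optimistic estimates when the goal is generalization to unseen genes or diseases.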
References
License
This dataset is licensed under CC BY-NC-SA 4.0.