Look up records of species, genes, proteins, and cell markers#
Biological entities and ontologies can be complex: a single entity may carry many different identifiers, and reference tables differ between species.
Here we show Bionty’s Entity model for species, genes, proteins, and cell markers. You’ll see how to:
- initialize an Entity model with different identifiers
- access the reference table via .df()
- look up an entity record via .lookup().{term}
import bionty as bt
Species#
To examine the Species ontology we create the corresponding object and look at the associated Pandas DataFrame.
species = bt.Species()
Reference table#
df = species.df()
df.head()
index | id | name | scientific_name | division | taxon_id | assembly | assembly_accession | genebuild | variation | microarray | pan_compara | peptide_compara | genome_alignments | other_alignments | core_db | species_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NCBI_80966 | spiny chromis | acanthochromis_polyacanthus | EnsemblVertebrates | 80966 | ASM210954v1 | GCA_002109545.1 | 2018-05-Ensembl/2020-03 | N | N | N | Y | Y | Y | acanthochromis_polyacanthus_core_108_1 | 1 |
1 | NCBI_211598 | eurasian sparrowhawk | accipiter_nisus | EnsemblVertebrates | 211598 | Accipiter_nisus_ver1.0 | GCA_004320145.1 | 2019-07-Ensembl/2019-09 | N | N | N | N | N | Y | accipiter_nisus_core_108_1 | 1 |
2 | NCBI_9646 | giant panda | ailuropoda_melanoleuca | EnsemblVertebrates | 9646 | ASM200744v2 | GCA_002007445.2 | 2020-05-Ensembl/2020-06 | N | N | N | Y | Y | Y | ailuropoda_melanoleuca_core_108_2 | 1 |
3 | NCBI_241587 | yellow-billed parrot | amazona_collaria | EnsemblVertebrates | 241587 | ASM394721v1 | GCA_003947215.1 | 2019-07-Ensembl/2019-09 | N | N | N | N | N | Y | amazona_collaria_core_108_1 | 1 |
4 | NCBI_61819 | midas cichlid | amphilophus_citrinellus | EnsemblVertebrates | 61819 | Midas_v5 | GCA_000751415.1 | 2018-05-Ensembl/2018-07 | N | N | N | Y | Y | Y | amphilophus_citrinellus_core_108_5 | 1 |
Lookup records#
Terms can be searched with auto-complete using a lookup object:
Tip
By default, the name field is used to generate the lookup. You can change the field via:
species.lookup_field = <new field>
Duplicate terms are made unique by appending __0, __1, __2, …
lookup = species.lookup()
lookup.white_tufted_ear_marmoset
species(index=37, id='NCBI_9483', name='white-tufted-ear marmoset', scientific_name='callithrix_jacchus', division='EnsemblVertebrates', taxon_id=9483, assembly='mCalJac1.pat.X', assembly_accession='GCA_011100555.1', genebuild='2020-08-Ensembl/2020-11', variation='N', microarray='Y', pan_compara='N', peptide_compara='Y', genome_alignments='Y', other_alignments='Y', core_db='callithrix_jacchus_core_108_1', species_id=1)
lookup.white_tufted_ear_marmoset.scientific_name
'callithrix_jacchus'
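The lookup behaviour described in the tip can be approximated in a few lines of plain Python. The following is only a sketch of the idea, not Bionty's implementation: the helper names `to_identifier` and `build_lookup` are hypothetical, the rule for turning a term into an attribute name is an assumption inferred from the examples on this page, and the two "example term" records are made up to demonstrate the `__0`, `__1` suffixes.

```python
import re
from collections import Counter
from types import SimpleNamespace


def to_identifier(term: str) -> str:
    """Replace runs of non-word characters with "_" (assumed naming rule)."""
    return re.sub(r"\W+", "_", term.strip()).strip("_")


def build_lookup(records: list, field: str = "name") -> SimpleNamespace:
    """Return an object whose attributes are the records, keyed on `field`."""
    keys = [to_identifier(str(record[field])) for record in records]
    totals = Counter(keys)  # how often each key occurs overall
    seen = Counter()        # how often each key has been emitted so far
    unique_keys = []
    for key in keys:
        if totals[key] > 1:
            # Duplicated key: append __0, __1, __2, ... in order of appearance.
            unique_keys.append(f"{key}__{seen[key]}")
            seen[key] += 1
        else:
            unique_keys.append(key)
    return SimpleNamespace(**dict(zip(unique_keys, records)))


records = [
    # Values copied from the species table on this page.
    {"id": "NCBI_9483", "name": "white-tufted-ear marmoset",
     "scientific_name": "callithrix_jacchus"},
    {"id": "NCBI_9646", "name": "giant panda",
     "scientific_name": "ailuropoda_melanoleuca"},
    # Hypothetical duplicates, included only to show the suffixing.
    {"id": "EX_1", "name": "example term", "scientific_name": "example_one"},
    {"id": "EX_2", "name": "example term", "scientific_name": "example_two"},
]

lookup = build_lookup(records)
print(lookup.white_tufted_ear_marmoset["scientific_name"])
print(lookup.example_term__0["id"], lookup.example_term__1["id"])

# Keying on another field mirrors changing the lookup field.
by_scientific_name = build_lookup(records, field="scientific_name")
print(by_scientific_name.callithrix_jacchus["name"])
```

Because the attributes are ordinary Python names, an interactive shell or notebook can auto-complete them, which is what makes this style of lookup convenient for exploration.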
To access the records for several species at once, for example human, mouse, and pig, we select the corresponding rows with pandas:
df = species.df()
df.set_index("name", inplace=True)
df.loc[["human", "mouse", "pig"]]
name | id | scientific_name | division | taxon_id | assembly | assembly_accession | genebuild | variation | microarray | pan_compara | peptide_compara | genome_alignments | other_alignments | core_db | species_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
human | NCBI_9606 | homo_sapiens | EnsemblVertebrates | 9606 | GRCh38.p13 | GCA_000001405.28 | 2014-01-Ensembl/2022-07 | Y | Y | Y | Y | Y | Y | homo_sapiens_core_108_38 | 1 |
mouse | NCBI_10090 | mus_musculus | EnsemblVertebrates | 10090 | GRCm39 | GCA_000001635.9 | 2020-08-Ensembl/2022-07 | Y | Y | Y | Y | Y | Y | mus_musculus_core_108_39 | 1 |
pig | NCBI_9823 | sus_scrofa | EnsemblVertebrates | 9823 | Sscrofa11.1 | GCA_000003025.6 | 2021-09-Ensembl/2022-02 | Y | Y | N | Y | Y | Y | sus_scrofa_core_108_111 | 1 |
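When the list of names comes from user input, some entries may be absent from the reference table, and `.loc` with a list raises a `KeyError` if any label is missing. A minimal sketch of a more forgiving selection follows. It uses a small hand-built DataFrame with values copied from the table above instead of calling Bionty, and "zebrafish" serves only as a name that is absent from this toy table.

```python
import pandas as pd

# A few columns and rows copied from the species reference table above.
df = pd.DataFrame(
    {
        "id": ["NCBI_9606", "NCBI_10090", "NCBI_9823"],
        "name": ["human", "mouse", "pig"],
        "scientific_name": ["homo_sapiens", "mus_musculus", "sus_scrofa"],
    }
)

# "zebrafish" is deliberately not part of this toy table.
wanted = ["human", "mouse", "zebrafish"]

# set_index without inplace=True returns a new frame and leaves df unchanged.
by_name = df.set_index("name")

# reindex keeps the requested order and fills unknown names with NaN
# instead of raising a KeyError.
selected = by_name.reindex(wanted)

# Names whose row is empty were not found in the table.
missing = selected[selected["id"].isna()].index.tolist()

print(selected)
print("not found:", missing)
```

Avoiding `inplace=True` also means the original `df` still has its `name` column, so the same cell can be re-run without errors.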
Gene#
Next, let’s take a look at genes, which follow the same design as Species.
The only difference is that the Gene class is initialized with a species parameter, so you only retrieve gene entries for the specified species.
gene = bt.Gene(species="human")
df = gene.df()
df.head()
index | id | ensembl_gene_id | symbol | gene_type | description | ncbi_gene_id | hgnc_id | omim_id | synonyms | version |
---|---|---|---|---|---|---|---|---|---|---|
0 | Lzl9xt | ENSG00000210049 | MT-TF | Mt_tRNA | mitochondrially encoded tRNA-Phe (UUU/C) [Sour... | None | HGNC:7481 | None | MTTF|trnF | Ens107 |
1 | ILAWa7 | ENSG00000211459 | MT-RNR1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:HGNC ... | None | HGNC:7470 | None | 12S|MOTS-c|MTRNR1 | Ens107 |
2 | XkyeQz | ENSG00000210077 | MT-TV | Mt_tRNA | mitochondrially encoded tRNA-Val (GUN) [Source... | None | HGNC:7500 | None | MTTV|trnV | Ens107 |
3 | jDD2jW | ENSG00000210082 | MT-RNR2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:HGNC ... | None | HGNC:7471 | None | 16S|HN|MTRNR2 | Ens107 |
4 | J58H9b | ENSG00000209082 | MT-TL1 | Mt_tRNA | mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... | None | HGNC:7490 | None | MTTL1|TRNL1 | Ens107 |
lookup = gene.lookup()
lookup.TCF7
gene(index=20388, id='sXCrmQ', ensembl_gene_id='ENSG00000081059', symbol='TCF7', gene_type='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', ncbi_gene_id='6932', hgnc_id='HGNC:11639', omim_id='189908', synonyms='TCF-1', version='Ens107')
To convert between identifiers, filter the reference table with pandas:
df.loc[df["symbol"].isin(["BRCA1", "BRCA2"])]
index | id | ensembl_gene_id | symbol | gene_type | description | ncbi_gene_id | hgnc_id | omim_id | synonyms | version |
---|---|---|---|---|---|---|---|---|---|---|
17731 | nLEreh | ENSG00000139618 | BRCA2 | protein_coding | BRCA2 DNA repair associated [Source:HGNC Symbo... | 675 | HGNC:1101 | 600185 | BRCC2|FACD|FAD|FAD1|FANCD|FANCD1|XRCC11 | Ens107 |
63779 | 9FY8yO | ENSG00000012048 | BRCA1 | protein_coding | BRCA1 DNA repair associated [Source:HGNC Symbo... | 672 | HGNC:1100 | 113705 | BRCC1|FANCS|PPP1R53|RNF53 | Ens107 |
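For converting many identifiers at once, a dictionary built from two columns of the reference table is often handier than repeated filtering. The sketch below is illustrative only: it uses a hand-built DataFrame with three rows copied from this page instead of the full table, it assumes that gene symbols are unique (which may not hold for the complete reference), and `NOT_A_GENE` is a made-up placeholder.

```python
import pandas as pd

# Three rows copied from the human gene reference shown in this section.
df = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000139618", "ENSG00000012048", "ENSG00000081059"],
        "symbol": ["BRCA2", "BRCA1", "TCF7"],
        "ncbi_gene_id": ["675", "672", "6932"],
        "hgnc_id": ["HGNC:1101", "HGNC:1100", "HGNC:11639"],
    }
)

# symbol -> Ensembl gene ID; assumes each symbol occurs only once.
symbol_to_ensembl = df.set_index("symbol")["ensembl_gene_id"].to_dict()

# Translate a list of symbols; unknown symbols become None instead of raising.
symbols = ["BRCA1", "BRCA2", "NOT_A_GENE"]  # last entry is a made-up placeholder
ensembl_ids = [symbol_to_ensembl.get(symbol) for symbol in symbols]

print(ensembl_ids)
```

The same pattern works for any pair of columns, for example `hgnc_id` to `ncbi_gene_id`, by changing the column names passed to `set_index` and the selection.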
The mouse reference is also available from Ensembl:
gene = bt.Gene(species="mouse")
df = gene.df()
df.head()
index | id | ensembl_gene_id | symbol | gene_type | description | ncbi_gene_id | mgi_id | synonyms | version |
---|---|---|---|---|---|---|---|---|---|
0 | Epd98t | ENSMUSG00000064336 | mt-Tf | Mt_tRNA | mitochondrially encoded tRNA phenylalanine [So... | None | MGI:102487 | tRNA|tRNA-Phe|TrnF tRNA | Ens107 |
1 | RiOxA6 | ENSMUSG00000064337 | mt-Rnr1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:MGI S... | None | MGI:102493 | 12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA | Ens107 |
2 | cMIElg | ENSMUSG00000064338 | mt-Tv | Mt_tRNA | mitochondrially encoded tRNA valine [Source:MG... | None | MGI:102472 | tRNA|tRNA-Val|TrnaV tRNA | Ens107 |
3 | DbiNNA | ENSMUSG00000064339 | mt-Rnr2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:MGI S... | None | MGI:102492 | 16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri... | Ens107 |
4 | NO6NBF | ENSMUSG00000064340 | mt-Tl1 | Mt_tRNA | mitochondrially encoded tRNA leucine 1 [Source... | None | MGI:102482 | tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA | Ens107 |
Protein#
The protein reference uses the UniProt ID as the standardized identifier.
protein = bt.Protein(species="human")
lookup = protein.lookup()
lookup.ABC_transporter_domain_containing_protein
protein(index=197375, id='7Hevwtc', uniprotkb_id='Q9BV39', uniprotkb_name='Q9BV39_HUMAN', synonyms='ABC transporter domain-containing protein', length=316, species_id=9606, gene_symbols=None, gene_synonyms=None, ensembl_transcript_ids=None, ncbi_gene_ids=None, name='ABC transporter domain-containing protein')
df = protein.df()
df.head()
index | id | uniprotkb_id | uniprotkb_name | synonyms | length | species_id | gene_symbols | gene_synonyms | ensembl_transcript_ids | ncbi_gene_ids | name |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1zrr8Wy | A0A024QZ08 | A0A024QZ08_HUMAN | Intraflagellar transport 20 homolog (Chlamydom... | 132 | 9606 | IFT20 | None | None | 90410; | isoform CRA_c |
1 | xNgxtFu | A0A024QZ86 | A0A024QZ86_HUMAN | T-box 2|isoform CRA_a | 712 | 9606 | TBX2 | None | None | 6909; | T-box 2 |
2 | X9K8OgK | A0A024QZA8 | A0A024QZA8_HUMAN | Receptor protein-tyrosine kinase|EC 2.7.10.1 | 976 | 9606 | EPHA2 | None | None | 1969; | EC 2.7.10.1 |
3 | 8jW9Ci4 | A0A024QZB8 | A0A024QZB8_HUMAN | Battenin | 438 | 9606 | CLN3 | None | None | 1201; | Battenin |
4 | nZNsA6F | A0A024QZQ1 | A0A024QZQ1_HUMAN | Sirtuin (Silent mating type information regula... | 747 | 9606 | SIRT1 | None | None | 23411; | isoform CRA_a |
Cell marker#
The cell marker ontology works similarly.
cell_marker = bt.CellMarker(species="human")
df = cell_marker.df()
df.head()
index | id | name | ncbi_gene_id | gene_symbol | gene_name | uniprotkb_id |
---|---|---|---|---|---|---|
0 | CM_MERTK | MERTK | 10461 | MERTK | MER proto-oncogene, tyrosine kinase | Q12866 |
1 | CM_CD16 | CD16 | 2215 | FCGR3A | Fc fragment of IgG receptor IIIb | O75015 |
2 | CM_CD206 | CD206 | 4360 | MRC1 | mannose receptor C-type 1 | P22897 |
3 | CM_CRIg | CRIg | 11326 | VSIG4 | V-set and immunoglobulin domain containing 4 | Q9Y279 |
4 | CM_CD163 | CD163 | 9332 | CD163 | CD163 molecule | Q86VB7 |
lookup = cell_marker.lookup()
lookup.CD45
cell_marker(index=35, id='CM_CD45', name='CD45', ncbi_gene_id='5788', gene_symbol='PTPRC', gene_name='protein tyrosine phosphatase receptor type C', uniprotkb_id='M9MML4')