Curate entity identifiers#

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard. Bionty enables this by curating data against versioned ontologies using curate().

We’ll demonstrate this by first curating genes and then cell markers (CellMarker), where not all values can be mapped immediately.

Let’s start by importing the required modules from Bionty and Pandas.

from bionty import Gene, CellMarker, lookup
import pandas as pd

Curating genes#

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

df_orig

	gene symbol	hgnc id
ensembl_gene_id
ENSG00000148584	A1CF	HGNC:24086
ENSG00000121410	A1BG	HGNC:5
ENSG00000188389	FANCD1	HGNC:1101
corrupted	corrupted	corrupted

We require a reference identifier (specified via the reference_id parameter of curate). The available identifiers can be looked up using lookup. Examples are “ontology_id”, which corresponds to the IDs of the ontology terms (e.g. ‘ENSG00000148584’), or “name”, which corresponds to the ontology term names (e.g. ‘A1CF’).

lookup.gene_id

feature(mgi_id='mgi_id', ncbi_gene_id='ncbi_gene_id', omim_id='omim_id', symbol='symbol', hgnc_id='hgnc_id', gene_type='gene_type', ensembl_transcript_id='ensembl_transcript_id', synonyms='synonyms', ensembl_gene_id='ensembl_gene_id', description='description', ensembl_protein_id='ensembl_protein_id')

To curate the DataFrame into a queryable form, we create an index that corresponds to a reference identifier, which defaults to ensembl_gene_id. If no column name is provided, curate() curates the index.

Gene().curate(df_orig)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	gene symbol	hgnc id	orig_index	__curated__
ensembl_gene_id
ENSG00000148584	A1CF	HGNC:24086	ENSG00000148584	True
ENSG00000121410	A1BG	HGNC:5	ENSG00000121410	True
ENSG00000188389	FANCD1	HGNC:1101	ENSG00000188389	True
corrupted	corrupted	corrupted	corrupted	False

The curated DataFrame has now been reindexed by the curated gene identifiers. A new column, orig_index, contains the original index, and a new boolean column, __curated__, indicates whether each entry could be curated successfully.
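The __curated__ column makes it easy to separate mapped from unmapped entries with ordinary pandas indexing. The sketch below uses a hand-built stand-in for the DataFrame returned by curate() (an assumption made so the snippet runs without Bionty); in practice you would apply the same indexing to the returned object.

```python
import pandas as pd

# Hand-built stand-in for the curated DataFrame shown above (illustration only).
ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "corrupted"]
curated = pd.DataFrame(
    {
        "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
        "orig_index": ids,
        "__curated__": [True, True, True, False],
    },
    index=pd.Index(ids, name="ensembl_gene_id"),
)

# Boolean indexing on the __curated__ flag splits the table in two.
mapped = curated[curated["__curated__"]]
unmapped = curated[~curated["__curated__"]]

# The mean of a boolean column is the fraction of True values.
mapped_fraction = curated["__curated__"].mean()

print(unmapped["orig_index"].tolist())  # ['corrupted']
print(f"{mapped_fraction:.1%} mapped")  # 75.0% mapped
```

Keeping the unmapped rows in a separate DataFrame is a convenient starting point for deciding whether to fix or drop them.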

You may provide a column name to curate a specific column against a reference identifier.

Gene().curate(df_orig, column="hgnc id", reference_id=lookup.gene_id.hgnc_id)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	gene symbol	hgnc id	ensembl_gene_id	__curated__
hgnc_id
13736.0	A1CF	HGNC:24086	ENSG00000148584	True
24881.0	A1BG	HGNC:5	ENSG00000121410	True
17731.0	FANCD1	HGNC:1101	ENSG00000188389	True
corrupted	corrupted	corrupted	corrupted	False

When mapping symbols, the function automatically converts aliases into standardized symbols. For example, the alias PD-1 would be converted into PDCD1.

Gene().curate(df_orig, column="gene symbol", reference_id=lookup.gene_id.symbol)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	gene symbol	hgnc id	ensembl_gene_id	__curated__
symbol
13736.0	A1CF	HGNC:24086	ENSG00000148584	True
24881.0	A1BG	HGNC:5	ENSG00000121410	True
17731.0	FANCD1	HGNC:1101	ENSG00000188389	True
corrupted	corrupted	corrupted	corrupted	False

This is equivalent to:

Gene().curate(df_orig, column="gene symbol", reference_id="symbol")

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	gene symbol	hgnc id	ensembl_gene_id	__curated__
symbol
13736.0	A1CF	HGNC:24086	ENSG00000148584	True
24881.0	A1BG	HGNC:5	ENSG00000121410	True
17731.0	FANCD1	HGNC:1101	ENSG00000188389	True
corrupted	corrupted	corrupted	corrupted	False
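Conceptually, the alias conversion mentioned above amounts to a lookup from known synonyms to standardized symbols. The following is a minimal sketch of that idea, not Bionty’s implementation: the alias table, the set of standard symbols, and the helper standardize_symbol are all hypothetical and hard-coded for illustration.

```python
# Hypothetical, hard-coded tables for illustration only.
STANDARD_SYMBOLS = {"A1CF", "A1BG", "PDCD1"}
ALIASES = {"PD-1": "PDCD1"}  # alias -> standardized symbol


def standardize_symbol(symbol):
    """Return the standardized symbol, or None if it cannot be mapped."""
    if symbol in STANDARD_SYMBOLS:
        return symbol  # already standardized
    return ALIASES.get(symbol)  # None for unknown values


symbols = ["A1CF", "PD-1", "corrupted"]
standardized = [standardize_symbol(s) for s in symbols]
print(standardized)  # ['A1CF', 'PDCD1', None]
```

Values that are neither standard symbols nor known aliases come back as None, which mirrors the entries flagged as not mapped.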

Match (unmappable) cell markers to the reference#

Depending on how the data was collected and which terminology was used, it is not always possible to curate all values. Some values might follow a different standard or simply be corrupted.

This section demonstrates how to look up unmatched terms and curate them using the CellMarker entity. First, we create an example Pandas DataFrame containing a few valid and invalid cell markers (antibody targets) and features (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cell_marker = CellMarker()

First, we can have a look at the cell marker table that we just loaded.

df = cell_marker.df()

df.head()

	id	name	ncbi_gene_id	gene_symbol	gene_name	uniprotkb_id
0	CM_MERTK	MERTK	10461	MERTK	MER proto-oncogene, tyrosine kinase	Q12866
1	CM_CD16	CD16	2215	FCGR3A	Fc fragment of IgG receptor IIIb	O75015
2	CM_CD206	CD206	4360	MRC1	mannose receptor C-type 1	P22897
3	CM_CRIg	CRIg	11326	VSIG4	V-set and immunoglobulin domain containing 4	Q9Y279
4	CM_CD163	CD163	9332	CD163	CD163 molecule	Q86VB7

Now let’s check which cell markers from the DataFrame can be found in the reference. We do this using the .curate function:

cell_marker.curate(markers)

✅ 10 terms (71.4%) are mapped.

🔶 4 terms (28.6%) are not mapped.

	orig_index	__curated__
KI67	KI67	True
CCR7x	CCR7x	False
CD14	CD14	True
CD8	CD8	True
CD45RA	CD45RA	True
CD4	CD4	True
CD3	CD3	True
CD127	CD127	True
PD1	PD1	True
Invalid-1	Invalid-1	False
Invalid-2	Invalid-2	False
CD66b	CD66b	True
Siglec8	Siglec8	True
Time	Time	False

From the logging, it can be seen that 4 terms were not found in the reference!

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which won’t be curated by CellMarker.

However, some markers such as “CD66b” and “Siglec8” are valid but not purely upper-case.

The markers in the reference table are case-sensitive by default, so let’s try turning off case sensitivity:

cell_marker.curate(markers, case_sensitive=False)

✅ 10 terms (71.4%) are mapped.

🔶 4 terms (28.6%) are not mapped.

	orig_index	__curated__
KI67	KI67	True
CCR7X	CCR7x	False
CD14	CD14	True
CD8	CD8	True
CD45RA	CD45RA	True
CD4	CD4	True
CD3	CD3	True
CD127	CD127	True
PD1	PD1	True
INVALID-1	Invalid-1	False
INVALID-2	Invalid-2	False
CD66B	CD66b	True
SIGLEC8	Siglec8	True
TIME	Time	False

OK, we are left with 4 unmatched terms, 3 of which are non-markers!
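To make the effect of case-insensitive matching concrete, here is a small standard-library sketch. It is only an illustration of the general technique (normalizing both sides to upper case before comparing), not the code behind case_sensitive=False; the reference list and the helper match_case_insensitive are hypothetical.

```python
# Hypothetical reference names, for illustration only.
reference = ["CD66b", "Siglec8", "CCR7", "CD14"]

# Index the reference by its upper-cased form so lookups ignore case.
by_upper = {name.upper(): name for name in reference}


def match_case_insensitive(value):
    """Return the reference name matching value regardless of case, else None."""
    return by_upper.get(value.upper())


queries = ["cd66B", "SIGLEC8", "CCR7x", "Time"]
matches = {q: match_case_insensitive(q) for q in queries}
print(matches)
```

Note that ignoring case does not help with genuine typos such as CCR7x, which still fails to match.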

CCR7x is still not found, so let’s check the lookup with auto-completion:

lookup = cell_marker.lookup()

https://d33wubrfki0l68.cloudfront.net/eee08aab484a13dbaefc78633d1805ee61cd933c/8d864/_images/lookup_ccr7.png

lookup.CCR7

cell_marker(index=184, id='CM_CCR7', name='CCR7', ncbi_gene_id='1236', gene_symbol='CCR7', gene_name='C-C motif chemokine receptor 7', uniprotkb_id='P32248')

Indeed, the correct name is CCR7; CCR7x was a typo.
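When auto-completion is not available, for instance in a script, a fuzzy string match can suggest likely corrections for a typo. The sketch below uses difflib from the Python standard library; this is a swapped-in technique and not part of Bionty, and the list of reference names is a hypothetical subset typed in by hand.

```python
import difflib

# Hypothetical subset of reference marker names, for illustration only.
reference_names = ["KI67", "CCR7", "CD14", "CD8", "CD45RA", "CD4", "CD3", "CD127", "PD1"]

# Ask for up to three candidates whose similarity ratio is at least 0.8.
suggestions = difflib.get_close_matches("CCR7x", reference_names, n=3, cutoff=0.8)
print(suggestions)  # ['CCR7']
```

A fairly high cutoff keeps unrelated markers out of the suggestions; a suggested name should still be checked by eye before renaming anything.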

Now let’s fix the markers so all of them can be linked:

Tip

Using the .lookup instead of passing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CCR7x": lookup.CCR7.name})

OK, now we can run curate again and all cell markers are linked; only the three non-marker channels remain unmapped.

cell_marker.curate(curated_df)

✅ 11 terms (78.6%) are mapped.

🔶 3 terms (21.4%) are not mapped.

	orig_index	__curated__
KI67	KI67	True
CCR7	CCR7	True
CD14	CD14	True
CD8	CD8	True
CD45RA	CD45RA	True
CD4	CD4	True
CD3	CD3	True
CD127	CD127	True
PD1	PD1	True
Invalid-1	Invalid-1	False
Invalid-2	Invalid-2	False
CD66b	CD66b	True
Siglec8	Siglec8	True
Time	Time	False