Finding Cell-Cell Communication groups using scACCorDiON

James S. Nagai

02.05.2023

Note: Before using ACCorDIoN make sure that you have the CrossTalkeR installed in your local R enviroment(github)

Here we introduce scACCorDIoN. In this notebook, we guide you into the framework steps. %matplotlib inline

Load Libraries

[1]:

from scaccordion import tl as actl
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pydiffmap as dm
import pandas as pd
import networkx as nx
from skimpy import skim
from sklearn import covariance
import conorm
import phate
import warnings
import kmedoids
import numpy as np
from tqdm import tqdm
import ot
import os
from sklearn import manifold
import itertools as it

np.random.seed(859)
warnings.filterwarnings("ignore")

/home/james/.local/lib/python3.10/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.2' currently installed).
  from pandas.core import (

1. Loading data and metadata

Here, we explore the Pancreas Adenocarcinoma Cohort(Peng et.al.,2019). The data available in scACCorDiON were pre-processed. The Ligand-Receptor inference were performed using the CellphoneDB methods (Efremova et. al., 2020) implemented in LIANA(Dimitrov et. al.,2022). Networks were generated using CrossTalkeR(Nagai et. al., 2021)

[2]:

! wget https://costalab.org/wp-content/uploads/2025/01/PDAC_Data.zip

--2025-01-29 14:34:24--  https://costalab.org/wp-content/uploads/2025/01/PDAC_Data.zip
Resolving costalab.org (costalab.org)... 195.30.85.240
Connecting to costalab.org (costalab.org)|195.30.85.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4004362 (3.8M) [application/zip]
Saving to: ‘PDAC_Data.zip’

PDAC_Data.zip       100%[===================>]   3.82M  --.-KB/s    in 0.1s

2025-01-29 14:34:24 (30.8 MB/s) - ‘PDAC_Data.zip’ saved [4004362/4004362]

[3]:

! unzip  PDAC_Data.zip

Archive:  PDAC_Data.zip
   creating: PDAC_Data/
   creating: PDAC_Data/LR_data/
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N1.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N10.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N11.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N2.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N3.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N4.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N5.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N6.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N7.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N8.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N9.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T1.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T10.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T11.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T12.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T13.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T14.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T15.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T16.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T17.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T18.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T19.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T2.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T20.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T21.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T22.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T23.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T24.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T3.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T4.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T5.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T6.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T7.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T8.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T9.csv
  inflating: PDAC_Data/Peng_PDAC_metaacc.h5ad

1. Read the data and selecting the tables

[4]:

## Reading the data
#path = "/home/james/sciebo/Zenodo_scACCordion/data/ctkerout/Peng_PDAC_processed/"
path = "./PDAC_Data/LR_data/"
#print(lines)
data = {}
for idi in enumerate(os.listdir(path)):
    if idi[1].endswith(".csv"):
            #print(idi)
            i = idi[1]
            data[i[i.find("|")+1:i.find(".csv")]] = pd.read_csv(path+idi[1])
print(data.keys())
## Reading metadata
pdata_metadata = pd.read_hdf("./PDAC_Data/Peng_PDAC_metaacc.h5ad")

dict_keys(['T14', 'N1', 'T12', 'T2', 'T23', 'T22', 'N5', 'N8', 'T6', 'N3', 'T4', 'N11', 'T18', 'T9', 'N2', 'T21', 'T11', 'T16', 'T17', 'T3', 'N10', 'T5', 'N9', 'T15', 'N6', 'T10', 'N7', 'T20', 'T7', 'T24', 'T19', 'T8', 'N4', 'T1', 'T13'])

1.1 Original Labels

[5]:

pdata_metadata.groupby('accLabel').size()

[5]:

accLabel
normal    11
pdac      24
dtype: int64

1.2 Table Format

[6]:

data['N1'].head()

[6]:

	Unnamed: 0	ligand	receptor	source	target	lr_means	type_gene_A	type_gene_B	gene_A	gene_B	MeanLR
0	0	INS	INSR	Endocrine cell	Endothelial cell	4.543838	Ligand	Receptor	INS	INSR	4.543838
1	1	INS	INSR	Endocrine cell	Acinar cell	4.365789	Ligand	Receptor	INS	INSR	4.365789
2	2	INS	INSR	Endocrine cell	Ductal cell type 1	4.270591	Ligand	Receptor	INS	INSR	4.270591
3	3	INS	IGF1R	Endocrine cell	Endocrine cell	4.250901	Ligand	Receptor	INS	IGF1R	4.250901
4	4	INS	INSR	Endocrine cell	Fibroblast cell	4.248840	Ligand	Receptor	INS	INSR	4.248840

2. Now with the tables from CrossTalkeR/Liana we build the Accordion Object

[7]:

def norm(x):
    '''
        Defining the normalization function.
    '''
    nf = conorm.tmm(x+1e-6,trim_lfc=0,trim_mag=0)
    return (nf)

Creating an Accordion Object

[8]:

AaccPDAC = actl.Accordion(tbls=data,weight='lr_means',normf=norm,filter=0.2)

3. Computing a PCA using the edge distribuition per sample

[9]:

AaccPDAC.make_pca()
sns.scatterplot(x=AaccPDAC.Cs['PCA'][0],y=AaccPDAC.Cs['PCA'][1],
                hue=pdata_metadata.loc[AaccPDAC.Cs['PCA'].index,'accLabel'])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel("PC1")
plt.ylabel("PC2")

[9]:

Text(0, 0.5, 'PC2')

../_images/notebooks_PDAC-Learning-distances-and-Clustering_18_1.png

4 Optimal Transport: Wasserstein

In this step we compute the graph-to-graph distance solving the Optimal Transport minimization problem. Note that, the Optimal Transport is done in two main steps: (1) we use the function compute_cost() to compute an edge-to-edge distance and (2) we use the edge weights and the cost to get a graph-to-graph distance matrix using compute_wasserstein()

AaccPDAC.compute_cost_all() ## Run all the possible cost matrices

[10]:

AaccPDAC.compute_cost(mode='HTD') # Computing the cost
AaccPDAC.compute_wassestein(cost='HTD_0.5') # Getting the sample-to-sample distance

Metric Evaluation using K-means and the original labels

[11]:

dt=AaccPDAC.eval_all(y=pdata_metadata.loc[AaccPDAC.p.columns,'accLabel'].astype('category').cat.codes)

[12]:

dt

[12]:

{'HTD_0.5': {'ar': 1.0,
  'r': 1.0,
  'fm': 1.0,
  'afm': 1.0,
  'mi': 0.6224869206570488,
  'nmi': 1.0,
  'ami': 1.0,
  'npa': 1.0,
  'psi': 1.0,
  'spsi': 1.0,
  'nca': 1.0}}

5 Clustering using the graph-to-graph distance

Although K-medoids can also be used for clustering scACCorDiON patient-patient distance matrix, in the tutorial we focus in the usage of the K-Barycenters. This usage is motivated by the latent variables infered by K-Barycenters.

km = kmedoids.KMedoids(n_clusters=3, method='fasterpam')
plt.figure(figsize=(6,5))
c = km.fit(AaccPDAC.wdist['HTD_0.5'].to_numpy())

5.1.1 Here we compute the K-barycenters

[13]:

X =np.round(AaccPDAC.wdist['HTD_0.5'].loc[AaccPDAC.wdist['HTD_0.5'].index,AaccPDAC.wdist['HTD_0.5'].index],5)
clust,loss = [],[]
for _ in tqdm(range(100)):
    kmeans = actl.KBarycenters(k=3,init='++',random_state=_,max_iters=100)
    kmeans.fit(X,distr=AaccPDAC.p,cost=AaccPDAC.Cs['HTD_0.5'])
    clust.append(kmeans)
    loss.append(kmeans.err[-1])
ncls= clust[np.argmin(loss)]

100%|█████████████████████████████████████████| 100/100 [01:02<00:00,  1.59it/s]

5.1.2 Updating the distance matrix with the Barycenters

[14]:

aux = ncls.centroids.to_dict()
for i in range(3):
    for j in range(i,3):
        if i!=j:
            aux[i][j] = ot.emd2(a=ncls.bary[i]/ncls.bary[i].sum(),
                                b=ncls.bary[j]/ncls.bary[j].sum(),M=AaccPDAC.Cs['HTD_0.5'])
            aux[j][i] = ot.emd2(a=ncls.bary[i]/ncls.bary[i].sum(),
                                b=ncls.bary[j]/ncls.bary[j].sum(),M=AaccPDAC.Cs['HTD_0.5'])
        else:
            aux[i][j] = 0
eux = pd.concat([AaccPDAC.wdist['HTD_0.5'],ncls.centroids],axis=1)
eux = pd.concat([eux,pd.DataFrame.from_dict(aux).T])
tmplab = ncls.flabels.tolist()
for i in [0,1,2]:
    tmplab.append(i)

Defining the barycenters

[15]:

tmpstl = ['0']*len(AaccPDAC.p.columns)+['1']*3

5.1.3 Computing the new embeddings and plotting the new data representation

[16]:

import phate

[17]:

%matplotlib agg
emb1 = phate.PHATE(knn=8,knn_dist='precomputed',random_state=42)
emb1 = emb1.fit_transform(eux.to_numpy())
fs,axs = plt.subplots(1,2,figsize=(14,5))

Calculating PHATE...
  Running PHATE on precomputed distance matrix with 38 observations.
  Calculating graph and diffusion operator...
    Calculating affinities...
  Calculating optimal t...
    Automatically selected t = 5
  Calculated optimal t in 0.01 seconds.
  Calculating diffusion potential...
  Calculating metric MDS...
Calculated PHATE in 0.03 seconds.

[18]:

barys = pd.DataFrame(ncls.bary,index=AaccPDAC.p.loc[list(AaccPDAC.expgraph.nodes()),:].index)
barys.reset_index(inplace=True)
barys[['u', 'v']] = barys['index'].str.split('$', expand=True)
cmapse =[plt.cm.Greens,plt.cm.Blues,plt.cm.Reds]
vtxcmap = {i[1]:plt.cm.tab10.colors[i[0]] for i in enumerate(AaccPDAC.nodes)}

[19]:

sns.scatterplot(x=emb1[0:-3,0],
                y=emb1[0:-3,1],hue=pdata_metadata.loc[AaccPDAC.Cs['PCA'].index,'accLabel'],s=200,linewidth=0,ax=axs[0])
axs[0].legend(loc=0, bbox_to_anchor=(1, 0.5))
axs[0].invert_xaxis()
sns.scatterplot(x=emb1[:,0],
                y=emb1[:,1],hue=tmplab,s=200,style=tmpstl,palette=list(plt.cm.Set1.colors)[0:3][::-1],
                size=tmpstl,sizes=(400,200),linewidth=0,ax=axs[1])
axs[1].legend(loc=0, bbox_to_anchor=(1, 0.5))
#axs[1].invert_xaxis()
fs

[19]:

../_images/notebooks_PDAC-Learning-distances-and-Clustering_40_0.png