Finding Cell-Cell Communication groups using scACCorDiON

James S. Nagai

02.05.2023

Note: Before using ACCorDIoN make sure that you have the CrossTalkeR installed in your local R enviroment(github)

Here we introduce scACCorDIoN. In this notebook, we guide you into the framework steps. %matplotlib inline

Load Libraries

[1]:
from scaccordion import tl as actl
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pydiffmap as dm
import pandas as pd
import networkx as nx
from skimpy import skim
from sklearn import covariance
import conorm
import phate
import warnings
import kmedoids
import numpy as np
from tqdm import tqdm
import ot
import os
from sklearn import manifold
import itertools as it

np.random.seed(859)
warnings.filterwarnings("ignore")
/home/james/.local/lib/python3.10/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.2' currently installed).
  from pandas.core import (

1. Loading data and metadata

Here, we explore the Pancreas Adenocarcinoma Cohort(Peng et.al.,2019). The data available in scACCorDiON were pre-processed. The Ligand-Receptor inference were performed using the CellphoneDB methods (Efremova et. al., 2020) implemented in LIANA(Dimitrov et. al.,2022). Networks were generated using CrossTalkeR(Nagai et. al., 2021)

[2]:
! wget https://costalab.org/wp-content/uploads/2025/01/PDAC_Data.zip
--2025-01-29 14:34:24--  https://costalab.org/wp-content/uploads/2025/01/PDAC_Data.zip
Resolving costalab.org (costalab.org)... 195.30.85.240
Connecting to costalab.org (costalab.org)|195.30.85.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4004362 (3.8M) [application/zip]
Saving to: ‘PDAC_Data.zip’

PDAC_Data.zip       100%[===================>]   3.82M  --.-KB/s    in 0.1s

2025-01-29 14:34:24 (30.8 MB/s) - ‘PDAC_Data.zip’ saved [4004362/4004362]

[3]:
! unzip  PDAC_Data.zip
Archive:  PDAC_Data.zip
   creating: PDAC_Data/
   creating: PDAC_Data/LR_data/
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N1.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N10.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N11.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N2.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N3.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N4.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N5.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N6.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N7.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N8.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|N9.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T1.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T10.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T11.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T12.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T13.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T14.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T15.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T16.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T17.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T18.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T19.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T2.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T20.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T21.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T22.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T23.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T24.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T3.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T4.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T5.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T6.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T7.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T8.csv
  inflating: PDAC_Data/LR_data/ct_Peng_PDAC_processed|T9.csv
  inflating: PDAC_Data/Peng_PDAC_metaacc.h5ad

1. Read the data and selecting the tables

[4]:
## Reading the data
#path = "/home/james/sciebo/Zenodo_scACCordion/data/ctkerout/Peng_PDAC_processed/"
path = "./PDAC_Data/LR_data/"
#print(lines)
data = {}
for idi in enumerate(os.listdir(path)):
    if idi[1].endswith(".csv"):
            #print(idi)
            i = idi[1]
            data[i[i.find("|")+1:i.find(".csv")]] = pd.read_csv(path+idi[1])
print(data.keys())
## Reading metadata
pdata_metadata = pd.read_hdf("./PDAC_Data/Peng_PDAC_metaacc.h5ad")
dict_keys(['T14', 'N1', 'T12', 'T2', 'T23', 'T22', 'N5', 'N8', 'T6', 'N3', 'T4', 'N11', 'T18', 'T9', 'N2', 'T21', 'T11', 'T16', 'T17', 'T3', 'N10', 'T5', 'N9', 'T15', 'N6', 'T10', 'N7', 'T20', 'T7', 'T24', 'T19', 'T8', 'N4', 'T1', 'T13'])

1.1 Original Labels

[5]:
pdata_metadata.groupby('accLabel').size()
[5]:
accLabel
normal    11
pdac      24
dtype: int64

1.2 Table Format

[6]:
data['N1'].head()
[6]:
Unnamed: 0 ligand receptor source target lr_means cellphone_pvals type_gene_A type_gene_B gene_A gene_B MeanLR
0 0 INS INSR Endocrine cell Endothelial cell 4.543838 0.0 Ligand Receptor INS INSR 4.543838
1 1 INS INSR Endocrine cell Acinar cell 4.365789 0.0 Ligand Receptor INS INSR 4.365789
2 2 INS INSR Endocrine cell Ductal cell type 1 4.270591 0.0 Ligand Receptor INS INSR 4.270591
3 3 INS IGF1R Endocrine cell Endocrine cell 4.250901 0.0 Ligand Receptor INS IGF1R 4.250901
4 4 INS INSR Endocrine cell Fibroblast cell 4.248840 0.0 Ligand Receptor INS INSR 4.248840

2. Now with the tables from CrossTalkeR/Liana we build the Accordion Object

[7]:
def norm(x):
    '''
        Defining the normalization function.
    '''
    nf = conorm.tmm(x+1e-6,trim_lfc=0,trim_mag=0)
    return (nf)

Creating an Accordion Object

[8]:
AaccPDAC = actl.Accordion(tbls=data,weight='lr_means',normf=norm,filter=0.2)

3. Computing a PCA using the edge distribuition per sample

[9]:
AaccPDAC.make_pca()
sns.scatterplot(x=AaccPDAC.Cs['PCA'][0],y=AaccPDAC.Cs['PCA'][1],
                hue=pdata_metadata.loc[AaccPDAC.Cs['PCA'].index,'accLabel'])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel("PC1")
plt.ylabel("PC2")

[9]:
Text(0, 0.5, 'PC2')
../_images/notebooks_PDAC-Learning-distances-and-Clustering_18_1.png

4 Optimal Transport: Wasserstein

In this step we compute the graph-to-graph distance solving the Optimal Transport minimization problem. Note that, the Optimal Transport is done in two main steps: (1) we use the function compute_cost() to compute an edge-to-edge distance and (2) we use the edge weights and the cost to get a graph-to-graph distance matrix using compute_wasserstein()

AaccPDAC.compute_cost_all() ## Run all the possible cost matrices
[10]:
AaccPDAC.compute_cost(mode='HTD') # Computing the cost
AaccPDAC.compute_wassestein(cost='HTD_0.5') # Getting the sample-to-sample distance

Metric Evaluation using K-means and the original labels

[11]:
dt=AaccPDAC.eval_all(y=pdata_metadata.loc[AaccPDAC.p.columns,'accLabel'].astype('category').cat.codes)
[12]:
dt
[12]:
{'HTD_0.5': {'ar': 1.0,
  'r': 1.0,
  'fm': 1.0,
  'afm': 1.0,
  'mi': 0.6224869206570488,
  'nmi': 1.0,
  'ami': 1.0,
  'npa': 1.0,
  'psi': 1.0,
  'spsi': 1.0,
  'nca': 1.0}}

5 Clustering using the graph-to-graph distance

Although K-medoids can also be used for clustering scACCorDiON patient-patient distance matrix, in the tutorial we focus in the usage of the K-Barycenters. This usage is motivated by the latent variables infered by K-Barycenters.

km = kmedoids.KMedoids(n_clusters=3, method='fasterpam')
plt.figure(figsize=(6,5))
c = km.fit(AaccPDAC.wdist['HTD_0.5'].to_numpy())

5.1.1 Here we compute the K-barycenters

[13]:
X =np.round(AaccPDAC.wdist['HTD_0.5'].loc[AaccPDAC.wdist['HTD_0.5'].index,AaccPDAC.wdist['HTD_0.5'].index],5)
clust,loss = [],[]
for _ in tqdm(range(100)):
    kmeans = actl.KBarycenters(k=3,init='++',random_state=_,max_iters=100)
    kmeans.fit(X,distr=AaccPDAC.p,cost=AaccPDAC.Cs['HTD_0.5'])
    clust.append(kmeans)
    loss.append(kmeans.err[-1])
ncls= clust[np.argmin(loss)]
100%|█████████████████████████████████████████| 100/100 [01:02<00:00,  1.59it/s]

5.1.2 Updating the distance matrix with the Barycenters

[14]:
aux = ncls.centroids.to_dict()
for i in range(3):
    for j in range(i,3):
        if i!=j:
            aux[i][j] = ot.emd2(a=ncls.bary[i]/ncls.bary[i].sum(),
                                b=ncls.bary[j]/ncls.bary[j].sum(),M=AaccPDAC.Cs['HTD_0.5'])
            aux[j][i] = ot.emd2(a=ncls.bary[i]/ncls.bary[i].sum(),
                                b=ncls.bary[j]/ncls.bary[j].sum(),M=AaccPDAC.Cs['HTD_0.5'])
        else:
            aux[i][j] = 0
eux = pd.concat([AaccPDAC.wdist['HTD_0.5'],ncls.centroids],axis=1)
eux = pd.concat([eux,pd.DataFrame.from_dict(aux).T])
tmplab = ncls.flabels.tolist()
for i in [0,1,2]:
    tmplab.append(i)

Defining the barycenters

[15]:
tmpstl = ['0']*len(AaccPDAC.p.columns)+['1']*3

5.1.3 Computing the new embeddings and plotting the new data representation

[16]:
import phate
[17]:
%matplotlib agg
emb1 = phate.PHATE(knn=8,knn_dist='precomputed',random_state=42)
emb1 = emb1.fit_transform(eux.to_numpy())
fs,axs = plt.subplots(1,2,figsize=(14,5))
Calculating PHATE...
  Running PHATE on precomputed distance matrix with 38 observations.
  Calculating graph and diffusion operator...
    Calculating affinities...
  Calculating optimal t...
    Automatically selected t = 5
  Calculated optimal t in 0.01 seconds.
  Calculating diffusion potential...
  Calculating metric MDS...
Calculated PHATE in 0.03 seconds.
[18]:
barys = pd.DataFrame(ncls.bary,index=AaccPDAC.p.loc[list(AaccPDAC.expgraph.nodes()),:].index)
barys.reset_index(inplace=True)
barys[['u', 'v']] = barys['index'].str.split('$', expand=True)
cmapse =[plt.cm.Greens,plt.cm.Blues,plt.cm.Reds]
vtxcmap = {i[1]:plt.cm.tab10.colors[i[0]] for i in enumerate(AaccPDAC.nodes)}
[19]:
sns.scatterplot(x=emb1[0:-3,0],
                y=emb1[0:-3,1],hue=pdata_metadata.loc[AaccPDAC.Cs['PCA'].index,'accLabel'],s=200,linewidth=0,ax=axs[0])
axs[0].legend(loc=0, bbox_to_anchor=(1, 0.5))
axs[0].invert_xaxis()
sns.scatterplot(x=emb1[:,0],
                y=emb1[:,1],hue=tmplab,s=200,style=tmpstl,palette=list(plt.cm.Set1.colors)[0:3][::-1],
                size=tmpstl,sizes=(400,200),linewidth=0,ax=axs[1])
axs[1].legend(loc=0, bbox_to_anchor=(1, 0.5))
#axs[1].invert_xaxis()
fs
[19]:
../_images/notebooks_PDAC-Learning-distances-and-Clustering_40_0.png