Doublets of cells are a common feature in scRNA-seq data, arising when two cells end up in the same well/droplet and are sequenced together under one barcode. They are either homotypic (from the same cell type) or heterotypic (from different cell types). Depending on the technology, there is an expected proportion of doublets; in 10x technology this proportion scales roughly linearly with the number of loaded cells.
One approach to detecting doublets is to simulate them: pairs of cells are randomly subsampled and their gene expression is averaged to obtain artificial doublet counts. The artificial doublets are projected together with the original cells into a lower dimensional principal component space, and a doublet score is then computed for each cell based on the number of artificial doublet neighbours in its k-nearest-neighbour graph.
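To make the idea concrete, below is a toy sketch of this procedure using numpy and scikit-learn. It is not the actual scrublet implementation: it skips normalization, variable gene selection and the expected-rate calibration that scrublet performs, and the function name and parameters are only illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def toy_doublet_scores(counts, n_sim=None, n_pcs=20, k=30, seed=0):
    """Score cells by how many simulated doublets sit among their neighbours."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    n_sim = n_sim or n_cells

    # 1. Average the expression of randomly sampled pairs of cells.
    pairs = rng.integers(0, n_cells, size=(n_sim, 2))
    doublets = (counts[pairs[:, 0]] + counts[pairs[:, 1]]) / 2

    # 2. Project real cells and simulated doublets into the same PCA space.
    combined = np.vstack([counts, doublets])
    pcs = PCA(n_components=n_pcs).fit_transform(np.log1p(combined))

    # 3. Doublet score = fraction of simulated doublets among the k neighbours.
    nn = NearestNeighbors(n_neighbors=k).fit(pcs)
    _, idx = nn.kneighbors(pcs[:n_cells])
    return (idx >= n_cells).mean(axis=1)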
Most such packages need an assumption about the number or proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that about 9 000 cells were loaded and that the doublet rate is around 4%.
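As a back-of-the-envelope check, the expected rate can be computed from a commonly quoted 10x rule of thumb of roughly 0.8% multiplets per 1 000 recovered cells; the exact coefficient depends on the kit and is an assumption here, not a value taken from this dataset's protocol.
# rough 10x rule of thumb: ~0.8% multiplets per 1 000 recovered cells
cells_recovered = 5000
expected_doublet_rate = 0.008 * cells_recovered / 1000
print(f"expected doublet rate: {expected_doublet_rate:.1%}")  # about 4%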
scrublet is a tool which is used to identify doublets. It works with the raw counts saved in adata.X.
Note that doublet detection should be done for each sample separately. To do this we will split the data in a loop, run scrublet for each sample, and add the detected doublets to the adata object.
import scrublet as scr
import pandas as pd

# split per batch into new objects.
batches = adata.obs['sample'].cat.categories.tolist()
alldata = {}
for batch in batches:
    tmp = adata[adata.obs['sample'] == batch, :]
    print(batch, ":", tmp.shape[0], " cells")
    # run scrublet on the raw counts of this sample
    scrub = scr.Scrublet(tmp.X)
    out = scrub.scrub_doublets(verbose=False, n_prin_comps=20)
    alldata[batch] = pd.DataFrame({'doublet_score': out[0], 'detected_doublets': out[1]}, index=tmp.obs.index)
    print(alldata[batch].detected_doublets.sum(), " detected_doublets")
# add predictions to the adata object.
scrub_pred = pd.concat(alldata.values())
adata.obs['doublet_scores'] = scrub_pred['doublet_score']
adata.obs['detected_doublets'] = scrub_pred['detected_doublets']
print('Total doublets:' , sum(adata.obs['detected_doublets']))
Since a doublet consists of two cells sequenced under a single cell barcode, we expect doublets to have a higher number of detected genes than true single cells. Let us check whether that holds for the predicted doublets.
# add a column with a string version of the doublet prediction, to use for grouping in plots
%matplotlib inline
adata.obs['doublet_info'] = adata.obs["detected_doublets"].astype(str)
sc.pl.violin(adata, 'n_genes_by_counts', jitter=0.4, groupby = 'doublet_info', rotation=45)
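As an optional numeric check (a minimal sketch assuming the QC metric n_genes_by_counts used in the violin plot is present in adata.obs), you can also compare the median number of detected genes between predicted singlets and doublets:
# doublets are expected to show a higher median number of detected genes
print(adata.obs.groupby('doublet_info')['n_genes_by_counts'].median())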
adata.write(os.path.join(path_results, "covid_filtered_removed_doublets.h5ad"))
In order to show the distribution of doublets among all the cells, we will do the following steps:
- Normalize and scale the data.
- Run dimensionality reduction (you will read more about this further below).
- Create a UMAP embedding (you will read more about this further below).
Note: When you normalize the data, the counts in adata.X will be overwritten with the normalized values. In order to keep a copy of the data before scaling and subsetting to variable genes, you can save it in adata.raw.
If you want to look at the distribution of doublets among the cells in a reduced dimensional space, run the following chunk. Dimensionality reduction is covered in more detail further down.
adata = sc.read_h5ad(os.path.join(path_results, "covid_filtered_removed_doublets.h5ad"))
# normalize to depth 10 000
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
# logarithmize
sc.pp.log1p(adata)
# save normalized counts in the raw slot.
adata.raw = adata
# Identifying highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['doublet_scores','doublet_info','sample'])
Now we can remove all the detected doublets.
# also revert back to the full (normalized) gene matrix stored in adata.raw
adata = adata.raw.to_adata()
adata = adata[adata.obs['doublet_info'] == 'False',:]
print(adata.shape)
# Saving the filtered object
adata.write_h5ad(os.path.join(path_results, "covid_filtered_removed_doublets.h5ad"))
Cell cycle state
There are gene markers by which we can score the cell cycle state. The algorithm calculates the difference between the mean expression of the given gene list and the mean expression of a set of reference genes. To build the reference, the function randomly samples genes that match the expression distribution of the given list. Cell cycle scoring adds three slots to the metadata: a score for S phase, a score for G2M phase, and the predicted cell cycle phase.
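To make the scoring logic concrete, here is a toy re-implementation of the idea (not scanpy's actual code; the function name and parameters are illustrative, and the binning details differ from the real implementation).
import numpy as np
import pandas as pd

def toy_score_genes(expr, gene_list, n_bins=25, ctrl_size=50, seed=0):
    """expr: cells x genes DataFrame of normalized expression."""
    rng = np.random.default_rng(seed)
    gene_list = [g for g in gene_list if g in expr.columns]

    # bin all genes by their average expression across cells
    avg = expr.mean(axis=0)
    bins = pd.qcut(avg.rank(method='first'), n_bins, labels=False)

    # sample control genes from the same bins as the list genes,
    # so the reference matches the expression distribution of the list
    ctrl_genes = set()
    for b in bins[gene_list].unique():
        pool = bins.index[bins == b].difference(gene_list)
        take = min(ctrl_size, len(pool))
        ctrl_genes.update(rng.choice(pool, size=take, replace=False))

    # score = mean expression of the list minus mean expression of the reference
    return expr[gene_list].mean(axis=1) - expr[list(ctrl_genes)].mean(axis=1)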
First, download the file with cell cycle genes from the Regev lab, then read it in and split it into S phase and G2M phase gene lists.
path_file = os.path.join(path_results, 'regev_lab_cell_cycle_genes.txt')
if not os.path.exists(path_file):
    urllib.request.urlretrieve(os.path.join(url_data, 'regev_lab_cell_cycle_genes.txt'), path_file)
cell_cycle_genes = [x.strip() for x in open(path_file)]
print(len(cell_cycle_genes))
# Split into 2 lists
s_genes = cell_cycle_genes[:43]
g2m_genes = cell_cycle_genes[43:]
cell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]
print(len(cell_cycle_genes))
The data is already normalized and we can perform cell cycle scoring. The function sc.tl.score_genes_cell_cycle is actually a wrapper around sc.tl.score_genes, which is run twice to score the S and G2M phases separately. Both functions are ports from Seurat and are supposed to work in a very similar way.
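To illustrate the point, the chunk below sketches what the wrapper does under the hood by scoring the two lists separately; you would normally just call the wrapper as in the next chunk, and the control-gene settings here are left at their defaults rather than the ones the wrapper picks.
# roughly what the wrapper does internally (a sketch): score each list separately
sc.tl.score_genes(adata, gene_list=s_genes, score_name='S_score')
sc.tl.score_genes(adata, gene_list=g2m_genes, score_name='G2M_score')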
#adata = sc.read_h5ad(os.path.join(path_results, ('scanpy_covid_qc.h5ad')))
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
sc.pl.violin(adata, ['S_score', 'G2M_score'], jitter=0.4, groupby = 'sample', rotation=45)
In this case it looks like we only have a few cycling cells in these datasets and we do not need to regress out the cell cycle effect.
Scanpy makes an automatic prediction of the cell cycle phase, with a default cutoff of the scores at zero. As you can see, this does not fit this data very well, so be cautious about using these predictions. Instead we suggest that you look at the scores themselves.
sc.pl.scatter(adata, x='S_score', y='G2M_score', color="phase")
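If you prefer to work with the scores directly, a minimal sketch could look like the following; the 0.1 cutoff is an arbitrary illustration, not a recommended value.
# summarize the scores per predicted phase to judge how well the zero cutoff fits
print(adata.obs.groupby('phase')[['S_score', 'G2M_score']].describe())
# example of a custom call: treat a cell as cycling only above a chosen cutoff
cycling = adata.obs[['S_score', 'G2M_score']].max(axis=1) > 0.1
print(cycling.value_counts())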
adata.write_h5ad(os.path.join(path_results, "covid_filtered_removed_doublets_cell_cycle.h5ad"))