import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scanpy as sc
import scvelo as scv
RNA velocity with scvelo
Download Presentation: RNA velocity
RNA velocity is a method for estimating the rate of change in gene expression in scRNA-seq dataset. For pseudotime trajectory inference, you need to specify a “root” cell at the start of the trajectory. On the other hand, RNA velocity tells you the direction along which cells are evolving in gene expression space. RNA velocity achieves this by examining the ratio of unspliced (intron-containing pre-mRNA) and spliced (exon-only mature mRNA) UMIs in a dataset.
Loading data into scvelo and examining spliced/unspliced counts
In the presentation, we walked through the theoretical foundations behind RNA velocity. Here, we will demonstrate how to practically apply it to a dataset using the scvelo package, which is nicely integrated with the scanpy
framework we have been working with during the past two days. This exercise is adapted from tutorials on the scvelo documentation page.
First, we will load a dataset on pancreatic endocrinogenesis from a recent study:
= scv.datasets.pancreas()
adata adata
AnnData object with n_obs × n_vars = 3696 × 27998
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
var: 'highly_variable_genes'
uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'distances', 'connectivities'
Exercise 1: As we can see from the metadata, this dataset already contains a low dimensional UMAP embedding with cluster annotations. Can you plot this UMAP embedding, coloring the cells by clusters
?
# your code here
Exercise 2: Since we will run RNA velocity on these data, we need to have both a spliced
and unspliced
count matrix. These should be stored in the adata
object. Can you find them in adata.layers
?
Using the two matrices, can you calculate the fraction of unspliced UMI counts out of the total UMIs, averaged across all of the cells? Hint: You can do this computationally using the .sum(1)
and .mean()
functions.
We can also use scvelo.pl.proportions
to compute and display the proportions of spliced/unspliced counts. Depending on the protocol used, we typically have between 10%-25% of unspliced molecules containing intronic sequences. For single-nuclei data, you will have many more intronic reads, approximately 60%-70%. We suggest you to examine the variations on cluster level to verify consistency in splicing efficiency.
Exercise 3: Use scv.pl.proportions
to plot the proportions of spliced and unspliced UMIs. Use the groupby
argument to specify that you want to compute the proportions separately for each cluster annotation.
# your code here
Here, we find variations as expected, with slightly lower unspliced proportions at cycling ductal cells, then higher proportion at cell fate commitment in Ngn3-high and Pre-endocrine cells, where many genes start to be transcribed.
Data preprocessing
Next, as with our standard scRNA-seq analysis pipeline, we need to preprocess the data! This requires performing the following steps, which you have studied in previous exercises: - Gene filtering (with a minimum number of counts per cell) - Normalization - Log transformation
In scvelo, these steps are combined into a single function, called scv.pp.filter_and_normalize
. We will run that command below with two parameters specified: - min_shared_counts
requires a minimum number of counts (both spliced and unspliced) for all genes; any other genes are filtered out - n_top_genes
is similar to sc.pp.highly_variable_genes
from scanpy, finding the top variable genes and filtering out the others
Exercise 4: Run the filter_and_normalize
command requiring a min_shared_counts
of at least 20 and selecting 2,000 n_top_genes
.
Exercise 5: As we mentioned, scv.pp.filter_and_normalize
combines several scanpy functions into a single command. However, if you want full control over the filtering, normalization, and log-transformation steps, you can run each command individually. Can you write the five lines of code needed to achieve the above steps?
Exercise 6: Next, we need to compute a PCA and neighborhood graph, as we have done previously. Write the scanpy
commands to compute the PCA and then the neighborhood graph (using n_pcs=30
and n_neighbors=30
)
Next, we need to compute the first and second order moments (means and uncentered variances) computed among nearest neighbors in PCA space, summarized in scv.pp.moments
.
=None, n_neighbors=None) scv.pp.moments(adata, n_pcs
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode=‘deterministic’). For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
scv.tl.velocity(adata)
computing velocities
finished (0:00:01) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
The combination of velocities across genes can then be used to estimate the future state of an individual cell. In order to project the velocities into a lower-dimensional embedding, transition probabilities of cell-to-cell transitions are estimated. That is, for each velocity vector we find the likely cell transitions that are accordance with that direction. The transition probabilities are computed using cosine correlation between the potential cell-to-cell transitions and the velocity vector, and are stored in a matrix denoted as velocity graph. The resulting velocity graph has dimension 𝑛𝑜𝑏𝑠×𝑛𝑜𝑏𝑠 and summarizes the possible cell state changes that are well explained through the velocity vectors (for runtime speedup it can also be computed on reduced PCA space by setting approx=True).
scv.tl.velocity_graph(adata)
computing velocity graph (using 1/160 cores)
finished (0:00:10) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
Finally, the velocities are projected onto any embedding, specified by basis, and visualized in one of these ways:
- on cellular level with
scv.pl.velocity_embedding
- as gridlines with
scv.pl.velocity_embedding_grid
- as streamlines with
scv.pl.velocity_embedding_stream
.
Note, that the data has an already pre-computed UMAP embedding, and annotated clusters. When applying to your own data, these can be obtained with scv.tl.umap
and scv.tl.louvain
, which are direct wrappers around the scanpy functions.
The most fine-grained resolution of the velocity vector field we get at single-cell level, with each arrow showing the direction and speed of movement of an individual cell.
Exercise 7: Plot the RNA velocity using the above functions scv.pl.velocity_embedding
and scv.pl.velocity_embedding_stream
. What looks different about the two representations? Can you color
the embeddings by the clusters? For scv.pl.velocity_embedding
, what happens what you change the values for the arrow_length
and arrow_size
parameters?
# your code here
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
# your code here
Phase portraits and gene-wise velocity results
This is perhaps the most important part! RNA velocity is computed for each gene with scvelo, and the user should not limit biological conclusions to the projected velocities, but rather examine individual gene dynamics via phase portraits. This will allow you to better understand how inferred directions are supported by particular genes.
As we discussed in our live tutorial into the theoretical foundations of RNA velocity: Gene activity is orchestrated by transcriptional regulation. Transcriptional induction for a particular gene results in an increase of (newly transcribed) precursor unspliced mRNAs while, conversely, repression or absence of transcription results in a decrease of unspliced mRNAs. Spliced mRNAs is produced from unspliced mRNA and follows the same trend with a time lag. Time is a hidden/latent variable. Thus, the dynamics needs to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.
Exercise 8: Now, let us examine the phase portraits of some marker genes with scv.pl.velocity(adata, gene_names)
. Set var_names
to a list containing the genes Cpe
, Gnao1
, Ins2
and Ank
.
# your code here
Exercise 9: Describe the plots above: what can you tell me about the patterns of the four genes, Cpe, Gnao1, Ins2, Adk, along the differentiation trajectory. Transitioning from ductal cells to mature alpha and beta cells, which genes are being upregulated? Which genes are being downregulated along the same trajectory? Are any of the genes limited to a particular cell type or lineage?
We can also use scv.pl.scatter
to visualize the phase portrait for a single gene, colored both by the cell clusters and the magnitude of the velocity:
'Cpe', color=['clusters', 'velocity'],
scv.pl.scatter(adata, ='Ngn3 high EP, Pre-endocrine, Beta') add_outline
Exercise 10: What does the black dashed line represent in the phase portrait plots, and how does this relate to the way in which RNA velocity is determined?
We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population. The module scv.tl.rank_velocity_genes
runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.
='clusters', min_corr=.3)
scv.tl.rank_velocity_genes(adata, groupby
= pd.DataFrame(adata.uns['rank_velocity_genes']['names'])
df df.head()
ranking velocity genes
/home/alex/anaconda3/envs/sctp/lib/python3.8/site-packages/scvelo/tools/utils.py:463: DeprecationWarning: Please use `rankdata` from the `scipy.stats` namespace, the `scipy.stats.stats` namespace is deprecated.
from scipy.stats.stats import rankdata
finished (0:00:02) --> added
'rank_velocity_genes', sorted scores by group ids (adata.uns)
'spearmans_score', spearmans correlation scores (adata.var)
Ductal | Ngn3 low EP | Ngn3 high EP | Pre-endocrine | Beta | Alpha | Delta | Epsilon | |
---|---|---|---|---|---|---|---|---|
0 | Notch2 | Ptpn3 | Pde1c | Sdk1 | Pax6 | Zcchc16 | Ptprt | Ica1 |
1 | Sox5 | Slc9a9 | Pclo | Abcc8 | Unc5c | Prune2 | Zdbf2 | Ctnna2 |
2 | Krt19 | Kcnq1 | Grik2 | Gnas | Nnat | Nell1 | Spock3 | Sorcs1 |
3 | Ano6 | Dcbld1 | Ttyh2 | Ptprn2 | Scg3 | Aff2 | Slc16a12 | A1cf |
4 | Rbms3 | Trp53cor1 | Setbp1 | Asic2 | Tmem108 | Chrm3 | Fam155a | Zcchc16 |
= dict(frameon=False, size=10, linewidth=1.5,
kwargs ='Ngn3 high EP, Pre-endocrine, Beta')
add_outline
'Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs)
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs) scv.pl.scatter(adata, df[
The genes Ptprs, Pclo, Pam, Abcc8, Gnas, for instance, support the directionality from Ngn3 high EP (yellow) to Pre-endocrine (orange) to Beta (green).
Dynamical modeling of RNA velocity
Since RNA velocity yields insights into the directionality of gene expression change, we can use the approach to infer a trajectory. One way this is acheives is by recovering estimates of the full transcriptional dynamics (i.e., the transcription rate, the splicing rate, and the degradation rate) instead of using the steady-state asusmption and linear fits. This is particularly useful when you have a dataset without a cluster of cells representing the “steady-state”.
Dynamical modeling of RNA velocity is possible with scvelo and allows for: - Estimation of a latent time - Identification of possible driver genes
We run the dynamical model to learn the full transcriptional dynamics of splicing kinetics.
It is solved in a likelihood-based expectation-maximization framework, by iteratively estimating the parameters of reaction rates and latent cell-specific variables, i.e. transcriptional state and cell-internal latent time. It thereby aims to learn the unspliced/spliced phase trajectory for each gene.
The function scv.tl.recover_dynamics
uses expectation maximization to recover the gene velocity dynamics. It will take about 5 mins to run. In the meantime, have a look at the following steps.
scv.tl.recover_dynamics(adata)
recovering dynamics (using 1/160 cores)
finished (0:06:39) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
Then, we before we need to estiamte the velocity and compute the velocity graph, specifying this time the “dynamical” mode.
='dynamical')
scv.tl.velocity(adata, mode scv.tl.velocity_graph(adata)
computing velocities
finished (0:00:04) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 1/160 cores)
finished (0:00:06) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
='umap') scv.pl.velocity_embedding_stream(adata, basis
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
With the dynamical model, The rates of RNA transcription, splicing and degradation are estimated without the need of any experimental data.
They can be useful to better understand the cell identity and phenotypic heterogeneity.
= adata.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]
df
= dict(xscale='log', fontsize=16)
kwargs with scv.GridSpec(ncols=3) as pl:
'fit_alpha'], xlabel='transcription rate', **kwargs)
pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)
pl.hist(df[
'fit*', dropna=True).head() scv.get_df(adata,
fit_alpha | fit_beta | fit_gamma | fit_t_ | fit_scaling | fit_std_u | fit_std_s | fit_likelihood | fit_u0 | fit_s0 | fit_pval_steady | fit_steady_u | fit_steady_s | fit_variance | fit_alignment_scaling | fit_r2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||||||||
Sntg1 | 0.010688 | 0.003801 | 0.070298 | 26.614919 | 48.409039 | 1.015340 | 0.024470 | 0.382195 | 0.0 | 0.0 | 0.034819 | 2.402895 | 0.072442 | 0.238926 | 6.656648 | 0.467566 |
Sbspon | 0.466790 | 2.485521 | 0.407607 | 3.988951 | 0.145663 | 0.058015 | 0.188189 | 0.267242 | 0.0 | 0.0 | 0.206473 | 0.150839 | 0.503312 | 0.690054 | 1.242929 | 0.630782 |
Mcm3 | 3.759589 | 52.011484 | 0.885513 | 1.945249 | 0.011923 | 0.015849 | 0.686825 | 0.118566 | 0.0 | 0.0 | 0.483623 | 0.060455 | 2.054615 | 1.426064 | 0.765975 | 0.275293 |
Fam135a | 0.185720 | 0.124718 | 0.218989 | 10.854791 | 1.090526 | 0.353996 | 0.153378 | 0.293041 | 0.0 | 0.0 | 0.397257 | 1.331401 | 0.395752 | 0.611777 | 3.308631 | 0.363101 |
Adgrb3 | 0.034546 | 0.007763 | 0.222066 | 8.947714 | 120.303430 | 2.122017 | 0.030074 | 0.295653 | 0.0 | 0.0 | 0.063172 | 4.564255 | 0.093443 | 0.525843 | 1.825699 | 0.399879 |
The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on its transcriptional dynamics. This offers advantages over traditional pseudotime trajectory inference approaches.
scv.tl.latent_time(adata)='latent_time', color_map=plt.cm.Spectral, size=80) scv.pl.scatter(adata, color
computing latent time using root_cells as prior
finished (0:00:01) --> added
'latent_time', shared time (adata.obs)
Exercise 11: Take a look back at the previous exercise, where the diffusion pseudotime was calculated for the same dataset. Are there differences between the dpt_pseudotime
and latent_time estimates
?
Driver genes display pronounced dynamic behavior and are systematically detected via their characterization by high likelihoods in the dynamic model. We can plot a heatmap of the top 300 genes expressed along the pseudotime.
= adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
top_genes =top_genes, sortby='latent_time', col_color='clusters', n_convolve=100) scv.pl.heatmap(adata, var_names
For any top candidates that you might want to validate biologically, it is always essential to examine the phase portraits, to ensure that the gene is not too noisy.
= adata.var['fit_likelihood'].sort_values(ascending=False).index
top_genes =top_genes[:15], ncols=5, frameon=False) scv.pl.scatter(adata, basis
= ['Actn4', 'Ppp3ca', 'Cpe', 'Nnat']
var_names =False)
scv.pl.scatter(adata, var_names, frameon='latent_time', y=var_names, frameon=False) scv.pl.scatter(adata, x
We can then use scv.tl.rank_dynamical_genes
to rank the top dynamical genes by cluster. These are the genes that most strongly contribute to the cell-wise velocity for cells belonging to each cluster.
='clusters')
scv.tl.rank_dynamical_genes(adata, groupby= scv.get_df(adata, 'rank_dynamical_genes/names')
df 5) df.head(
ranking genes by cluster-specific likelihoods
finished (0:00:03) --> added
'rank_dynamical_genes', sorted scores by group ids (adata.uns)
Ductal | Ngn3 low EP | Ngn3 high EP | Pre-endocrine | Beta | Alpha | Delta | Epsilon | |
---|---|---|---|---|---|---|---|---|
0 | Nfib | Lockd | Rbfox3 | Pcsk2 | Pcsk2 | Gnao1 | Pcsk2 | Pak3 |
1 | Top2a | Dcdc2a | Btbd17 | Abcc8 | Ank | Cpe | Abcc8 | Tox3 |
2 | Incenp | Adk | Tspan5 | Rap1b | Abcc8 | Pak3 | Pak3 | Rap1gap2 |
3 | Wfdc15b | Bicc1 | Rap1gap2 | Tmem163 | Tspan7 | Rap1b | Rap1b | Rnf130 |
4 | Cdk1 | Wfdc15b | Sulf2 | Ank | Cryba2 | Pim2 | Meis2 | Meis2 |
for cluster in ['Ductal', 'Ngn3 high EP', 'Pre-endocrine', 'Beta']:
5], ylabel=cluster, frameon=False) scv.pl.scatter(adata, df[cluster][:
Velocities in cycling progenitors
The cell cycle detected by RNA velocity, and it is biologically affirmed by cell cycle scores (standardized scores of mean expression levels of phase marker genes).
scv.tl.score_genes_cell_cycle(adata)=['S_score', 'G2M_score'], smooth=True, perc=[5, 95]) scv.pl.scatter(adata, color_gradients
calculating cell cycle phase
--> 'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)
For the cycling Ductal cells, we may screen through S and G2M phase markers. The previous module also computed a spearmans correlation score, which we can use to rank/sort the phase marker genes to then display their phase portraits.
= scv.utils.get_phase_marker_genes(adata)
s_genes, g2m_genes = scv.get_df(adata[:, s_genes], 'spearmans_score', sort_values=True).index
s_genes = scv.get_df(adata[:, g2m_genes], 'spearmans_score', sort_values=True).index
g2m_genes
= dict(frameon=False, ylabel='cell cycle genes')
kwargs list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs) scv.pl.scatter(adata,
Particularly Hells and Top2a are well-suited to explain the vector field in the cycling progenitors. Top2a gets assigned a high velocity shortly before it actually peaks in the G2M phase. There, the negative velocity then perfectly matches the immediately following down-regulation.
Exercise 12: As you did previously for other genes, can you use scvelo plotting functions to visualize the phase portraits and gene-wise velocity for the Hells
and Top2a
genes?
# your code here
The cell cycle is an interesting case for RNA velocity estimation, as pseudotime methods along often fail as estimations of cyclical processes. Moreover, RNA velocity corresponds roughly to cell cycle speed, which is both experimentally verifiable. The cell cycle also unfolds on a timescale of less than 24 hours, which is well suited for studying cell dynamics using RNA lifecycle kinetics, such as with RNA velocity.