About the Project

The PIFiA dataset comprises images from high-throughput screens of the yeast ORF-GFP collection, combined with single-cell analysis of subcellular localization patterns obtained with the self-supervised deep neural network PIFiA, as described in Razdaibiedina et al. 2023. It contains images of 4049 strains, each expressing a GFP-tagged protein visible above background fluorescence, acquired with an automated confocal microscope. Cells were imaged in two biological replicates, with four fields of view per replicate for each GFP-tagged strain. All screens are available on TheCellVision.org, together with single-cell micrographs and the results of PIFiA downstream analyses of subcompartmental localization patterns.

Whole-proteome interactive t-SNE map

On the main page of the PIFiA project on the TheCellVision.org website, we provide an interactive t-SNE map of 4049 proteins, computed from features obtained across the whole proteome. Each point on the map represents a protein and can be selected interactively. Upon selection, a tab with the protein description and PIFiA downstream analysis results appears. The t-SNE map is colored according to 15 localization categories defined by Huh et al., 2003 (left legend); protein complexes can be selected interactively to highlight them on the map (right legend).

Search tool results

Results for a protein appear in three tabs after the protein is selected on the map or its name is entered in the search box:

  1. Description
    • Standard name:  Official or standardized name of the protein
    • ORF:  Open Reading Frame - the DNA sequence that potentially encodes a protein
    • Aliases:  Other names or identifiers used to refer to the protein
    • Human Ortholog:  Equivalent protein in humans, if known
    • Description:  Brief summary of the protein function according to “Saccharomyces Genome Database” (https://www.yeastgenome.org/)
    • Localization:  Subcellular localization of the protein defined by PIFiA standard from Razdaibiedina et al., 2024
    • Localization Type:  Type of homogeneity of subcellular localization (e.g., homogeneous; mixed OR-type; mixed AND-type). Defined by the percentage of cells that exhibit localization heterogeneity.
    • Cell Percentages:  Percentage of cells in which the protein is localized (reported for predominant localization)
    • Cell Cycle Regulation:  Any information about how the protein's localization or function may vary during the cell cycle.
    • Subcompartmental Group:  Category indicating the specific sub-compartment inside the organelle where the protein is localized (e.g. nucleus-5; cytoplasm-2) from Razdaibiedina et al., 2024.
  2. Images
    • Protein name:  Official or standardized name of the protein shown in the GFP screen
    • Replicate:  Biological replicate (1 or 2) from which the images were taken
  3. Analysis
    • Nearest neighbours:  Proteins whose subcellular localization patterns resemble that of the queried protein, based on the similarity of their PIFiA feature profiles
    • Correlation threshold:  Minimum correlation between feature profiles required for a protein to count as similar; adjusting it changes the stringency of the comparison
    • Show neighbors on t-SNE:  Highlights the similar proteins on the whole-proteome t-SNE map, showing where they fall relative to one another and to the rest of the proteome
    • Enrichment analyses based on Gene Ontology (GO):  Results of enrichment analyses that identify significant associations between proteins and the biological processes, molecular functions, and cellular components defined by Gene Ontology terms. Results are shown in tables with the following columns:
      • GO term:  Identifier for the Gene Ontology term associated with the enrichment analysis result
      • Term name:  Descriptive name of the Gene Ontology term
      • Overlap:  Number of genes from the nearest neighbours that overlap with the GO term
      • P-value:  Probability of observing an overlap at least this large by chance; smaller values indicate greater statistical significance
      • Adjusted p-value:  P-value adjusted for multiple hypothesis testing to control for false positives
      • -log10 adjusted p-value:  Negative base-10 logarithm of the adjusted p-value, so that larger values indicate greater significance
      • Fold enrichment:  Ratio of the observed overlap to the expected overlap, indicating the enrichment of genes associated with the GO term
      • Genes:  List of genes from the nearest neighbours that are associated with the GO term, showing which proteins contribute to the enriched biological process, molecular function, or cellular component
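
As a rough illustration of how the quantities in the Analysis tab relate to one another, the following Python sketch selects nearest neighbours with a correlation threshold and then fills in the enrichment columns for them. It is not the PIFiA implementation: the use of Pearson correlation, a one-sided hypergeometric test, and Benjamini-Hochberg adjustment are assumptions, and every function name, protein name, GO identifier, feature value, and the background size below is hypothetical.

```python
from math import comb, log10, sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def nearest_neighbours(query, profiles, threshold):
    """Proteins whose profile correlates with the query at or above the threshold."""
    return sorted(
        name for name, vec in profiles.items()
        if name != query and pearson(profiles[query], vec) >= threshold
    )


def hypergeom_pvalue(k, n, K, N):
    """P(overlap >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_table(neighbours, annotations, background_size):
    """One row per GO term, with the columns described in the list above."""
    n = len(neighbours)
    rows = []
    for term, genes in annotations.items():
        overlap = sorted(set(neighbours) & genes)
        k, K = len(overlap), len(genes)
        expected = n * K / background_size  # overlap expected by chance
        rows.append({
            "go_term": term,
            "overlap": k,
            "p_value": hypergeom_pvalue(k, n, K, background_size),
            "fold_enrichment": k / expected if expected else 0.0,
            "genes": overlap,
        })
    adjusted = benjamini_hochberg([row["p_value"] for row in rows])
    for row, adj in zip(rows, adjusted):
        row["adjusted_p_value"] = adj
        row["neg_log10_adjusted_p"] = -log10(adj)
    return rows


# Hypothetical feature profiles and GO annotations, for illustration only.
profiles = {
    "PROT_A": [1.0, 2.0, 3.0, 4.0],
    "PROT_B": [2.0, 4.0, 6.0, 8.1],
    "PROT_C": [1.0, 2.1, 2.9, 4.2],
    "PROT_D": [4.0, 3.0, 2.0, 1.0],
    "PROT_E": [1.0, 3.0, 2.0, 4.0],
}
annotations = {
    "GO:TOY1": {"PROT_B", "PROT_C", "PROT_F"},
    "GO:TOY2": {"PROT_D", "PROT_E"},
}

neighbours = nearest_neighbours("PROT_A", profiles, threshold=0.9)
table = enrichment_table(neighbours, annotations, background_size=20)
for row in table:
    print(row)
```

Raising the threshold in this sketch shrinks the neighbour list, which in turn changes the overlap, the expected overlap, and therefore every downstream column of the table.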