ABI Bioinformatics Guide 2024

Data distribution: PMF, PDF, CDF

Probability Mass Functions (PMFs)

Probability distributions describe how a random variable behaves. For a discrete random variable, we use a Probability Mass Function (PMF), which assigns a probability to each possible outcome. Each probability lies between 0 and 1, and the probabilities over all outcomes sum to 1, so the PMF describes the distribution completely. A PMF lets us read off the probability of observing any specific value of the random variable.

1. PMF of a Fair Six-Sided Die

A fair six-sided die has outcomes {1, 2, 3, 4, 5, 6}. Each outcome has an equal probability of occurring.

$$
P(X = x) = \begin{cases} \frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise} \end{cases}
$$
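
The die PMF above can be sketched in a few lines of Python. This is only an illustration: the function name `die_pmf` is a hypothetical choice, and exact fractions are used so the sums come out without rounding error.

```python
from fractions import Fraction

def die_pmf(x):
    """PMF of a fair six-sided die: 1/6 on {1, ..., 6}, 0 elsewhere."""
    return Fraction(1, 6) if x in {1, 2, 3, 4, 5, 6} else Fraction(0)

# The probabilities over all outcomes sum to exactly 1.
total = sum(die_pmf(x) for x in range(1, 7))

# The probability of an event is the sum of the PMF over its outcomes,
# here the event "roll an even number".
p_even = sum(die_pmf(x) for x in (2, 4, 6))
```

Here `total` is 1 and `p_even` is 1/2, which matches adding three outcomes of probability 1/6 each.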

2. PMF of a Biased Coin

A biased coin has a 70% chance of landing on heads (H) and a 30% chance of landing on tails (T).

$$
P(X = x) = \begin{cases} 0.7 & x = \text{H} \\ 0.3 & x = \text{T} \\ 0 & \text{otherwise} \end{cases}
$$
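
One way to see what the coin PMF means in practice is to simulate many flips and compare the observed frequency of heads with 0.7. The sketch below uses only Python's standard `random` module; the names `coin_pmf` and `sample_coin`, the seed, and the number of flips are illustrative assumptions.

```python
import random

# PMF of the biased coin as a mapping from outcome to probability.
coin_pmf = {"H": 0.7, "T": 0.3}

def sample_coin(n, seed=0):
    """Draw n independent flips according to coin_pmf."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcomes = list(coin_pmf)
    weights = [coin_pmf[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)

flips = sample_coin(100_000)

# The empirical frequency of heads should be close to P(X = H) = 0.7.
freq_heads = flips.count("H") / len(flips)
```

With this many flips, `freq_heads` should land close to 0.7, though the exact value depends on the seed.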

Probability Density Functions (PDFs)

For continuous random variables, we use Probability Density Functions (PDFs) instead. A continuous variable can take infinitely many values, so the probability of any single exact value is 0, and the PDF describes how probability is spread over ranges of values. The value of the PDF at a point is a density, not a probability; the probability that the variable falls within a range is the area under the curve over that range, and the total area under the curve is 1. Understanding the shape and behavior of PDFs is crucial for analyzing continuous probability distributions.

Uniform Distribution Example:

Consider the time you wait for a bus at a bus stop, assuming buses arrive at regular intervals. If buses arrive every 10 minutes and you arrive at the stop at a random time, the waiting time $X$ can be modeled as uniformly distributed between 0 and 10 minutes.

$$
f(x) = \begin{cases} \frac{1}{10} & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}
$$
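
To illustrate that probabilities come from the area under the PDF, the following sketch approximates that area numerically with the midpoint rule. The function names `uniform_pdf` and `prob_between` and the step count are hypothetical choices made for this example.

```python
def uniform_pdf(x, a=0.0, b=10.0):
    """Density of a uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def prob_between(lo, hi, pdf=uniform_pdf, steps=10_000):
    """Approximate P(lo <= X <= hi) as the area under pdf (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

# Probability of waiting between 2 and 5 minutes: area = 3 * (1/10).
p_wait_2_to_5 = prob_between(2, 5)

# The total area under the density is 1.
p_total = prob_between(0, 10)
```

Here `p_wait_2_to_5` is approximately 0.3 and `p_total` approximately 1. Note that a density is not itself a probability: `uniform_pdf(0.25, a=0.0, b=0.5)` returns 2.0, which would be impossible for a probability but is a valid density.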

Cumulative Distribution Function (CDF)

Both PMFs and PDFs are complemented by the Cumulative Distribution Function (CDF). The CDF, $F(x) = P(X \le x)$, gives the probability that the variable takes a value less than or equal to $x$. For a discrete variable it sums the PMF over all outcomes up to $x$; for a continuous variable it is the area under the PDF up to $x$. It therefore never decreases, starting at zero and approaching one as $x$ increases. The CDF is useful for calculating the probability of a range, since $P(a < X \le b) = F(b) - F(a)$, and for making statistical inferences. Together, PMFs, PDFs, and CDFs let analysts model and analyze data in many fields.
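
For a discrete variable, accumulating probabilities amounts to taking a running total of the PMF. As a minimal sketch, here is the CDF of the fair die from the PMF section, built with `itertools.accumulate`; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6  # fair die: each outcome has probability 1/6

# F(x) = P(X <= x) is the running total of the PMF up to x.
cdf = dict(zip(outcomes, accumulate(pmf)))

# Probability of rolling at most 3, and of rolling at most 6.
p_at_most_3 = cdf[3]
p_at_most_6 = cdf[6]
```

Here `p_at_most_3` is 1/2 and `p_at_most_6` is 1, so the CDF reaches one at the largest outcome.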

CDF of a Uniform Distribution

Example: the time you wait for a bus when buses arrive every 10 minutes and you reach the stop at a uniformly random time.

The CDF of a uniform distribution over $[a, b]$ is:

$$
F(x) = \begin{cases} 0 & x < a \\ \frac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}
$$

For $a = 0$ and $b = 10$:

$$
F(x) = \begin{cases} 0 & x < 0 \\ \frac{x}{10} & 0 \le x \le 10 \\ 1 & x > 10 \end{cases}
$$
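
The piecewise CDF translates directly into code. The sketch below assumes the bus example with $a = 0$ and $b = 10$ as defaults; the function name `uniform_cdf` is a hypothetical choice.

```python
def uniform_cdf(x, a=0.0, b=10.0):
    """CDF of a uniform distribution on [a, b]: P(X <= x)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# Probability of waiting at most 4 minutes.
p_at_most_4 = uniform_cdf(4)

# Probability of waiting between 2 and 5 minutes, as a difference of CDF values.
p_between_2_and_5 = uniform_cdf(5) - uniform_cdf(2)
```

Here `p_at_most_4` is 0.4 and `p_between_2_and_5` is about 0.3, the same value obtained from the area under the PDF between 2 and 5.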


Last updated 10 months ago