ABI Bioinformatics Guide 2024

Exploratory Statistics: Mean, Median, Quantiles, Variance/SD

Last updated 11 months ago

Exploratory statistics is the first stage of data analysis: summarizing a dataset to understand its essential characteristics. It uses descriptive statistics to describe the distribution, central tendency, and variability of the data. Key measures include the mean, median, mode, quantiles, variance, and standard deviation, each of which captures a different aspect of the dataset.

It's important to note that these measures differ from the parameters of a random variable's theoretical distribution, such as the population mean or variance. Instead, they are derived from observed data and offer a practical summary of the sample at hand.

Mean

The mean, or average, is the most commonly used measure of central tendency in exploratory statistics. It is the sum of all data values divided by the number of observations, and it gives a single numerical summary of the data's central location: the typical value around which the observations tend to cluster. Because every value contributes to the sum, the mean is sensitive to extreme values (outliers), but it is simple to interpret and is used in many analytical contexts.
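The definition above can be sketched in a few lines of Python. This is a minimal illustration using the standard-library `statistics` module; the values are made up (they could stand for the expression of one gene across six samples) and are not taken from the guide.

```python
from statistics import mean

# Hypothetical measurements, invented for illustration
values = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0]

# Mean by definition: sum of the values divided by the number of observations
mean_by_hand = sum(values) / len(values)

# The same quantity from the standard library
mean_stdlib = mean(values)

# A single extreme value pulls the mean far away from the bulk of the data
with_outlier = values + [100.0]
mean_with_outlier = mean(with_outlier)

print(mean_by_hand)       # 4.0
print(mean_stdlib)        # 4.0
print(mean_with_outlier)  # about 17.71
```

Adding one outlier moves the mean from 4.0 to roughly 17.7, a value larger than six of the seven observations.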

Median

The median is the middle value of a dataset once the values are sorted; with an even number of observations, it is conventionally taken as the average of the two middle values. Unlike the mean, it is robust to outliers, because it depends on the ordering of the values rather than on their magnitudes. The median is particularly useful when the data are skewed or when outliers may distort the mean, and it gives a better representation of the typical value when the data are not symmetrically distributed.
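A short sketch of this robustness, again assuming Python's standard `statistics` module and made-up values:

```python
from statistics import mean, median

# Hypothetical measurements, invented for illustration
values = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0]
with_outlier = values + [100.0]

# Even number of observations: average of the two middle values (3.0 and 4.0)
median_plain = median(values)

# Odd number of observations: the single middle value of the sorted data
median_outlier = median(with_outlier)

# The mean shifts strongly with the outlier, the median barely moves
mean_shift = mean(with_outlier) - mean(values)
median_shift = median_outlier - median_plain

print(median_plain)    # 3.5
print(median_outlier)  # 4.0
print(mean_shift, median_shift)
```

Here the outlier shifts the median by 0.5, while the mean shifts by more than 13.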

Mode

For a discrete random variable, the mode is the value at which the probability mass function (PMF) is highest. For a continuous random variable, it is the location of the peak of the probability density function (PDF). In both cases the mode is a property of the theoretical distribution and can be determined from the distribution itself, without needing a sample. In a sample (a set of observed data points), the mode is the most frequently occurring value. A sample can have more than one mode when several values are tied for the highest frequency.
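The sample mode can be illustrated with the standard-library functions `mode` and `multimode` (the latter is available from Python 3.8). The genotype labels and read lengths below are hypothetical examples.

```python
from statistics import mode, multimode

# Hypothetical genotype calls at one site across six individuals
genotypes = ["AA", "AG", "AG", "GG", "AG", "AA"]

# The mode works for categorical data too: "AG" occurs three times
most_common_genotype = mode(genotypes)

# Hypothetical read lengths with a tie: 150 and 151 both occur twice
read_lengths = [150, 150, 151, 151, 149]

# multimode returns every value tied for the highest frequency
all_modes = multimode(read_lengths)

print(most_common_genotype)  # AG
print(all_modes)             # [150, 151]
```

Unlike the mean and median, the mode is defined for categorical data, which is why it applies to the genotype labels here.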

Quantiles

Quantiles are cut points that divide an ordered dataset into equal-sized portions. Common examples include quartiles (dividing the data into four parts) and percentiles (dividing the data into a hundred parts); the median is the second quartile, or 50th percentile. Quantiles describe the spread of the data and make it easier to compare distributions. They are particularly useful for assessing the relative position of an individual observation within a dataset and for identifying potential outliers or extreme values.
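As a sketch, quartiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data are made up, and the interquartile range shown is a common derived measure of spread rather than something defined in this guide. Note that several interpolation conventions exist, so different methods and tools can return slightly different cut points for the same data.

```python
from statistics import quantiles

# Hypothetical ordered measurements, invented for illustration
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for quartiles: three cut points splitting the data into four parts.
# The default method is "exclusive".
q_exclusive = quantiles(data, n=4)

# The "inclusive" method treats the data as containing the population min and max
q_inclusive = quantiles(data, n=4, method="inclusive")

# Interquartile range: distance between the first and third quartiles
q1, q2, q3 = q_exclusive
iqr = q3 - q1

print(q_exclusive)  # [2.5, 5.0, 7.5]
print(q_inclusive)  # [3.0, 5.0, 7.0]
print(iqr)          # 5.0
```

Both methods agree on the median (5.0) but differ on the first and third quartiles, so it is worth stating which convention was used when reporting quantiles.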

Variance and Standard Deviation

Variance and standard deviation quantify the spread or dispersion of data points around the mean. Variance is the average squared deviation of the data points from the mean. When it is estimated from a sample, the sum of squared deviations is usually divided by n - 1 rather than n, which corrects for the fact that the sample mean is itself estimated from the same data. Standard deviation, the square root of the variance, is easier to interpret because it is expressed in the same units as the original data. Together, the two measures describe how tightly the observations cluster around the mean.
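The following sketch works through these definitions with made-up values, assuming Python's standard `statistics` module. It shows the divide-by-n (population) and divide-by-(n - 1) (sample) versions side by side.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical measurements, invented for illustration (their mean is 5.0)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
center = sum(values) / n

# Sum of squared deviations from the mean: 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32
sum_sq = sum((x - center) ** 2 for x in values)

# Population variance divides by n, sample variance divides by n - 1
pop_var_by_hand = sum_sq / n
sample_var_by_hand = sum_sq / (n - 1)

# Standard deviation is the square root of the variance
pop_sd_by_hand = sqrt(pop_var_by_hand)

# The standard library provides both versions
pop_var = pvariance(values)
sample_var = variance(values)
pop_sd = pstdev(values)
sample_sd = stdev(values)

print(pop_var_by_hand, pop_sd_by_hand)  # 4.0 2.0
print(sample_var, sample_sd)            # about 4.57 and 2.14
```

The sample versions are slightly larger than the population versions, and the difference shrinks as the number of observations grows.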