Exploratory Statistics: Mean, Median, Quantiles, Variance/SD
Exploratory statistics serves as the initial stage of data analysis, focusing on summarizing and understanding the essential characteristics of a dataset. It involves employing various descriptive statistics to gain insights into the distribution, central tendency, and variability of the data. Key measures include the mean, median, quantiles, variance, and standard deviation, each providing unique perspectives on the dataset.
It's important to note that these measures differ from the parameters of a random variable's theoretical distribution, such as the population mean or variance. Instead, they are derived from observed data and offer a practical summary of the sample at hand.
The mean, or average, is perhaps the most commonly used measure of central tendency in exploratory statistics. It is the sum of all data values divided by the number of observations. The mean gives a single numerical summary of the data's central location, indicating the value around which the observations tend to cluster. Although it is sensitive to extreme values (outliers), the mean is easy to interpret and is used in a wide range of analytical contexts.
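As a minimal sketch of the definition above, the following Python snippet computes the mean by hand and with the standard library's `statistics.mean`, then shows how a single extreme value shifts it. The values are made up for illustration and do not come from any real dataset.

```python
from statistics import mean

# Hypothetical observations, chosen only for illustration.
values = [2, 4, 4, 5, 7]

# Mean by definition: sum of the values divided by the number of observations.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = mean(values)

# Adding one extreme value pulls the mean far away from the bulk of the data.
mean_with_outlier = mean(values + [100])

print(manual_mean, library_mean, mean_with_outlier)
```

Here the mean moves from 4.4 to roughly 20.3 after one outlier is added, even though five of the six observations are 7 or smaller.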
Unlike the mean, which is influenced by extreme values, the median is the middle value of a dataset once it has been sorted. When the number of observations is even, it is conventionally taken as the average of the two middle values. Because it depends only on the ordering of the data, the median is robust to outliers. This makes it particularly useful when the data are skewed or when there are concerns about the influence of outliers on the mean, since it gives a more reliable picture of the typical value when the data are not symmetrically distributed.
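To illustrate this robustness, the sketch below compares the median and the mean on a small made-up sample, before and after one extreme value is added. It uses only `statistics.median` and `statistics.mean` from the Python standard library; the data are hypothetical.

```python
from statistics import mean, median

# Hypothetical observations, chosen only for illustration.
values = [2, 4, 4, 5, 7]
with_outlier = values + [100]

# Odd number of observations: the median is the single middle value.
median_plain = median(values)

# Even number of observations: the median is the average of the two middle values.
median_outlier = median(with_outlier)

# The mean, by contrast, shifts substantially.
mean_plain = mean(values)
mean_outlier = mean(with_outlier)

print(median_plain, median_outlier)
print(mean_plain, mean_outlier)
```

The median moves only from 4 to 4.5, while the mean jumps from 4.4 to about 20.3.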
The mode is the most probable or most frequent value. For a discrete random variable, it is the value at which the probability mass function (PMF) is largest; for a continuous random variable, it is the point at which the probability density function (PDF) peaks. This theoretical mode is determined by the distribution itself and does not require a sample. In a sample (a set of observed data points), the mode is the most frequently occurring value and is computed directly from the data. A sample can have more than one mode when several values tie for the highest frequency.
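The two notions of mode can be sketched side by side. In the snippet below, `pmf` is a hypothetical probability mass function written out as a dictionary, and `sample` is a made-up list of observations; `statistics.mode` and `statistics.multimode` are standard-library functions.

```python
from statistics import mode, multimode

# Hypothetical PMF of a discrete random variable: value -> probability.
pmf = {0: 0.1, 1: 0.6, 2: 0.3}

# Theoretical mode: the value with the largest probability mass.
theoretical_mode = max(pmf, key=pmf.get)

# Hypothetical sample: the mode is the most frequently occurring value.
sample = [1, 2, 2, 3, 3, 3]
sample_mode = mode(sample)

# When several values tie, multimode returns all of them.
tied_modes = multimode([1, 1, 2, 2, 3])

print(theoretical_mode, sample_mode, tied_modes)
```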
Quantiles are cut points that divide the sorted data into portions containing equal shares of the observations, giving insight into how the data are distributed. Common examples include quartiles (which divide the data into four parts) and percentiles (which divide the data into one hundred parts); the median is the second quartile, or 50th percentile. Quantiles help describe the spread and variability of the data and make it easier to compare distributions. They are particularly useful for assessing the relative position of an individual observation within a dataset and for identifying potential outliers or extreme values.
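As a rough sketch, quartiles can be computed with `statistics.quantiles` from the Python standard library (available from Python 3.8). Several interpolation conventions for quantiles exist, so other tools may return slightly different values for the same data; `method="inclusive"` is chosen here only as one option. The outlier check uses the 1.5 × IQR fences, a common convention that is assumed here for illustration, and the data are made up.

```python
from statistics import quantiles

# Hypothetical observations (unsorted on purpose; quantiles() sorts internally).
data = [7, 1, 50, 3, 5, 2, 8, 4, 6]

# n=4 asks for quartiles: the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

With these values the quartiles are 3, 5, and 7, and the observation 50 is the only one that falls outside the fences.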
Variance and standard deviation quantify the spread or dispersion of data points around the mean. Variance is the average squared deviation of the data points from the mean; when it is computed from a sample, the sum of squared deviations is conventionally divided by n - 1 rather than n. Standard deviation, the square root of the variance, is easier to interpret because it expresses the spread in the same units as the original data. Together, these two measures describe how tightly or loosely the observations are scattered around the mean, which makes them fundamental in exploratory statistics.
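The definitions above can be checked with a short sketch. It computes the variance by hand as the average squared deviation, then compares it with the standard library's `statistics.pvariance` and `statistics.pstdev` (which divide by n) and `statistics.variance` and `statistics.stdev` (which divide by n - 1). The data are hypothetical.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mu = sum(data) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mu) ** 2 for x in data)

# Dividing by n gives the average squared deviation (population form).
manual_pvariance = squared_deviations / n

# Dividing by n - 1 gives the usual sample variance.
manual_variance = squared_deviations / (n - 1)

# Standard deviation: square root of the variance, in the data's own units.
population_sd = pstdev(data)
sample_sd = stdev(data)

print(manual_pvariance, pvariance(data), population_sd)
print(manual_variance, variance(data), sample_sd)
```

For this sample the mean is 5, the population variance is 4, and the population standard deviation is 2; the sample versions are slightly larger because of the smaller divisor.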