Statistics
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data. In the context of probability and statistics, it is primarily concerned with making predictions or inferences about the entire population based on a sample of data. Statistics is used in every field from business to science to make informed decisions.
What is data?
Data refers to information collected for reference or analysis. In statistics, data is essential because it represents the observations or measurements one is interested in. Data can be classified into:
- Qualitative data: These are descriptive items such as names, labels, or categories. Common examples are colors, city names, and types of music.
- Quantitative data: This type of data refers to numerical information that can be measured or counted. Examples include height, weight, and temperature.
Types of statistics
In statistics, there are two main types:
- Descriptive Statistics: This involves summarizing and presenting data. Measures such as mean, median, mode, and standard deviation are used to provide an overview of the main characteristics of the dataset.
- Inferential Statistics: This is about making generalizations or predictions about the population based on sample data. It includes techniques such as hypothesis testing, confidence intervals, and regression analysis.
Descriptive statistics
Mean
The mean is the average of a set of values. It is calculated by adding up all the values and then dividing by how many values there are.
Mean = (Sum of all data points) / (Number of data points)
For example, if you have a dataset of numbers: 2, 4, 6, 8, and 10, the mean would be:
Mean = (2 + 4 + 6 + 8 + 10) / 5 = 6
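The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the variable names are chosen here for readability.

```python
# Dataset from the example above.
data = [2, 4, 6, 8, 10]

# Mean = (sum of all data points) / (number of data points)
mean = sum(data) / len(data)

print(mean)  # 6.0
```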
Median
The median is the middle value when the data points are arranged in ascending order. If the number of observations is even, the median is the average of the two middle numbers.
For example, in the dataset: 3, 5, 7, 9, the median is:
Median = (5 + 7) / 2 = 6
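A minimal sketch of the median rule, covering both the odd and the even case. The function name `median` is an illustrative choice; Python's standard `statistics.median` performs the same calculation.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # the rule assumes ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 5, 7, 9]))  # 6.0
print(median([9, 3, 7]))     # 7
```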
Mode
The mode is the number that appears most often in the dataset. A dataset may have one mode, more than one mode, or no mode.
For example, the dataset: 3, 3, 5, 7, 9 has the mode:
Mode = 3
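The three cases (one mode, several modes, no mode) can be sketched as follows. The function name `modes` is illustrative, and the code assumes one common convention: when every value appears exactly once, the dataset is treated as having no mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: no value repeats, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 3, 5, 7, 9]))  # [3]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
print(modes([1, 2, 3]))        # []
```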
Standard deviation
Standard deviation is a measure of the amount of variation or dispersion in a dataset. A low standard deviation means that the data points are close to the mean, while a high standard deviation means that the data points are spread across a larger range.
Standard Deviation (σ) = √[(Σ(Xi - μ)²) / N]
Here, Σ denotes the sum, Xi are the individual data points, μ is the mean of the data, and N is the number of data points.
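As a worked sketch, the formula can be applied step by step to the dataset 2, 4, 6, 8, 10 used earlier for the mean. Because the formula divides by N, it is the population form of the standard deviation; the standard library function `statistics.pstdev` computes the same quantity and is used here only as a cross-check.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 6, 8, 10]

n = len(data)
mu = sum(data) / n                                 # mean of the data
squared_deviations = [(x - mu) ** 2 for x in data]
sigma = sqrt(sum(squared_deviations) / n)          # divide by N, then take the root

print(round(sigma, 4))           # 2.8284
print(round(pstdev(data), 4))    # same value from the standard library
```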
Inferential statistics
Inferential statistics uses sample data to make inferences, decisions, predictions, or other generalizations about a population. Some of the primary methods include:
Hypothesis testing
Hypothesis testing is a statistical method for deciding whether sample data provide enough evidence to reject a hypothesis about a population. The null hypothesis represents the status quo, and the alternative hypothesis represents the claim we are looking for evidence to support. The main components are:
- Null hypothesis (H0): States that there is no significant difference or effect.
- Alternative hypothesis (H1): States that there is a significant difference or effect.
- p-value: The probability of observing results at least as extreme as the sample results, assuming the null hypothesis is true. A small p-value (conventionally below 0.05) is taken as evidence against the null hypothesis.
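To make these components concrete, here is a minimal sketch of one simple test: an exact two-sided binomial test of whether a coin is fair (H0: the probability of heads is 0.5). The scenario of 60 heads in 100 flips is hypothetical, and the function name and the 0.05 threshold are illustrative choices rather than part of any particular library.

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Two-sided p-value for H0: success probability = 0.5."""
    # Under H0 the distribution is symmetric around trials / 2, so the
    # two-sided p-value is twice the tail at least as extreme as observed.
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Hypothetical observation: 60 heads in 100 flips.
p_value = binomial_two_sided_p(60, 100)
alpha = 0.05  # conventional significance threshold

print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For this hypothetical sample the p-value comes out slightly above 0.05, so the test does not reject H0 at that threshold even though 60 heads may look lopsided.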
Confidence interval
A confidence interval is a range of values that is used to estimate the true value of a population parameter. It is an interval estimate that gives us a range where we expect the true parameter to be.
Confidence Interval = Statistic ± Margin of Error
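A 95% confidence level describes the procedure rather than any single interval: if the sampling were repeated many times and an interval calculated from each sample, about 95% of those intervals would contain the true population parameter.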
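The "Statistic ± Margin of Error" form can be sketched for a sample mean as follows. The sample values are hypothetical, and the margin of error uses a normal critical value as a simplifying assumption; for a sample this small, a t-distribution critical value would usually be preferred and would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of ten measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

confidence = 0.95
# Critical value that leaves (1 - confidence) / 2 in each tail (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

estimate = mean(sample)                              # the statistic
margin = z * stdev(sample) / sqrt(len(sample))       # the margin of error
lower, upper = estimate - margin, estimate + margin

print(f"{estimate:.3f} ± {margin:.3f}  ->  ({lower:.3f}, {upper:.3f})")
```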
Regression analysis
Regression analysis is a statistical method for examining the relationship between two or more variables. It allows us to understand how the typical value of a dependent variable changes when one of the independent variables is varied. The simplest case, linear regression with one independent variable, fits a straight line:
y = mx + c
Here, y denotes the dependent variable, m is the slope, x is the independent variable, and c is the y-intercept.
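One common way to choose m and c is ordinary least squares, which picks the line that minimizes the sum of squared vertical distances to the data points. The sketch below assumes that method; the data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    c = y_bar - m * x_bar     # the fitted line passes through (x_bar, y_bar)
    return m, c

# Hypothetical data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")
```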
Applications of statistics
The use of statistics is widespread. Some of the major areas where it plays a vital role include:
- Business: Decision making, market analysis, and financial performance measurement rely heavily on statistical methods such as trend analysis and regression analysis.
- Health care: Predicting disease outbreaks, evaluating treatment effectiveness, and analyzing patient demographics.
- Science and research: Analysis of experimental data, survey sample analysis, and research study design.
- Sports: Performance analysis, improving sports tactics and training techniques.
Conclusion
Statistics is a powerful tool used to extract information from data. Its vast range of techniques and methods is useful in a wide variety of fields as it helps summarize, analyze, and draw conclusions about information. The basics of statistics, such as measures of central tendency, dispersion, and inferential techniques, prepare individuals to understand the story of data and make informed decisions.