Numerical and Categorical Variables
A variable is a measurable characteristic of a given unit/observation.
What is a Random Variable
A random variable is a variable defined in the context of a random experiment.
Numerical Variable
A variable that has a number associated with the characteristic.
Examples:
- Weekly earnings of an individual
- Daily facility production
- Annual GDP of a country
Literally anything quantifiable.
Three types of Numerical Variable:
Discrete Variable
A variable whose possible values can be counted (listed one by one), even if there are infinitely many of them. Think of countability: integer-valued quantities are the typical case.
Examples:
- Number of children in a household
- Number of patents given to a firm in a year
- Number of states with a Republican governor
Probability Mass Function (PMF)
A probability mass function assigns probabilities to the different values that a discrete random variable can take.
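It can be given as a table or as a function.
In general, the probability that a random variable $X$ takes a value $x$ is written as $P(X = x)$, $p_X(x)$, or $p(x)$.
A random variable is always written in upper case, and a specific numerical value it can take is always written in lower case.
Properties of PMFs
- $0 \le P(X = x) \le 1$ for all values of $x$
- $\sum_{x} P(X = x) = 1$, where the sum runs over all possible values of $X$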
Example: PMF as a Table
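$X$ = number of heads in 2 fair coin tosses
- $X$ can take values in $\{0, 1, 2\}$
- $P(X = 0) = P(TT) = 1/4$
- $P(X = 1) = P(HT) + P(TH) = 1/2$
- $P(X = 2) = P(HH) = 1/4$
$x$ | 0 | 1 | 2 |
---|---|---|---|
$P(X = x)$ | 1/4 | 1/2 | 1/4 |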
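The two-coin-toss PMF can be checked by brute-force enumeration. The following is a minimal sketch, assuming the four outcomes HH, HT, TH, TT are equally likely; the variable names are illustrative and not part of the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 4 equally likely outcomes of two fair coin tosses.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes produce each number of heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# PMF: (number of favorable outcomes) / (total number of outcomes).
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")

# Sanity checks: each probability lies in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1
```

Using `Fraction` keeps the probabilities exact, so the sum-to-one check does not depend on floating-point rounding.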
Example: PMF as a Function
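$X$ = the number of rolls of a fair die until you get the first 6
$X$ can take what values? Any positive integer: $\{1, 2, 3, \dots\}$
PMF is $P(X = x) = (1 - p)^{x - 1}\, p$, where $x$ is the total number of rolls and $p = 1/6$ is the probability of getting a 6 on any single roll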
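As a rough illustration of a PMF given as a function, the sketch below assumes the standard geometric form $P(X = x) = (1 - p)^{x - 1} p$ with $p = 1/6$: the first $x - 1$ rolls miss and roll $x$ hits. The function name `geometric_pmf` is hypothetical.

```python
def geometric_pmf(x: int, p: float = 1 / 6) -> float:
    """P(X = x): the first 6 appears on roll x, with per-roll success probability p."""
    if x < 1:
        return 0.0  # at least one roll is always needed
    return (1 - p) ** (x - 1) * p

# Probabilities for the first few values of x.
for x in range(1, 6):
    print(f"P(X = {x}) = {geometric_pmf(x):.4f}")

# X has infinitely many possible values, but the probabilities still sum to 1;
# a long partial sum gets very close.
partial_sum = sum(geometric_pmf(x) for x in range(1, 201))
print(f"Sum over x = 1..200: {partial_sum:.6f}")
```

This is a case where a table would not work, because the list of possible values never ends.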
Mean and Variance of a Discrete Random Variable
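Recall the sample mean formula for a discrete variable in terms of sample proportions:
$$\bar{x} = \sum_{x} x \, \hat{p}(x)$$
Where $x$ is a possible outcome, and $\hat{p}(x)$ is the sample proportion of observations equal to that outcome.
When you repeat an experiment many, many times, the sample proportions converge to the true probabilities: $\hat{p}(x) \to P(X = x)$.
As a result, the population mean has the same form as the sample mean, with the true probabilities in place of the sample proportions. In any finite sample the two are generally not the same.
Population Mean:
$$\mu = E[X] = \sum_{x} x \, P(X = x)$$
The expected value of $X$ is $E[X] = \mu$, which is the sum of all possible values weighted by their probabilities.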
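The convergence described above can be seen in a small simulation. This is only a sketch, assuming the two-fair-coin-toss experiment with probabilities 1/4, 1/2, 1/4 for 0, 1, 2 heads; the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers

def toss_two_coins() -> int:
    """Number of heads in two fair coin tosses."""
    return sum(random.random() < 0.5 for _ in range(2))

# Population mean from the PMF: sum of value * probability.
population_mean = 0 * 0.25 + 1 * 0.5 + 2 * 0.25

# The sample mean drifts toward the population mean as n grows.
for n in (10, 1_000, 100_000):
    sample = [toss_two_coins() for _ in range(n)]
    sample_mean = sum(sample) / n
    print(f"n = {n:>7}: sample mean = {sample_mean:.4f} (population mean = {population_mean})")
```

Small samples can land noticeably away from the population mean; the gap tends to shrink as the number of repetitions increases.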
Variance:
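Recall the sample variance formula for a discrete variable in terms of sample proportions:
$$s^2 \approx \sum_{x} (x - \bar{x})^2 \, \hat{p}(x)$$
Here $\bar{x}$ is the sample mean and $\hat{p}(x)$ is the sample proportion of outcome $x$. The approximation drops the factor $n/(n-1)$, which is close to 1 in large samples.
When you do this experiment many, many times, you approach the true population variance.
As such, population variance is:
$$\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2] = \sum_{x} (x - \mu)^2 \, P(X = x)$$
where $\mu = E[X]$ is the population mean.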
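As a worked example, the population mean and variance can be computed directly from a PMF. The sketch below assumes the two-coin-toss PMF (1/4, 1/2, 1/4 for 0, 1, 2 heads) and applies the probability-weighted sums.

```python
from fractions import Fraction

# PMF of X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Population mean: sum of x * P(X = x).
mean = sum(x * p for x, p in pmf.items())

# Population variance: sum of (x - mean)^2 * P(X = x).
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(f"E[X]   = {mean}")
print(f"Var(X) = {variance}")
```

For this PMF the mean works out to 1 and the variance to 1/2.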
Continuous Variable
A variable that can take on any value in some interval or intervals of the real line, possibly the entire line.
Examples:
- Monthly rainfall in Austin
- Annual GDP of the U.S.
- Daily stock return of IBM
Approximately Continuous Variable
A discrete variable that can be treated and modeled as continuous, which is the case when the following two conditions are met:
- The unit of measurement is small compared to the typical values of the variable
- The number of values in the data set is large, with relatively few repeats
Examples:
- Weekly earnings of an individual
- Number of employees in a firm
- Credit score for an individual
Categorical Variables
A variable that has two or more categories, but no obvious numerical measure associated with the characteristic.
Categorical variables may be ordered if there is some natural ordering of the choices (think of health status: excellent, okay, bad, terrible). They may also be unordered if there is no natural ordering to the choices (think of labor force status).
Examples:
- Labor force status of an individual
- Health rating of an individual
- State identifier
Binary Variables (Bernoulli Variables)
A special case of categorical variables with only two categories.
Examples:
- Married? (yes/no)
- Homeowner? (yes/no)
- Vaccinated? (yes/no)
Coding categorical variables (indicator variables)
We can code a categorical variable numerically by assigning a value to each of its categories.
This is done with indicator variables: each one equals 1 if the observation falls in a given category and 0 otherwise.
Ex: Home Ownership
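The answer is either yes or no, so we can define the variable owner to be:
$$\text{owner} = \begin{cases} 1 & \text{if the individual owns a home} \\ 0 & \text{otherwise} \end{cases}$$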
Ex: Labor Force Status
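We have more than two options (employed, unemployed, not in labor force), so we define one indicator per category:
$$\text{employed} = \begin{cases} 1 & \text{if employed} \\ 0 & \text{otherwise} \end{cases} \qquad \text{unemployed} = \begin{cases} 1 & \text{if unemployed} \\ 0 & \text{otherwise} \end{cases} \qquad \text{notinlf} = \begin{cases} 1 & \text{if not in labor force} \\ 0 & \text{otherwise} \end{cases}$$
Note that we end up needing only 2 of these variables: exactly one of the three equals 1 for every individual, so the third is implied by the other two (for example, notinlf = 1 whenever employed = 0 and unemployed = 0).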
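To make the "only 2 variables needed" point concrete, here is a minimal sketch that codes labor force status with two indicators and then recovers the original category from them. The sample data and the names `to_indicators` and `decode` are hypothetical, and the choice of "not in labor force" as the omitted category is an assumption.

```python
# Hypothetical sample of labor force statuses.
statuses = ["employed", "unemployed", "not in labor force", "employed"]

def to_indicators(status: str) -> dict[str, int]:
    """Code a status with two 0/1 indicators; the third category is left out."""
    return {
        "employed": int(status == "employed"),
        "unemployed": int(status == "unemployed"),
    }

def decode(row: dict[str, int]) -> str:
    """Recover the category; both indicators at 0 means the omitted category."""
    if row["employed"] == 1:
        return "employed"
    if row["unemployed"] == 1:
        return "unemployed"
    return "not in labor force"

coded = [to_indicators(s) for s in statuses]
for status, row in zip(statuses, coded):
    print(f"{status:<20} {row}")

# No information is lost by dropping the third indicator.
assert [decode(row) for row in coded] == statuses
```

The same idea extends to any categorical variable: with k categories, k - 1 indicators are enough to identify every category.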
Continuous Random Variables
A continuous random variable is a random variable that can take on any value on some interval, or intervals, of the real line, including perhaps the entire real line. The probability of any specific outcome occurring is equal to zero. We say that a continuous random variable has an uncountable set of possible outcomes.
We think of probabilities in terms of intervals instead of in terms of specific values.
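A quick simulation can illustrate why intervals are the natural unit. The sketch below assumes a variable that is uniform on [0, 1), generated with Python's `random.random()`; the interval endpoints and seed are arbitrary. Computer draws have finite precision, so this only approximates a truly continuous variable.

```python
import random

random.seed(1)  # fixed seed so repeated runs give the same numbers

n = 100_000
draws = [random.random() for _ in range(n)]  # uniform draws on [0, 1)

# Share of draws hitting one exact value: essentially never happens.
share_exact = sum(d == 0.5 for d in draws) / n

# Share of draws inside an interval: roughly the interval's length (0.3 here).
a, b = 0.2, 0.5
share_interval = sum(a <= d <= b for d in draws) / n

print(f"Share of draws equal to 0.5:   {share_exact}")
print(f"Share of draws in [{a}, {b}]: {share_interval:.4f}")
```

For this uniform case the interval probability is just the interval's length; other distributions weight different parts of the line differently.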
Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of Continuous Random Variables
The Probability Density Function (PDF) of a continuous random variable $X$ is a function $f(x)$ such that for any two numbers $a$ and $b$ with $a \le b$:
$$P(a \le X \le b) = \int_a^b f(x)\, dx$$
Properties of the pdf $f(x)$:
- $f(x) \ge 0$ for all $x$