**Quick’n’Dirty
Regression Tutorial**

The statistical procedure known as linear regression basically involves drawing and analyzing trend-lines through data points. Economists use regression analysis to test hypotheses, derived from economic theory, against real-world data.

In your first microeconomics class you saw theoretical demand schedules (Figure 1) showing that if price increases, the quantity demanded ought to decrease. But when we collect market data to actually test this theory, we find that the data are “noisy” (Figure 2).

To analyze the empirical relationship between price and
quantity, download and open the
**Excel spreadsheet with the data**.
Right-click on the spreadsheet chart to open a chart window, and print off
a full-page copy of the chart (same as the one shown in Figure 2).
Using a pencil and
straightedge, “eyeball” and then draw a straight line through the cloud of
points that best fits the overall trend. Extend this line to both axes.
Now calculate the values of intercept A and slope B of the linear equation
that represents the trend-line
**Price = A + B*Quantity**

Although it is standard practice to graph supply and demand
with Price on the Y-axis and Quantity on the X-axis,
economists more often consider demand Quantity to be
the "dependent" variable influenced by the "independent" variable Price.
To obtain a more conventional demand equation, invert your equation,
solving for intercept and slope coefficients
*a* and *b*, where
* Quantity = a + b*Price*.
Technically, since this "empirical" (i.e., data-derived) demand model
doesn't fit through the data points exactly, it
ought to be written as

That's what linear regression is about: fitting trend lines through data to analyze relationships between variables. Since doing it by hand is imprecise and tedious, most economists and statisticians prefer to...

**Let the computer do it**

MS-Excel provides two methods for
fitting the best-fitting trend-line through data points, and calculating
that line's slope and intercept coefficients.
The standard criterion for "best fit" is the
trend line that
minimizes the sum of the squared deviations of the data points from
the fitted line. This is called the *ordinary least-squares* (OLS)
regression line. (If you got a bunch of people to fit regression lines by hand
and averaged their results, you would get something very close to the OLS line.)

The easiest way to plot a trend line and calculate a single-variable regression equation is to right-click on the data points in an Excel XY plot and select “Add Trendline." Under the "Options" tab check "“Display eqation on chart" and "OK." How well do this trend line and calculated slope and intercept coefficients match the line you drew and the slope and intercept you calculated?

Excel also includes a formal regression utility in its
“Analysis ToolPak” that provides statistics indicating goodness-of-fit and
confidence intervals for slope and intercept coefficients.
This utility lets you regress one dependent
"left-hand-side" (of the equal sign) variable against
one *or several* independent "right-hand side" variables, and it
provides useful
indicators about the statistical reliability of your model.
To use Excel's Regression utility,
check "Tools--Add-Ins"
to make sure
the Analysis ToolPak is loaded.
Then under
"Tools--Data Analysis"
select "Regression."

The only things
you are required to specify are...

(a) one column of numbers as the Y Range, *aka* the dependent
variable, "left-hand-side" variable or endogenous variable
whose variation is to be "explained" by the regression model;

(b) one or several adjacent columns of numbers as the X
Range,
*aka* the independent variables,
right-hand side (of the equals sign) variables, exogenous variables or
"explanatory" variables;

(c) the upper-left corner of a blank range of cells in your
spreadsheet where the results will be printed.

The X and Y ranges must contain the same number of rows,
all numeric data, no missing values.

Here is output from Excel's regression utility replicating the regression of Price (Y range) against Quantity (X range). At the bottom of the output you can see the same Intercept and Quantity slope coefficients that are shown for the trend line in the XY plot above. (Other parts of this output are explained below.)

Try specifing Quantity as the dependent variable and Price as the independent variable, and estimating the conventional demand regression model.

**Multivariate models**

Now try regressing Quantity (Y range) against both Price and
Income (the X range is both the Price and Income columns).
This will yield coefficient estimates for the
multivariate demand model
* Quantity = a + b*Price + c*Income +
e.* &nabsp; You should get something like this:

When analyzing your regression output, first check the signs of the model coefficients: are they consistent with your hypotheses? Is the Price coefficient negative as theory predicts? Does the Income coefficient indicate this is a normal good, or an inferior good? Try calculating the price and income elasticities using these slope coefficients and the average values of Price and Quantity.

The next thing you should check is the statistical
significance of your model coefficients. Because the data are noisy and the
regression line doesn’t fit the data points exactly, each reported coefficient is
really a point *estimate*, a mean value from a distribution of possible
coefficient estimates. So the residuals *e* (the remaining noise in
the data) are used to analyze the
statistical reliability of the regression coefficients.
The columns to the
right of the coefficients column at the bottom of the Excel output
report the standard errors, *t*-statistics,
P-values, and lower and upper 95% confidence bounds for each coefficient.

The standard error is the square root of the
variance of the regression coefficient. The *t*-statistic is
the coefficient estimate
divided by the standard error. If your regression is based on
what statisticians call a "large" sample
(30 or more observations),
a *t*-statistic greater than 2 (or less
than -2) indicates the coefficient is significant with >95% confidence. A *t*-statistic
greater than 1.68 (or less than -1.68) indicates the coefficient is significant
with >90% confidence.
The confidence thresholds for *t*-statistics are higher for small sample sizes.
This example uses only 21 observations to estimate 1 intercept and 2 slope
coefficients, which leaves 21 - 3 = 18 "degrees of freedom" (df) for calculating
significance levels.
In this example, the *t*-statistic on the Income coefficient is 2.037, which would exceed
the 95% confidence threshold for a "large" (N > 30 observations) dataset, but does not quite meet the
95% confidence threshold when N = 21 observations.

If that last paragraph is just statistical gibberish for you, no problem!--most
people just check the P-values. These are the probabilities
that the coefficients are *not *statistically significant. The
P-value of 0.056 for the Income coefficient implies 1 - 0.056 = 94.4% confidence
that the "true" coefficient is between 0 and about 1.02. The last two columns
report the exact lower and upper 95% confidence thresholds for the Income coefficient:
-0.0159 and +1.038 respectively. The very low P-values for the Intercept and Price
coefficients indicate they are very strongly significant, so their 95%
confidence intervals are relatively narrower.

The R-Square statistic near the top of the output represents
the percent of the total variation in the dependent variable that is explained
by the independent variables, i.e., the model's overall “goodness of fit.”
But whether a model is really a "good"
fit or not depends on context. R-squares for cross-sectional models
are typically much lower than R-squares for time-series models. You can always
increase R-square by throwing another independent variable
(*any* variable!)
into your model. Remember that your real objective is to test your hypotheses,
*not *to maximize R-square by including irrelevant
variables in your model and then making up some "hypothesis"
after the fact to "explain"
the results you got.

Those are all the diagnostics you really need to worry about.

**Final comments**

The classical OLS model assumes that the residuals *e* are
independent of each other and
randomly distributed with a mean of zero.
It is sometimes helpful to examine plots of residuals to check for
non-random pattens that indicate problems with your model.
If you take an econometrics class, you will learn how to identify
violations of these assumptions and how to adapt the OLS model to
deal with these situations.

Keep in mind that a regression actually analyzes the statistical
*correlation *between one variable and a set of other variables.
It doesn't actually prove *causality*. It is only the context of your analysis that
lets you infer that the “independent” variables “cause” the variation in the
“dependent” variable.
Somebody else out there is probably using the same data
to prove that your dependent variable is "causing" one of your independent variables!

You should never force the regression line through the vvorigin (the “Constant is zero” check-box in the Excel utility) without a clear theoretical justification for doing so. It makes your model diagnostics unreliable.