Regression Intro 2: Univariate Models

Drawing trend lines through clouds of data points is called "regression analysis." When people draw trend lines with a straightedge through noisy data, they naturally tend to fit a line that minimizes the overall deviations of the points from the line. More specifically, they appear to minimize the sum of the squared deviations of the points from the line, since outlier points seem to exert more "pull" on the line than points closer to the line. So it makes sense to have a computer use that same criterion when calculating a trend line.

  1. Download the demand data that you graphed and analyzed in the previous exercise into two columns of an Excel spreadsheet. (If this doesn't download correctly, just enter the data from the previous exercise into an empty spreadsheet. Don't type in the dollar signs; use Format - Cells to format the price column to currency if you want.)

  2. Excel's Regression tool is found under the Tools--Data Analysis menu. (If you don't see Data Analysis as an option in the Tools menu, use Tools--Add-Ins to load the "Analysis TookPak.") Specify the data in the Q column as the "Input Y Range," the data in the P column as the "Input X Range," some cell to the right of your data as the upper-left corner of the "Output Range," check the "Residuals" box to obtain predicted Q and residual for each observation, then "OK" to run the regression.

    You should get output that looks like this:

    There are lots of diagnostics provided here, but usually you only need to check a few noted in the figure. The most important are the t-statistics for the regression coefficients: do they exceed (in absolute value) the threshold value for whatever confidence level you want? A commonly-used confidence level is 95%, which corresponds to a t-statistic of about 2 (as long as you have any reasonable number of observations). The "null hypothesis" is that the coefficients are not significantly different from zero. Given the high t-statistics for these coefficients, we can reject this null hypothesis with very high (>99.99%) confidence.

    Plugging in the coefficient estimates, we obtain the empirical demand equation: Q = 67.58 - 37.64P + e where e is the deviation of the actual data from the predicted line. Check this against the equation you derived in the first exercise: is it reasonably consistent?

  3. The Residuals output includes the predicted values of Q and the residual error (actual Q minus predicted Q). The model diagnostics, including coefficient t-statistics, are derived from the distribution of these residuals. Here we compare predicted vs. actual Q in an X-Y plot. (Note that the axes are necessarily switched, since we are graphing two columns of Q values against P.)

Hopefully your regression results are very consistent with the equation you derived in the first exercise. If they aren't, check your work.

Once you're comfortable with this analysis, try a multivariate regression analysis!