Calculating Simple Linear Regression
Simple linear regression is a procedure that provides an estimate of the value of a dependent variable (outcome) based on the value of an independent variable (predictor). Knowing that estimate with some degree of accuracy, we can use regression analysis to predict the value of one variable if we know the value of the other variable (Cohen & Cohen, 1983). The regression equation is a mathematical expression of the influence that a predictor has on a dependent variable, based on some theoretical framework. For example, in Exercise 14, Figure 14-1 illustrates the linear relationship between gestational age and birth weight. As shown in the scatterplot, there is a strong positive relationship between the two variables. Advanced gestational ages predict higher birth weights.
A regression equation can be generated with a data set containing subjects’ x and y values. Once this equation is generated, it can be used to predict future subjects’ y values, given only their x values. In simple or bivariate regression, predictions are made in cases with two variables. The score on variable y (dependent variable, or outcome) is predicted from the same subject’s known score on variable x (independent variable, or predictor).
Research Designs Appropriate for Simple Linear Regression
Research designs that may utilize simple linear regression include any associational design (Gliner et al., 2009). The variables involved in the design are attributional, meaning the variables are characteristics of the participant, such as health status, blood pressure, gender, diagnosis, or ethnicity. Regardless of the nature of variables, the dependent variable submitted to simple linear regression must be measured as continuous, at the interval or ratio level.
Statistical Formula and Assumptions
Use of simple linear regression involves the following assumptions (Zar, 2010):
Data that are homoscedastic are evenly dispersed both above and below the regression line, which indicates a linear relationship on a scatterplot. Homoscedasticity reflects equal variance of both variables. In other words, for every value of x, the distribution of y values should have equal variability. If the data for the predictor and dependent variable are not homoscedastic, inferences made during significance testing could be invalid (Cohen & Cohen, 1983; Zar, 2010). Visual examples of homoscedasticity and heteroscedasticity are presented in Exercise 30.
In simple linear regression, the dependent variable is continuous, and the predictor can be any scale of measurement; however, if the predictor is nominal, it must be correctly coded. Once the data are ready, the parameters a and b are computed to obtain a regression equation. To understand the mathematical process, recall the algebraic equation for a straight line:
y = bx + a
where
y = the dependent variable (outcome)
x = the independent variable (predictor)
b = the slope of the line
a = y-intercept (the point where the regression line intersects the y-axis)
No single regression line can be used to predict with complete accuracy every y value from every x value. In fact, you could draw an infinite number of lines through the scattered paired values (Zar, 2010). However, the purpose of the regression equation is to develop the line that allows the highest degree of prediction possible, the line of best fit. The procedure for developing the line of best fit is the method of least squares. The formulas for the slope (b) and y-intercept (a) of the regression equation are computed as follows. Note that once b is calculated, that value is inserted into the formula for a.
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
a = (Σy − bΣx) / n
This example uses data collected from a study of students enrolled in a registered nurse to bachelor of science in nursing (RN to BSN) program (Mancini, Ashwill, & Cipher, 2014). The predictor in this example is the number of academic degrees obtained by the student prior to enrollment, and the dependent variable is the number of months it took for the student to complete the RN to BSN program. The null hypothesis is “Number of degrees does not predict the number of months until completion of an RN to BSN program.”
The data are presented in Table 29-1. A simulated subset of 20 students was selected for this example so that the computations would be small and manageable. In actuality, studies involving linear regression need to be adequately powered (Aberson, 2010; Cohen, 1988). Observe that the data in Table 29-1 are arranged in columns that correspond to the elements of the formula. The summed values in the last row of Table 29-1 are inserted into the appropriate place in the formula for b.
TABLE 29-1 NUMBER OF ACADEMIC DEGREES AND MONTHS TO COMPLETION IN AN RN TO BSN PROGRAM
| x (Number of Degrees) | y (Months to Completion) |
The computations for b and a are as follows:
Step 1: Calculate b.
From the values in Table 29-1, we know that n = 20, Σx = 20, Σy = 267, Σx² = 30, and Σxy = 238. These values are inserted into the formula for b, as follows:
b = [20(238) − (20)(267)] / [20(30) − (20)²] = (4,760 − 5,340) / (600 − 400) = −580 / 200 = −2.9
Step 2: Calculate a.
From Step 1, we now know that b = −2.9, and we plug this value into the formula for a.
a = [267 − (−2.9)(20)] / 20 = 325 / 20 = 16.25
Step 3: Write the new regression equation:
ŷ = −2.9x + 16.25
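The arithmetic in Steps 1 through 3 can be checked with a short script. The following is a minimal Python sketch rather than part of the original exercise: the function names `slope` and `intercept` are illustrative, the formulas are the standard least-squares computational forms, and the inputs are the summed values reported for Table 29-1.

```python
def slope(n, sum_x, sum_y, sum_x2, sum_xy):
    """Least-squares slope b computed from summary sums."""
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


def intercept(n, sum_x, sum_y, b):
    """Least-squares y-intercept a, given the slope b."""
    return (sum_y - b * sum_x) / n


# Summed values reported in the text for Table 29-1
n, sum_x, sum_y, sum_x2, sum_xy = 20, 20, 267, 30, 238

b = slope(n, sum_x, sum_y, sum_x2, sum_xy)
a = intercept(n, sum_x, sum_y, b)
print(f"b = {b:.2f}, a = {a:.2f}")  # b = -2.90, a = 16.25
```

Working from the sums keeps the script aligned with the hand calculation; with raw data, the same sums would first be accumulated from the x and y columns.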
Step 4: Calculate R.
The multiple R is defined as the correlation between the actual y values and the predicted y values using the new regression equation. The predicted y value using the new equation is represented by the symbol ŷ to differentiate from y, which represents the actual y values in the data set. We can use our new regression equation from Step 3 to compute predicted program completion time in months for each student, using their number of academic degrees prior to enrollment in the RN to BSN program. For example, Student 1 had earned 1 academic degree prior to enrollment, and the predicted months to completion for Student 1 is calculated as:
ŷ = −2.9(1) + 16.25 = 13.35
Thus, the predicted ŷ is 13.35 months. This procedure would be continued for the rest of the students, and the Pearson correlation between the actual months to completion (y) and the predicted months to completion (ŷ) would yield the multiple R value. In this example, the R = 0.638. The higher the R, the more likely that the new regression equation accurately predicts y, because the higher the correlation, the closer the actual y values are to the predicted ŷ values. Figure 29-1 displays the regression line where the x axis represents possible numbers of degrees, and the y axis represents the predicted months to program completion (ŷ values).
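The procedure for obtaining the multiple R can be sketched in code. The Python example below is illustrative only: the function names are assumptions, and because the individual values from Table 29-1 are not reproduced here, the six data pairs are hypothetical and will not reproduce R = 0.638. The sketch shows the mechanics: predict ŷ for each student, then correlate ŷ with the actual y.

```python
from math import sqrt


def predict(x, b=-2.9, a=16.25):
    """Predicted months to completion, using the b and a values from the text."""
    return b * x + a


def pearson_r(u, v):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    ss_u = sum((ui - mean_u) ** 2 for ui in u)
    ss_v = sum((vi - mean_v) ** 2 for vi in v)
    return cov / sqrt(ss_u * ss_v)


print(round(predict(1), 2))  # 13.35, the predicted value for Student 1

# Hypothetical data for illustration only (NOT the Table 29-1 values)
degrees = [0, 1, 1, 2, 2, 0]
months = [17, 14, 12, 11, 13, 15]

predicted = [predict(x) for x in degrees]
multiple_r = pearson_r(months, predicted)
print(round(multiple_r, 3))
```

Because ŷ is a straight-line function of x, the multiple R in simple regression has the same magnitude as the Pearson r between x and y; only the sign can differ.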
FIGURE 29-1 REGRESSION LINE REPRESENTED BY NEW REGRESSION EQUATION.
Step 5: Determine whether the predictor significantly predicts y.
To know whether the predictor significantly predicts y, the slope (b) must be tested against zero. In simple regression, this is most easily accomplished by using the R value from Step 4:
t = R √[(n − 2) / (1 − R²)] = 0.638 √[(20 − 2) / (1 − 0.407)] = 3.52
The t value is then compared to the t probability distribution table (see Appendix A). The df for this t statistic is n − 2. The critical t value at alpha (α) = 0.05, df = 18 is 2.10 for a two-tailed test. Our obtained t was 3.52, which exceeds the critical value in the table, thereby indicating a significant association between the predictor (x) and outcome (y).
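As a check on Step 5, the t statistic can be recomputed from R and n. This is a minimal Python sketch under the assumption that the usual conversion t = R√[(n − 2) / (1 − R²)] is the one intended; the function name `t_from_r` is illustrative, and the critical value is simply copied from the text rather than looked up by the code.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for a simple-regression predictor, from R and sample size."""
    return r * sqrt((n - 2) / (1 - r ** 2))


R, n = 0.638, 20      # values reported in the text
critical_t = 2.10     # two-tailed, alpha = 0.05, df = 18 (from the text)

t = t_from_r(R, n)
print(f"t = {t:.2f}, df = {n - 2}")  # t = 3.52, df = 18
print("significant" if abs(t) > critical_t else "not significant")
```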
Step 6: Calculate R².
After establishing the statistical significance of the R value, it must subsequently be examined for clinical importance. This is accomplished by obtaining the coefficient of determination for regression, which simply involves squaring the R value. The R² represents the percentage of variance explained in y by the predictor. Cohen describes R² values of 0.02 as small, 0.15 as moderate, and 0.26 or higher as large effect sizes (Cohen, 1988). In our example, the R was 0.638, and, therefore, the R² was 0.407. Multiplying 0.407 × 100% indicates that 40.7% of the variance in months to program completion can be explained by knowing the student’s number of earned academic degrees at admission (Cohen & Cohen, 1983).
The R² can be very helpful in testing more than one predictor in a regression model. Unlike R, the R² for one regression model can be compared with another regression model that contains additional predictors (Cohen & Cohen, 1983). The R² is discussed further in Exercise 30.
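The squaring step and the effect-size label can be expressed as a small helper. This Python sketch is illustrative: the function name is hypothetical, and treating Cohen's benchmark values (0.02, 0.15, 0.26) as lower cutoffs for each label is an assumption made here for the sake of the example.

```python
def r_squared_effect(r):
    """Return R squared and a label based on the Cohen (1988) benchmarks."""
    r2 = r ** 2
    if r2 >= 0.26:
        label = "large"
    elif r2 >= 0.15:
        label = "moderate"
    elif r2 >= 0.02:
        label = "small"
    else:
        label = "below small"
    return r2, label


r2, label = r_squared_effect(0.638)
print(f"R squared = {r2:.3f} ({r2 * 100:.1f}% of variance), {label} effect")
# R squared = 0.407 (40.7% of variance), large effect
```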
The standardized beta (β) is another statistic that represents the magnitude of the association between x and y. β has limits just like a Pearson r, meaning that the standardized β cannot be lower than −1.00 or higher than 1.00. This value can be calculated by hand but is best computed with statistical software. The standardized beta (β) is calculated by converting the x and y values to z scores and then correlating the x and y values using the Pearson r formula. The standardized beta (β) is often reported in the literature instead of the unstandardized b, because b depends on the units of measurement and has no lower or upper limits, so its magnitude cannot be judged on its own. β, on the other hand, is interpreted as a Pearson r, and the descriptions of the magnitude of β can be applied, as recommended by Cohen (1988). In this example, the standardized beta (β) is −0.638. Thus, the magnitude of the association between x and y in this example is considered a large predictive association (Cohen, 1988).
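The z-score procedure described above can be sketched as follows. This Python example is a hedged illustration, not the workbook's own computation: the function names are assumptions, the sample standard deviation (n − 1) is used for the z scores, and the data are hypothetical because the Table 29-1 values are not reproduced here.

```python
from math import sqrt


def z_scores(values):
    """Convert values to z scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def standardized_beta(x, y):
    """Correlate the z scores of x and y; in simple regression this is beta."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(p * q for p, q in zip(zx, zy)) / (len(x) - 1)


# Hypothetical data for illustration only (NOT the Table 29-1 values)
degrees = [0, 1, 1, 2, 2, 0]
months = [17, 14, 12, 11, 13, 15]

beta = standardized_beta(degrees, months)
print(round(beta, 3))
```

Whatever the units of x and y, the result stays between −1.00 and 1.00, which is what allows it to be read like a Pearson r.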
This is how our data set looks in SPSS.
Step 1: From the “Analyze” menu, choose “Regression” and “Linear.”
Step 2: Move the predictor, Number of Degrees, to the space labeled “Independent(s).” Move the dependent variable, Number of Months to Completion, to the space labeled “Dependent.” Click “OK.”
Interpretation of SPSS Output
The following tables are generated from SPSS. The first table contains the multiple R and the R² values. The multiple R is 0.638, indicating that the correlation between the actual y values and the predicted y values using the new regression equation is 0.638. The R² is 0.407, indicating that 40.7% of the variance in months to program completion can be explained by knowing the student’s number of earned academic degrees at enrollment.
The second table contains the ANOVA table. As presented in Exercises 18 and 33, the ANOVA is usually performed to test for differences between group means. However, ANOVA can also be performed for regression, where the null hypothesis is that “knowing the value of x explains no information about y.” This table indicates that knowing the value of x explains a significant amount of variance in y. The contents of the ANOVA table are rarely reported in published manuscripts, because the significance of each predictor is presented in the last SPSS table titled “Coefficients” (see below).
The third table contains the b and a values, standardized beta (β), t, and exact p value. The a is listed in the first row, next to the label “Constant.” The b is listed in the second row, next to the name of the predictor. The remaining information that is important to extract when interpreting regression results can be found in the second row. The standardized beta (β) is −0.638. This value has limits just like a Pearson r, meaning that the standardized β cannot be lower than −1.00 or higher than 1.00. The t value is −3.516, and the exact p value is 0.002.
Final Interpretation in American Psychological Association (APA) Format
The following interpretation is written as it might appear in a research article, formatted according to APA guidelines (APA, 2010). Simple linear regression was performed with number of earned academic degrees as the predictor and months to program completion as the dependent variable. The student’s number of degrees significantly predicted months to completion among students in an RN to BSN program, β = −0.638, p = 0.002, and R² = 40.7%. Higher numbers of earned academic degrees significantly predicted shorter program completion time.
Answers to Study Questions
Data for Additional Computational Practice for the Questions to be Graded
Using the example from Mancini and colleagues (2014), students enrolled in an RN to BSN program were assessed for demographics at enrollment. The predictor in this example is age at program enrollment, and the dependent variable is the number of months it took for the student to complete the RN to BSN program. The null hypothesis is: “Student age at enrollment does not predict the number of months until completion of an RN to BSN program.” The data are presented in Table 29-2. A simulated subset of 20 students was randomly selected for this example so that the computations would be small and manageable.
TABLE 29-2 AGE AT ENROLLMENT AND MONTHS TO COMPLETION IN AN RN TO BSN PROGRAM
| x (Student Age) | y (Months to Completion) |
EXERCISE 29 Questions to Be Graded
Name: _______________________________________________________ Class: _____________________
Follow your instructor’s directions to submit your answers to the following questions for grading. Your instructor may ask you to write your answers below and submit them as a hard copy for grading. Alternatively, your instructor may ask you to use the space below for notes and submit your answers online at http://evolve.elsevier.com/Grove/Statistics/ under “Questions to Be Graded.”
Grove, Susan K., Daisha Cipher. Statistics for Nursing Research: A Workbook for Evidence-Based Practice, 2nd Edition. Saunders, 022016. VitalBook file.
Calculating Pearson Chi-Square
The Pearson chi-square test (χ²) compares differences between groups on variables measured at the nominal level. The χ² compares the frequencies that are observed with the frequencies that are expected. When a study requires that researchers compare proportions (percentages) in one category versus another category, the χ² is a statistic that will reveal if the difference in proportion is statistically improbable.
A one-way χ² is a statistic that compares different levels of one variable only. For example, a researcher may collect information on gender and compare the proportions of males to females. If the one-way χ² is statistically significant, it would indicate that the proportions of males and females differ by more than would be expected by chance (Daniel, 2000). If more than two groups are being examined, the χ² does not determine where the differences lie; it only determines that a significant difference exists. Further testing on pairs of groups with the χ² would then be warranted to identify the significant differences.
A two-way χ² is a statistic that tests whether proportions in levels of one nominal variable are significantly different from proportions of the second nominal variable. For example, the presence of advanced colon polyps was studied in three groups of patients: those having a normal body mass index (BMI), those who were overweight, and those who were obese (Siddiqui, Mahgoub, Pandove, Cipher, & Spechler, 2009). The research question tested was: “Is there a difference between the three groups (normal weight, overweight, and obese) on the presence of advanced colon polyps?” The results of the χ² test indicated that a larger proportion of obese patients fell into the category of having advanced colon polyps compared to normal weight and overweight patients, suggesting that obesity may be a risk factor for developing advanced colon polyps. Further examples of two-way χ² tests are reviewed in Exercise 19.
Research Designs Appropriate for the Pearson χ²
Research designs that may utilize the Pearson χ² include the randomized experimental, quasi-experimental, and comparative designs (Gliner, Morgan, & Leech, 2009). The variables may be active, attributional, or a combination of both. An active variable refers to an intervention, treatment, or program. An attributional variable refers to a characteristic of the participant, such as gender, diagnosis, or ethnicity. Regardless of whether the variables are active or attributional, all variables submitted to χ² calculations must be measured at the nominal level.
Statistical Formula and Assumptions
Use of the Pearson χ² involves the following assumptions (Daniel, 2000):
The test is distribution-free, or nonparametric, which means that no assumption has been made for a normal distribution of values in the population from which the sample was taken (Daniel, 2000).
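The formula for a two-way χ² with a 2 × 2 contingency table is:
χ² = n[(A)(D) − (B)(C)]² / [(A + B)(C + D)(A + C)(B + D)]
The contingency table is labeled as follows. A contingency table is a table that displays the relationship between two or more categorical variables (Daniel, 2000):
| A | B |
| C | D |
where n = the total number of observations (A + B + C + D)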
With any χ² analysis, the degrees of freedom (df) must be calculated to determine the significance of the value of the statistic. The following formula is used for this calculation:
df = (R − 1)(C − 1)
where
R = Number of rows
C = Number of columns
A retrospective comparative study examined whether longer antibiotic treatment courses were associated with increased antimicrobial resistance in patients with spinal cord injury (Lee et al., 2014). Using urine cultures from a sample of spinal cord–injured veterans, two groups were created: those with evidence of antibiotic resistance and those with no evidence of antibiotic resistance. The veterans were also divided into two groups based on having had a history of recent (in the past 6 months) antibiotic use for more than 2 weeks or no history of recent antibiotic use.
The data are presented in Table 35-1. The null hypothesis is: “There is no difference between antibiotic users and non-users on the presence of antibiotic resistance.”
TABLE 35-1 ANTIBIOTIC RESISTANCE BY ANTIBIOTIC USE
| | Antibiotic Use | No Recent Use |
The computations for the Pearson χ² test are as follows:
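Step 1: Create a contingency table of the two nominal variables:
| | Used Antibiotics | No Recent Use | Totals |
| Resistant | 8 | 7 | 15 |
| Not resistant | 6 | 21 | 27 |
| Totals | 14 | 28 | 42 |
Step 2: Fit the cells into the formula:
χ² = 42[(8)(21) − (7)(6)]² / [(8 + 7)(6 + 21)(8 + 6)(7 + 21)] = 42(126)² / [(15)(27)(14)(28)] = 666,792 / 158,760 = 4.20
Step 3: Compute the degrees of freedom:
df = (2 − 1)(2 − 1) = 1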
Step 4: Locate the critical χ² value in the χ² distribution table (Appendix D) and compare it to the obtained χ² value.
The obtained χ² value is compared with the tabled χ² values in Appendix D. The table includes the critical values of χ² for specific degrees of freedom at selected levels of significance. If the value of the statistic is equal to or greater than the value identified in the χ² table, the difference between the two variables is statistically significant. The critical χ² for df = 1 is 3.84, and our obtained χ² is 4.20, thereby exceeding the critical value and indicating a significant difference between antibiotic users and non-users on the presence of antibiotic resistance.
Furthermore, we can compute the rates of antibiotic resistance among antibiotic users and non-users by using the numbers in the contingency table from Step 1. The antibiotic resistance rate among the antibiotic users can be calculated as 8 ÷ 14 = 0.571 × 100% = 57.1%. The antibiotic resistance rate among the non-antibiotic users can be calculated as 7 ÷ 28 = 0.25 × 100% = 25%.
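The χ² value, the degrees of freedom, and the two resistance rates can all be verified with a few lines of code. The Python sketch below is illustrative rather than the workbook's own method: it uses the general observed-versus-expected form of the Pearson statistic, which works for any table size, and the function name `chi_square` is an assumption. The cell counts come from the rates given in the text (8 of 14 users and 7 of 28 non-users were resistant), with the remaining cells obtained by subtraction.

```python
def chi_square(table):
    """Pearson chi-square and df for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: resistant, not resistant. Columns: antibiotic users, non-users.
observed = [[8, 7], [6, 21]]

chi2, df = chi_square(observed)
print(f"chi-square = {chi2:.2f}, df = {df}")  # chi-square = 4.20, df = 1

users_rate = 8 / 14 * 100
non_users_rate = 7 / 28 * 100
print(f"{users_rate:.1f}% versus {non_users_rate:.1f}%")  # 57.1% versus 25.0%
```

Note that some statistical packages apply a continuity correction to 2 × 2 tables by default, which would give a smaller value than the uncorrected 4.20 computed here.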
The following screenshot is a replica of what your SPSS window will look like. The data for subjects 24 through 42 are viewable by scrolling down in the SPSS screen.
Step 1: From the “Analyze” menu, choose “Descriptive Statistics” and “Crosstabs.” Move the two variables to the right, where either variable can be in the “Row” or “Column” space.
Step 2: Click “Statistics” and check the box next to “Chi-square.” Click “Continue” and “OK.”
Interpretation of SPSS Output
The following tables are generated from SPSS. The first table contains the contingency table, similar to Table 35-1 above. The second table contains the χ² results.
The last table contains the χ² value in addition to other statistics that test associations between nominal variables. The Pearson χ² test is located in the first row of the table, which contains the χ² value, df, and p value.
Final Interpretation in American Psychological Association (APA) Format
The following interpretation is written as it might appear in a research article, formatted according to APA guidelines (APA, 2010). A Pearson χ² analysis indicated that antibiotic users had significantly higher rates of antibiotic resistance than those who did not use antibiotics, χ²(1) = 4.20, p = 0.04 (57.1% versus 25%, respectively). This finding suggests that extended antibiotic use may be a risk factor for developing resistance, and further research is needed to investigate resistance as a direct effect of antibiotics.
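The reported p value can be checked without statistical software when df = 1, because a χ² variable with one degree of freedom is the square of a standard normal variable. The Python sketch below relies on that identity and on the standard library function `math.erfc`; the helper name is hypothetical, and the shortcut applies only to df = 1.

```python
from math import erfc, sqrt


def chi_square_p_value_df1(chi2):
    """Upper-tail p value for a chi-square statistic with df = 1 only."""
    return erfc(sqrt(chi2 / 2))


p = chi_square_p_value_df1(4.20)
print(f"p = {p:.2f}")  # p = 0.04

# The tabled critical value of 3.84 corresponds to p of about 0.05
print(f"{chi_square_p_value_df1(3.84):.2f}")  # 0.05
```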
Answers to Study Questions
Data for Additional Computational Practice for Questions to be Graded
A retrospective comparative study examining the presence of candiduria (presence of Candida species in the urine) among 97 adults with a spinal cord injury is presented as an additional example. The differences in the use of antibiotics were investigated with the Pearson χ² test (Goetz, Howard, Cipher, & Revankar, 2010). These data are presented in Table 35-2 as a contingency table.
TABLE 35-2 CANDIDURIA AND ANTIBIOTIC USE IN ADULTS WITH SPINAL CORD INJURIES
| No antibiotic use | 0 | 39 | 39 |