BU.510.650 Data Analytics
Assignment # 5
Please submit two documents: Your answers to each part of every question in .pdf or .doc format, and
your R script, in .R format. In your document with answers, please do not respond with R output only.
While it is okay to include R output in that document, please make sure you spell out the response to
the question asked. Please submit your assignment through Blackboard and name your files using the
convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_5.pdf and
Yazdi_Mohammad_5.R.
For answering questions 1 and 2: Please watch the Advertising Example and Toyota Example recordings
of class, explaining Linear Regression in R.
For answering question 3: Please watch the Logistic Regression in R recording of class, explaining
Logistic Regression in R.
1. This question involves the use of simple linear regression on the Bikeshare data set (adapted from a
data set of bike rentals from DC's Capital Bikeshare system; see the following URL for details:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). The following is a brief description
of the data, which is in the file Bikeshare.csv on Blackboard.
Temperature: normalized temperature in Celsius, derived according to: (temperature on that day − t_min)/(t_max − t_min),
where t_min = −8 and t_max = +39 (the minimum and maximum temperatures encountered during the time period
the data was collected).
Humidity: normalized humidity, derived according to: humidity on that day (measured on a scale of 0 to 100) / 100.
Windspeed: normalized windspeed, derived according to: windspeed on that day (in km/h) / wind_max, where
wind_max = 67, the fastest wind encountered during the time period the data was collected.
Rentals: number of bikes rented on that day.
Hint: Keep the data set in normalized values and do NOT convert the normalized values back to the original scale.
a) First, read the data in Bikeshare.csv to a data frame called Bikeshare. Use the lm() function to
run a simple linear regression with Rentals as the output variable and Temperature as the input
variable. Use the summary() function to print the results.
Comment on the output. Specifically: Does temperature have a statistically significant effect
on the number of rentals?
What is the effect of a one-degree (Celsius) change in temperature on rentals? Hint:
The answer to this question is the same as the answer to the following question:
what is the effect of a 1/47 change in normalized Temperature on Rentals?
b) Repeat part (a), but this time with Humidity as the input variable.
c) Repeat part (a), but this time with Windspeed as the input variable.
d) Check the R2 value you obtained in part (c). You will notice that it is very small. How do you
reconcile the small R2 value with your answer for part (c)?
e) Plot Rentals versus Temperature, and display the regression line on the plot, that is, the line
that shows how Rentals changes with respect to Temperature according to your regression. The
following command will produce such a line: abline(…, lwd = 5, col = "red"). Here, … should be
replaced with the name of the variable where you stored your regression results, lwd = 5 specifies
the width of the line, and col = "red" makes it a red line.
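As a sketch of how parts (a) and (e) fit together (the variable name fit and the synthetic stand-in data are illustrative assumptions; in the assignment you would read Bikeshare.csv instead):

```r
# Hypothetical stand-in data; in the assignment, use:
#   Bikeshare <- read.csv("Bikeshare.csv")
set.seed(1)
Bikeshare <- data.frame(Temperature = runif(100))
Bikeshare$Rentals <- round(1000 + 5000 * Bikeshare$Temperature + rnorm(100, sd = 500))

# Simple linear regression of Rentals on Temperature, as in part (a)
fit <- lm(Rentals ~ Temperature, data = Bikeshare)

# Scatter plot with the fitted regression line overlaid, as in part (e)
plot(Bikeshare$Temperature, Bikeshare$Rentals)
abline(fit, lwd = 5, col = "red")
```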
f) The goal of this part is to introduce you to a useful plot type, called a scatter plot matrix. Obtain a
scatter plot matrix of all variables (except the variable Day) using the following command:
pairs(~ Rentals + Temperature + Humidity + Windspeed, data=Bikeshare)
Study the graph you obtained. Which input variables appear to have an effect on Rentals?
g) Run multiple linear regression using all variables, except Day, as input variables. Provide the
summary information. Which input variables have a statistically significant effect on Rentals?
Justify your answer.
h) What is the predicted number of rentals on a day when the temperature is 15 degrees Celsius,
humidity is 50 (out of 100), and the windspeed is 5 km/h?
2. In this question, you will work on the updated Bikeshare dataset. In particular, you will check
whether weekends, in addition to weather conditions, affect rental patterns. In addition to all the
previous data, the updated Bikeshare dataset has the following data:
Weekday: goes from 0 to 6, with 0 indicating that the day was Sunday, 1 indicating that the day was Monday, etc.
Registered: number of bikes rented by registered users on that day.
Casual: number of bikes rented by casual users on that day.
To start your work on this question, read the data in Bikeshare_updated.csv to a data frame called
BikeshareUpdated. Then, create a new column in your data frame called Weekend, which shows
1 if the day is a Saturday or Sunday, and 0 otherwise. (R Hint: In R, the or operator is the symbol |.
For example, (x == 5) | (x == 6) will return TRUE if x is 5 or 6.)
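A minimal sketch of the Weekend column construction (the tiny stand-in data frame is an assumption; in the assignment the column is added after reading Bikeshare_updated.csv):

```r
# Hypothetical stand-in; in the assignment, use:
#   BikeshareUpdated <- read.csv("Bikeshare_updated.csv")
BikeshareUpdated <- data.frame(Weekday = 0:6)

# Weekend is 1 for Sunday (Weekday 0) or Saturday (Weekday 6), and 0 otherwise
BikeshareUpdated$Weekend <-
  ifelse((BikeshareUpdated$Weekday == 0) | (BikeshareUpdated$Weekday == 6), 1, 0)
```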
(a) Run a multiple linear regression with Rentals as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals?
(b) Run a multiple linear regression with Registered as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals by registered users?
(c) Run a multiple linear regression with Casual as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals by casual users?
(d) Compare and contrast your results from the previous three parts to answer the following question:
How does the weekend affect rental patterns?
3. In this question, you will use logistic regression on an adaptation of the Titanic data set from the first
class to predict whether a passenger will survive or not.
To begin your work on this question, first read the data from the file “TitanicforLogReg.csv” to a
data frame named Titanic. (Note: Please review the data before proceeding. You will notice that it
has five columns: Survived, Gender, Child, Fare, Class, and three of them (Gender, Fare,
Class) are categorical variables that R will convert to 0/1 columns when you run logistic
regression.)
Next, split the data into training data and test data, using random selection. Include half of the
records in the training data and the rest in the test data. Remember to include set.seed(1) before
the random selection in your code, so we all end up making the same split.
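One way to sketch the split (the made-up Titanic stand-in and the object names train and test are assumptions, not part of the assignment):

```r
# Hypothetical stand-in; in the assignment, use:
#   Titanic <- read.csv("TitanicforLogReg.csv")
Titanic <- data.frame(Survived = rep(c(0, 1), 50), Fare = runif(100))

set.seed(1)                          # so everyone gets the same split
n <- nrow(Titanic)
train_rows <- sample(1:n, n / 2)     # random half of the row indices
train <- Titanic[train_rows, ]
test  <- Titanic[-train_rows, ]

# Proportions of survivors, as asked in part (a)
mean(train$Survived)
mean(test$Survived)
```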
(a) What is the proportion of passengers who survived in the training data, and the proportion of
passengers who survived in the test data?
(b) Run logistic regression on the training data, with Survived as the response variable and Gender,
Child, Fare, Class as predictor variables. Display a summary of the results. Examine the output:
Which predictors are statistically significant? Which predictors are not statistically significant?
(c) Based on part (b), remove the predictors that are not statistically significant, and run logistic
regression again on the training data. Display a summary of the results. Examine the output: Are all
remaining predictors statistically significant?
(d) Using your regression results from part (c), predict the probability of survival for each passenger in
the test data. Using these probabilities, assign each passenger in the test data a final prediction of 1
(will survive) or 0 (will not survive). When making this final prediction, adopt the following rule:
If the passenger's probability of survival is greater than 0.5, then we predict the passenger will
survive; otherwise we predict the passenger will not survive.
(e) Compute the accuracy of the predictions you made for the test data: What is the percentage of
passengers for whom your prediction was accurate?
Two basic types of analysis
Data Analysis
Supervised learning
Predict the value of an output variable given the values of input variables, for example:
Estimate a person's annual wage, given the person's age, education, gender, industry, etc.
Predict whether the S&P 500 will go up or down the next day, given how the market moved in the last n days
Methods:
Regression
Classification
Unsupervised learning
Draw conclusions from data in the absence of a clearly defined output variable, for example:
Given demographic and purchase data about customers of a supermarket, find out which customer characteristics lead to similar shopping behavior
From thousands of genes that may or may not be present in each person, determine which ones might increase the risk of cancer
Methods:
Clustering
Principal component analysis
For a prediction method: What data looks like
[Slide figure: an n × p data matrix with rows i = 1, …, n; input columns X1, X2, …, Xj, …, Xp containing entries xij; and an output column Y containing entries y1, y2, …, yn.]
Data / points / instances / examples / samples / records: ROWS
Input variables / independent variables / features / attributes / dimensions / covariates /
predictors / regressors / factors: COLUMNS
Output variable / outcome / response / label / dependent variable: COLUMN TO BE PREDICTED
The setup for a prediction method
Obtain some kind of model based on observations xij for i = 1, …, n, j = 1, …, p (aka training data), i.e., determine how the input variables X1, X2, …, Xp influence the output variable Y.
Use that model to predict the output variable for a data set that comes from the same distribution as the training data, but that you have not seen before (aka test data).
[Slide diagram: Training data → Learn → Model; Model → Apply model → Test data → Output.]
Linear regression
Linear regression: a fundamental starting point for all types of regression models
Assumes the value of the output variable is a linear combination of the values of the input variables,
i.e., the value of the output variable = a constant ± a constant times the value of one input variable ± a constant times the value of another input variable ± …
If we have only one input variable, it is called simple linear regression
If we have more than one input variable, it is called multiple linear regression
Useful for prediction when the output variable takes on quantitative values
Example: Advertising
Sales (in 000s), advertising budget on TV (in $000s), advertising budget on Radio (in $000s),
and advertising budget on Newspaper (in $000s)
200 records (observations)
QUESTION: How do the advertising budgets on TV, Radio, and Newspaper affect Sales?
The R code for this example is in Advertising.R
[Slide table: 200 rows, with columns TV, Radio, Newspaper, Sales.]
Simple linear regression
It assumes that there is an approximately linear relationship between the output variable Y and the
single input variable X.
Mathematically:
Y ≈ β0 + β1X, or
Y = β0 + β1X + ε, where ε is a random error that varies across observations
Unknown coefficients: β0 is called the intercept, and β1 is called the slope.
The purpose of regression:
(1) Use training data to estimate β0 and β1; the estimates are denoted β̂0 and β̂1
(2) For a given value of the input variable X, say x, estimate the value of the output variable Y, denoted
by ŷ and given by
ŷ = β̂0 + β̂1x
Estimating the coefficients β0 and β1
Suppose the training data consists of n observations, given by the following pairs:
(x1, y1), (x2, y2), …, (xn, yn)
Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith observation
Then, ei = yi − ŷi yields the ith residual (could be positive or negative)
The residual sum of squares (RSS) is defined as
RSS = e1² + e2² + … + en², or equivalently
RSS = (y1 − β̂0 − β̂1x1)² + … + (yn − β̂0 − β̂1xn)²
We choose β̂0 and β̂1 to minimize the RSS.
Estimating the coefficients β0 and β1
Therefore:
minimize over β̂0 and β̂1:  (y1 − β̂0 − β̂1x1)² + … + (yn − β̂0 − β̂1xn)²
The β̂0 and β̂1 that solve the above minimization problem are given by:
β̂1 = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / Σᵢ₌₁ⁿ (xi − x̄)²
and
β̂0 = ȳ − β̂1x̄
where x̄ and ȳ are the sample means, i.e.,
x̄ = (1/n) Σᵢ₌₁ⁿ xi,  ȳ = (1/n) Σᵢ₌₁ⁿ yi
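The slope and intercept formulas above can be checked directly in R; this sketch uses small made-up numbers (not course data) and compares the hand-computed estimates with lm():

```r
# Small made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Least-squares estimates from the formulas on this slide
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() computes the same quantities
fit <- lm(y ~ x)
```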
Standard errors of the estimators β̂0 and β̂1
The following formulas yield the standard errors associated with β̂0 and β̂1:
SE(β̂0)² = σ² [1/n + x̄² / Σᵢ₌₁ⁿ (xi − x̄)²]
SE(β̂1)² = σ² / Σᵢ₌₁ⁿ (xi − x̄)²
where σ² is the variance of ε and can be estimated by the residual standard error (RSE):
RSE = √(RSS / (n − 2))
Once we have SE(β̂0) and SE(β̂1), the following are the 95% confidence intervals for the true
values of β0 and β1:
[β̂0 − 2·SE(β̂0), β̂0 + 2·SE(β̂0)]
[β̂1 − 2·SE(β̂1), β̂1 + 2·SE(β̂1)]
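In R, confidence intervals for the coefficients come from confint(); note it uses exact t-quantiles, while the slide's estimate ± 2·SE rule is an approximation. The data here is made up for illustration:

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 3, 5, 4, 6, 7, 7, 9)
fit <- lm(y ~ x)

# 95% confidence intervals for the intercept and slope
ci <- confint(fit, level = 0.95)
```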
Hypothesis test
Null hypothesis, H0: There is no relationship between X and Y.
Alternative hypothesis, Ha: There is some relationship between X and Y.
Mathematically:
Null hypothesis, H0: β1 = 0
Alternative hypothesis, Ha: β1 ≠ 0
To test the hypothesis, we compute the t-statistic, given by:
t = (β̂1 − 0) / SE(β̂1)
Hypothesis test
p-value corresponding to a given value of t: Assuming that β1 = 0, what is the probability that
we would observe a t-statistic equal to t or larger?
If the p-value is small, then the interpretation is: It is unlikely that we would observe this value of t
when β1 = 0.
Therefore, if the p-value is small, we reject the null hypothesis and conclude that β1 ≠ 0.
How small should the p-value be so that we can reject the null hypothesis?
Typical p-value cutoffs are 5% or 1%.
Accuracy of the model
The quality of fit in a linear regression is typically assessed using two related quantities: the
residual standard error (RSE) and the R² statistic.
The formula for the residual standard error:
RSE = √(RSS / (n − 2)) = √( (1/(n − 2)) Σᵢ₌₁ⁿ (yi − ŷi)² )
The formula for R²:
R² = (TSS − RSS) / TSS = 1 − RSS/TSS
where TSS is the total sum of squares, i.e.,
TSS = Σᵢ (yi − ȳ)²
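The R² formula can be verified against lm()'s own report; a sketch on made-up data:

```r
# Made-up data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 3, 2, 5, 4, 6)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
r2  <- 1 - rss / tss              # the R^2 formula from this slide
```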
Accuracy of the model
The R² statistic is also called the coefficient of determination.
R² measures the proportion of variability in Y that can be explained using X.
An R² statistic that is close to 1 indicates that the regression explained a large proportion of the
variability in the output variable Y.
An R² statistic close to 0 indicates that the regression did not explain much of the variability in the
output variable Y;
this might occur because the linear model is wrong, or
the inherent error variance σ² is high,
or both.
Multiple linear regression
Now we have more than one input variable:
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other
predictors constant.
The training data consists of n observations, (x11, x12, …, x1p, y1), (x21, x22, …, x2p, y2), …, (xn1, xn2, …,
xnp, yn). In other words, the ith observation is denoted (xi1, xi2, …, xip, yi).
Visually: [slide figure of the data matrix, as before]
Multiple linear regression
Once again, the purpose of regression is:
(1) to use training data to estimate β0, β1, β2, …, βp; the estimates are denoted β̂0, β̂1, β̂2, …, β̂p.
(2) for any given set of values of the input variables, say (x1, x2, …, xp), to estimate the value of the
output variable Y; the estimate, denoted by ŷ, will be
ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp
Multiple linear regression: estimating the coefficients
The prediction for Y based on the ith observation is now ŷi = β̂0 + β̂1xi1 + … + β̂pxip.
Then, ei = yi − ŷi is still the ith residual, and β̂0, β̂1, …, β̂p are still chosen so as to minimize
the residual sum of squares (RSS), which is still given by
RSS = e1² + e2² + … + en², or equivalently
RSS = (y1 − β̂0 − β̂1x11 − … − β̂px1p)² + … + (yn − β̂0 − β̂1xn1 − … − β̂pxnp)²,
that is, RSS = Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi1 − … − β̂pxip)²
Multiple linear regression: hypothesis test
Null hypothesis, H0: There is no relationship between Y and X1, X2, …, Xp.
Alternative hypothesis, Ha: There exists a relationship between Y and X1, X2, …, Xp.
Mathematically:
Null hypothesis, H0: β1 = β2 = … = βp = 0
Alternative hypothesis, Ha: At least one βj ≠ 0
To test the hypothesis, we compute the F-statistic, given by:
F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
A large F-statistic provides evidence against the null hypothesis H0.
Advertising Example: Understanding the regression output
The following is the output we got when we ran a linear regression between Sales and TV budget.
[Slide figure: annotated summary() output, with callouts for the regression we just ran; the min, max, first and third quartiles, and median of all residuals; the estimates of the regression coefficients; the standard error of each estimated coefficient; and the statistic from which the p-value is estimated.]
Advertising Example: Understanding the regression output
Residual standard error: this reports the standard error of the residuals, that is, their sample
standard deviation.
R²: R² is a measure of the model's accuracy. Bigger is better.
F-statistic: the F-statistic tells you whether the model is significant or insignificant. The model is
significant if any of the coefficients are nonzero. Conventionally, a p-value of less than 0.05
indicates that the model is likely significant (one or more βi are nonzero).
Most people look at the R² statistic first. The statistician wisely starts with the F-statistic (or its
p-value), for if the model is not significant then nothing else matters.
How about adjusted R²? See the next few slides.
Example: Advertising
Let us plot Sales against the TV budget.
Let us run a linear regression between Sales and the TV budget.
Plot Sales against the Newspaper budget.
Run a linear regression between Sales and the Newspaper budget.
Advertising Example: Model comparison / selection
Let us run 5 different regressions: (1) with TV only, (2) with Radio only, (3) with Newspaper only, (4)
with TV and Radio, (5) with all three

Model   TV   Radio   Newspaper   R²        Adjusted R²
1       ✓                        0.6119    0.6099
2            ✓                   0.332     0.3287
3                    ✓           0.05212   0.04733
4       ✓    ✓                   0.8972    0.8962
5       ✓    ✓       ✓           0.8972    0.8956

QUESTION: Which model would you use to predict future sales?
Adjusted R² & Overfitting
R² will keep growing if we keep adding more input variables to our regression. The model's
ability to fit the training data cannot become worse when we use an extra input variable.
However, if we keep adding more and more input variables to our model, we will end up
modeling the noise in our training data; this is called overfitting, and the model will not be
useful for prediction purposes.
In contrast to R², adjusted R² eventually starts decreasing as we keep adding input variables,
because it adjusts the original R² for the number of input variables in the model. It increases
only if the new input variable improves the model significantly.
Overfitting Problem
[Slide figure only.]
Making predictions
Suppose we decided to include only TV and Radio advertising in our regression model.
Question: What is the predicted Sales in a city where the TV advertising budget is $100K and the
Radio budget is $50K?
predict(lm(Sales~TV+Radio, data=ad), data.frame(TV=100, Radio=50))
Interactions
What if the effect of $1 spent on radio advertising depended on how much we spend on TV
advertising? For example, it might be that if we spend more on TV, then the effect of $1 spent on
radio increases.
More generally, consider multiple linear regression with two variables:
Y ≈ β0 + β1X1 + β2X2
We interpreted β1 as the average effect on Y of a one-unit increase in X1, holding all other predictors (in
this case, X2) constant. When interactions are present, for example when X1's effect depends on X2, we can
no longer say "β1 is the average effect on Y of a one-unit increase in X1, holding all other predictors
constant," because the effect of a one-unit increase in X1 now depends on the value of X2.
Therefore, we include an interaction term:
Y ≈ β0 + β1X1 + β2X2 + β3X1X2
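In R's formula syntax, the interaction model above can be fit with TV * Radio, which expands to TV + Radio + TV:Radio; this sketch uses a simulated stand-in for the ad data frame:

```r
# Simulated stand-in for the Advertising data
set.seed(1)
ad <- data.frame(TV = runif(50, 0, 300), Radio = runif(50, 0, 50))
ad$Sales <- 3 + 0.05 * ad$TV + 0.1 * ad$Radio +
  0.001 * ad$TV * ad$Radio + rnorm(50)

# TV * Radio fits beta0 + beta1*TV + beta2*Radio + beta3*TV*Radio
fit <- lm(Sales ~ TV * Radio, data = ad)
```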
Nonlinear relationships
Suppose we are modeling the effect of a car's horsepower on its mpg. The mpg could increase
not just in proportion to horsepower, but in proportion to the square of horsepower. In that
case, we could set up our regression in the following way:
mpg ≈ β0 + β1 × horsepower + β2 × horsepower²
This is no longer linear regression
This becomes a quadratic or polynomial regression (more on this later)
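In R, the squared term is added with I(); this sketch simulates horsepower/mpg-style data rather than using a real data set:

```r
# Simulated data for illustration
set.seed(1)
horsepower <- runif(60, 50, 250)
mpg <- 60 - 0.3 * horsepower + 0.0005 * horsepower^2 + rnorm(60)

# I(horsepower^2) makes lm() include the square as an extra predictor
fit <- lm(mpg ~ horsepower + I(horsepower^2))
```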
Example: Toyota Used Car Prices
TASK 1: Import the data file ToyotaCorolla.csv and see what the variables are.
TASK 2: Notice that FuelType is not a quantitative variable. It takes on one of three values: CNG,
Diesel, or Petrol.
We need to do a bit of data management. We will create indicator variables: CNGFuel (which will be
1 if the fuel type is CNG and 0 otherwise) and DieselFuel (which will be 1 if the fuel type is Diesel
and 0 otherwise). If both CNGFuel and DieselFuel are 0, we will know that the car's fuel type is
Petrol.
TASK 3: Create plots to see which variables might be influencing price. For example, plot (i) Price
versus Weight, (ii) Price versus KM, (iii) Price versus Automatic, etc.
Example: Toyota Used Car Prices
TASK 4: Run a regression including all input variables.
TASK 5: Run a regression after excluding one or more input variables. See how it compares to the
regression in TASK 4 to decide which variables to include in your final regression model.
[Slide table to fill in: rows 1–5; columns Age, KM, HP, Metcolor, Automatic, CC, Doors, Weight, CNGFuel, DieselFuel, R², Adjusted R².]
Example: Toyota Used Car Prices
TASK 6: Introduce quadratic terms. Is it possible that the price depends on the square of Age and the
square of KM?
TASK 7: Run regressions with Age and KM as your input variables, to see how including Age^2 and
KM^2 influences the model fit.
[Slide table to fill in: rows 1–5; columns Age, KM, Age^2, KM^2, R², Adjusted R².]
TASK 8: Plot the residuals obtained when we run a regression with Age and KM as the only input
variables.
Regression
Regression is one of the most basic, but also most useful, prediction methods.
It allows one to understand which input variables affect the output variable and the degree of that
effect.
It allows one to ask "what if" questions.
Back to linear regression
What if the response (output variable) is qualitative?
Eye color: blue, brown, green, …
Type of pet you have: dog, cat, fish, …
How you get to school: walk, drive, metro, Uber, …
Binary response
Often the response (output variable) is binary: the response is one of two types:
Flip a coin: heads or tails
Pay back a loan or default
Buy or not buy
Thumbs up or thumbs down
Prediction takes the form of classification: given the values of the predictors (input variables), which
type of response will we get?
In this case, we could introduce a binary variable for the response, e.g.,
Y = 1 for one type of response, and Y = 0 for the other.
Classification example: Game of Thrones
Will the character survive at the end of the next season: Yes or No?
[Slide diagram: training data, say from the first three seasons, with predictors related to a character's survival → Learn → a model of the effect of each predictor on survival at the end of the next season → Apply model → test data: did they survive at the end of Season 4?]
Classification example: Credit card fraud
Predict, in real time, whether a credit card transaction is fraudulent or not
For each credit card account, we have data about:
The account holder (gender, address, age, etc.) and the transaction (what was bought, for how
much, where, when, etc.): these are the predictors
Whether the transaction was fraudulent or not: this is the response
We derive a model that relates the predictors to the response
Each transaction, as it happens, is then classified as fraudulent or not.
CNN Politics in 2014: Obama's credit card declined at a fancy restaurant. It was used in a GOP ad for
Valentine's Day!
Classification example: Credit card default
Default data set in ISLR
For a number of individuals, we have data about:
Their income, their credit card balance, and whether they were a student or not: these are the
predictors
Whether they defaulted on their credit card debt or not: this is the response
We use the training data to model the relationship between default and the predictors (income,
balance, and student or not)
We can then predict, for a new individual, whether he or she will default on his or her credit card debt
Logistic regression: main concept
Let Y be the binary response: we know Y is going to be either 0 or 1. In logistic regression,
instead of modeling Y directly as a function of the predictors, we model the probability that Y =
1. For example:
In the credit card default example: Let Y = 1 if an individual defaults on his credit card debt and 0
otherwise. Instead of modeling whether the individual will default or not, logistic regression models
the probability of default as a function of the predictors (income, balance, student or not).
In the GoT example: Let Y = 1 if a GoT character will die at the end of Season 7, and 0 otherwise.
Logistic regression then estimates the probability that the character will die at the end of Season 7, as
a function of the predictors.
In the credit card fraud example: Let Y = 1 if a credit card transaction is fraudulent and 0 otherwise.
Logistic regression models the probability that a transaction is fraudulent, as a function of the
predictors.
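A sketch of fitting such a model in R with glm() (the simulated default-style data is an assumption for illustration):

```r
# Simulated default-style data: Y = 1 means default
set.seed(1)
balance <- runif(200, 0, 2500)
default <- rbinom(200, 1, plogis(-6 + 0.004 * balance))

# family = binomial makes glm() fit a logistic regression for P(Y = 1)
fit <- glm(default ~ balance, family = binomial)

# Predicted probabilities that Y = 1, one per observation
probs <- predict(fit, type = "response")
```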
Logistic regression: main concept
Once we have a model of how the predictors influence the probability that Y = 1, we can decide how
to convert that to a prediction about whether Y will be 1 or 0. For example:
In the credit card default example: If the probability that Y = 1 is greater than 50%, then we predict that
the individual will default, and we take our final prediction to be Y = 1.
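Converting the modeled probabilities into 0/1 predictions at a 50% cutoff can be sketched as:

```r
# Hypothetical predicted probabilities that Y = 1
probs <- c(0.10, 0.45, 0.51, 0.90)

# Predict Y = 1 when the probability exceeds 0.5, else Y = 0
preds <- ifelse(probs > 0.5, 1, 0)
```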
Assignment # 5
Please submit two documents: Your answers to each part of every question in .pdf or .doc format, and
your R script, in .R format. In your document with answers, please do not respond with R output only.
While it is okay to include R output in that document, please make sure you spell out the response to
the question asked. Please submit your assignment through Blackboard and name your files using the
convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_5.pdf and
Yazdi_Mohammad_5.R.
For answering questions 1 and 2: Please watch Advertising Example and Toyota Example recording of
class, explaining Linear Regression in R.
For answering questions 3: Please watch Logistic Regression in R recording of class, explaining Logistic
Regression in R.
1. This question involves the use of simple linear regression on the Bikeshare data set (adapted from a
data set of bike rentals from DCs Capital Bikeshare system see the following url for details:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). The following is a brief description
of the data, which is in the file Bikeshare.csv on Blackboard.
Temperature normalized temperature in Celsius, derived according to: (temperature on that day – t_min)/(t_max t_min), where t_min = 8, t_max = +39 (minimum and maximum temperatures encountered during the time period
the data was collected).
Humidity normalized humidity, derived according to: Humidity (measured on a scale of 0 to 100) on that day / 100.
Windspeed normalized windspeed in km/h, derived according to: Windspeed on that day / wind_max, where
wind_max = 67, the fastest wind encountered during the time period the data was collected.
Rentals number of bikes rented on that day.
Hint: Keep the dataset in the normalized values and do NOT change the normalized to original values.
a) First, read the data in Bikeshare.csv to a data frame called Bikeshare. Use the lm() function to
run a simple linear regression with Rentals as the output variable and Temperature as the input
variable. Use the summary() function to print the results.
Comment on the output. Specifically: Does temperature have a statistically significant effect
on the number of rentals?
What is the effect of a one degree (Celsius) change in temperature on the rentals? Hint:
The answer to this question is the same as the answer to the following question:
what is the effect of a 1/47 degree Celsius change in normalized temperature on the rental
b) Repeat part (a), but this time with Humidity as the input variable.
c) Repeat part (a), but this time with Windspeed as the input variable.
d) Check the R2 value you obtained in part (c). You will notice that it is very small. How do you
reconcile the small R2 value with your answer for part (c)?
e) Plot Rentals versus Temperature, and display the regression line on the plot, that is, the line
that shows how Rentals changes with respect to Temperature according to your regression. The
following command will produce such a line: abline(…, lwd = 5, col = red). Here,
should be replaced with the name of the variable where you stored your regression results, lwd =
5 specifies the width of the line, and col = red makes it a red line.
f)
The goal of this part is to introduce you to a useful plot type, called scatter plot matrix. Obtain a
scatter plot matrix of all variables (except the variable Day) using the following command:
pairs(~ Rentals + Temperature + Humidity + Windspeed, data=Bikeshare)
Study the graph you obtained. Which input variables appear to have an effect on Rentals?
g) Run multiple linear regression using all variables, except Day, as input variables. Provide the
summary information. Which input variables have a statistically significant effect on Rentals?
Justify your answer.
h) What is the predicted number of rentals on a day when the temperature is 15 degrees Celsius,
humidity is 50 (out of 100), and the windspeed is 5 km/h?
2. In this question, you will work on the updated Bikeshare dataset. In particular, you will check
whether weekends, in addition to weather conditions, affect rental patterns. In addition to all the
previous data, the updated Bikeshare dataset has the following data:
Weekday goes from 0 to 6, with 0 indicating that the day was Sunday, 1 indicating that the day was Monday, etc.
Registered number of bikes rented by registered users on that day.
Casual number of bikes rented by casual users on that day.
To start your work on this question, read the data in Bikeshare_updated.csv to a data frame called
BikeshareUpdated. Then, create a new column in your data frame called Weekend, which shows
1 if the day is a Saturday or Sunday, and 0 otherwise. (R Hint: In R, the or operator is the symbol .
For example, (x == 5)  (x == 6) will return TRUE if x is 5 or 6.)
(a) Run a multiple linear regression with Rentals as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals?
(b) Run a multiple linear regression with Registered as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals by registered users?
(c) Run a multiple linear regression with Casual as the output variable and Temperature,
Humidity, Windspeed, and Weekend as input variables. Comment on the output: Which input
variables have a statistically significant effect on the number of rentals by casual users?
(d) Compare and contrast your results from the previous three parts to answer the following question:
How does the weekend affect rental patterns?
3. In this question, you will use logistic regression on an adaptation of the Titanic data set from the first
class to predict whether a passenger will survive or not.
To begin your work on this question, first read the data from the file “TitanicforLogReg.csv” to a
data frame named Titanic. (Note: Please review the data before proceeding. You will notice that it
has five columns: Survived, Gender, Child, Fare, Class, and three of them Gender, Fare,
Class are categorical variables that R will convert to 01 columns when you run logistic
regression.)
Next, split the data into training data and test data, using random selection. Include half of the
records in the training data and the rest in the test data. Remember to include set.seed(1) before
the random selection in your code, so we all end up making the same split.
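A sketch of the split, assuming the data frame is named Titanic as instructed; the names trainRows, train, and test are illustrative:

```r
set.seed(1)                                   # so everyone makes the same split
trainRows <- sample(1:nrow(Titanic), nrow(Titanic) / 2)
train <- Titanic[trainRows, ]                 # half the records for training
test  <- Titanic[-trainRows, ]                # the remaining half for testing
```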
(a) What is the proportion of passengers who survived in the training data, and the proportion of
passengers who survived in the test data?
(b) Run logistic regression on the training data, with Survived as the response variable and Gender,
Child, Fare, Class as predictor variables. Display a summary of the results. Examine the output:
Which predictors are statistically significant? Which predictors are not statistically significant?
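Part (b) might be sketched as follows, assuming the training data frame is named train and Survived is coded 0/1:

```r
# Logistic regression on the training data; significance is read from
# the p-values in the coefficients table of the summary.
logitFit <- glm(Survived ~ Gender + Child + Fare + Class,
                data = train, family = binomial)
summary(logitFit)
```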
(c) Based on part (b), remove the predictors that are not statistically significant, and run logistic
regression again on the training data. Display a summary of the results. Examine the output: Are all
remaining predictors statistically significant?
(d) Using your regression results from part (c), predict the probability of survival for each passenger in
the test data. Using these probabilities, assign each passenger in the test data a final prediction of 1
(will survive) or 0 (will not survive). When making this final prediction, adopt the following rule:
if the passenger's probability of survival is greater than 0.5, then we predict the passenger will
survive; otherwise we predict the passenger will not survive.
(e) Compute the accuracy of the predictions you made for the test data: What is the percentage of
passengers for whom your prediction was accurate?
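Parts (d) and (e) might be sketched as follows, assuming logitFit2 is the reduced model from part (c) and test is the test data frame (both names are illustrative):

```r
# Predicted probabilities of survival for the test data
probs <- predict(logitFit2, newdata = test, type = "response")
# Final 0/1 predictions using the 0.5 cutoff
predSurvived <- ifelse(probs > 0.5, 1, 0)
# Accuracy: the fraction of test passengers predicted correctly
mean(predSurvived == test$Survived)
```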
Two basic types of data analysis

Supervised learning: Predict the value of an output variable given the values of input variables, for example:
- Estimate a person's annual wage, given the person's age, education, gender, industry, etc.
- Predict whether the S&P 500 will go up or down the next day, given how the market moved in the last n days
Methods: Regression, Classification

Unsupervised learning: Draw conclusions from data in the absence of a clearly defined output variable, for example:
- Given demographic and purchase data about customers of a supermarket, find out which customer characteristics lead to similar shopping behavior
- From thousands of genes that may or may not be present in each person, determine which ones might increase the risk of cancer
Methods: Clustering, Principal component analysis
For a prediction method: What data looks like

[Data matrix: one row per observation i = 1, 2, ..., n; one column per input variable X1, X2, ..., Xj, ..., Xp, with entries xij; and a final column Y with entries y1, y2, ..., yn.]

- Data / points / instances / examples / samples / records: ROWS
- Input variables / independent variables / features / attributes / dimensions / covariates / predictors / regressors / factors: COLUMNS
- Output variable / outcome / response / label / dependent variable: COLUMN TO BE PREDICTED
The setup for a prediction method

- Obtain some kind of model based on observations xij for i = 1, ..., n, j = 1, ..., p (aka training data), i.e., determine how input variables X1, X2, ..., Xp influence output variable Y.
- Use that model to predict the output variable for a data set that comes from the same distribution as the training data, but that you have not seen before (aka test data).

[Diagram: Training data → Learn → Model; the model is then applied to Test data to produce the predicted Output.]
Linear regression

- Linear regression: the fundamental starting point for all types of regression models.
- Assumes the value of the output variable is a linear combination of the values of the input variables, i.e., the value of the output variable = a constant, plus a constant times the value of the first input variable, plus a constant times the value of the second input variable, and so on.
- If we have only one input variable, it is called simple linear regression.
- If we have more than one input variable, it is called multiple linear regression.
- Useful for prediction when the output variable takes on quantitative values.
Example: Advertising

- Sales (in 000s), advertising budget on TV (in $000s), advertising budget on Radio (in $000s), advertising budget on Newspaper (in $000s)
- 200 records (observations)
- QUESTION: How does the advertising budget on TV, Radio, and Newspaper affect Sales?
- The R code for this example is in Advertising.R

[Table preview: columns TV, Radio, Newspaper, Sales; rows 1 through 200.]
Simple linear regression

It assumes that there is an approximately linear relationship between the output variable Y and the single input variable X. Mathematically:

Y ≈ β0 + β1 X, or
Y = β0 + β1 X + ε, where ε is a random error that varies across observations.

Unknown coefficients: β0 is called the intercept, and β1 is called the slope.

The purpose of regression:
(1) Use training data to estimate β0 and β1; the estimates are denoted β̂0 and β̂1.
(2) For a given value of the input variable X, say x, estimate the value of the output variable Y, denoted by ŷ and given by ŷ = β̂0 + β̂1 x.
Estimating the coefficients β0 and β1

Suppose the training data consists of n observations, given by the following pairs:
(x1, y1), (x2, y2), ..., (xn, yn)

Let ŷi = β̂0 + β̂1 xi be the prediction for Y based on the ith observation.
Then ei = yi − ŷi is the ith residual (could be positive or negative).

The residual sum of squares (RSS) is defined as
RSS = e1² + e2² + ... + en², or equivalently
RSS = (y1 − β̂0 − β̂1 x1)² + ... + (yn − β̂0 − β̂1 xn)²

We choose β̂0 and β̂1 to minimize the RSS.
Estimating the coefficients β0 and β1

Therefore:
min over β̂0, β̂1 of (y1 − β̂0 − β̂1 x1)² + ... + (yn − β̂0 − β̂1 xn)²

The β̂0 and β̂1 that solve the above minimization problem are given by:
β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²
and
β̂0 = ȳ − β̂1 x̄

where x̄ and ȳ are the sample means, i.e.,
x̄ = (1/n) Σᵢ xi and ȳ = (1/n) Σᵢ yi, with the sums running over i = 1, ..., n.
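The formulas above can be checked directly in R on made-up numbers; the hand-computed values should agree with what lm() returns:

```r
# Least-squares estimates from the closed-form formulas, on toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)         # hand-computed intercept and slope
coef(lm(y ~ x))   # should match the two values above
```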
Standard errors of the estimators β̂0 and β̂1

The following formulas yield the standard errors associated with β̂0 and β̂1:

SE(β̂0)² = σ² [ 1/n + x̄² / Σᵢ (xi − x̄)² ]
SE(β̂1)² = σ² / Σᵢ (xi − x̄)²

where σ² is the variance of ε and can be estimated by the residual standard error (RSE):
RSE = sqrt( RSS / (n − 2) )

Once we have SE(β̂0) and SE(β̂1), the following are the 95% confidence intervals for the true values of β0 and β1:

[ β̂0 − 2 × SE(β̂0), β̂0 + 2 × SE(β̂0) ]
[ β̂1 − 2 × SE(β̂1), β̂1 + 2 × SE(β̂1) ]
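In R, the standard errors appear in the coefficients table of summary(), and confidence intervals (computed with the exact t quantile rather than the factor of 2 above) come from confint(). A toy sketch:

```r
# Toy fit: standard errors and 95% confidence intervals for the coefficients
d   <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))
fit <- lm(y ~ x, data = d)
summary(fit)$coefficients   # estimates, std. errors, t-values, p-values
confint(fit, level = 0.95)  # confidence intervals for intercept and slope
```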
Hypothesis test

Null hypothesis, H0: There is no relationship between X and Y.
Alternative hypothesis, Ha: There is some relationship between X and Y.

Mathematically:
Null hypothesis, H0: β1 = 0
Alternative hypothesis, Ha: β1 ≠ 0

To test the hypothesis, we compute the t-statistic, given by:
t = (β̂1 − 0) / SE(β̂1)
Hypothesis test

- p-value corresponding to a given value of t: Assuming that β1 = 0, what is the probability that we would observe a t-statistic equal to t or larger?
- If the p-value is small, then the interpretation is: it is unlikely that we would observe this value of t when β1 = 0.
- Therefore, if the p-value is small, we reject the null hypothesis and conclude that β1 ≠ 0.
- How small should the p-value be so that we can reject the null hypothesis? Typical p-value cutoffs are 5% or 1%.
Accuracy of the model

The quality of fit in a linear regression is typically assessed using two related quantities: the residual standard error (RSE) and the R² statistic.

The formula for the residual standard error:
RSE = sqrt( RSS / (n − 2) ) = sqrt( (1/(n − 2)) Σᵢ (yi − ŷi)² )

The formula for R²:
R² = (TSS − RSS) / TSS = 1 − RSS / TSS

where TSS is the total sum of squares, i.e.,
TSS = Σᵢ (yi − ȳ)²
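Both quantities can be computed by hand from a fitted model and compared with what summary() reports:

```r
# RSE and R-squared computed from the definitions, on toy data
x   <- c(1, 2, 3, 4, 5)
y   <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
n   <- length(y)
sqrt(rss / (n - 2))  # RSE; matches summary(fit)$sigma
1 - rss / tss        # R-squared; matches summary(fit)$r.squared
```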
Accuracy of the model

- The R² statistic is also called the coefficient of determination.
- R² measures the proportion of variability in Y that can be explained using X.
- An R² statistic that is close to 1 indicates that the regression explained a large proportion of the variability in the output variable Y.
- An R² statistic close to 0 indicates that the regression did not explain much of the variability in the output variable Y; this might occur because the linear model is wrong, or the inherent error variance σ² is high, or both.
Multiple linear regression

Now we have more than one input variable:
Y ≈ β0 + β1 X1 + β2 X2 + ... + βp Xp

We interpret β̂j as the average effect on Y of a one-unit increase in Xj, holding all other predictors constant.

The training data consists of n observations, (x11, x12, ..., x1p, y1), (x21, x22, ..., x2p, y2), ..., (xn1, xn2, ..., xnp, yn). In other words, the ith observation is denoted (xi1, xi2, ..., xip, yi).
Multiple linear regression

Once again, the purpose of regression is:
(1) To use training data to estimate β0, β1, β2, ..., βp; the estimates are denoted β̂0, β̂1, β̂2, ..., β̂p.
(2) For any given set of values for the input variables, say (x1, x2, ..., xp), to estimate the value of the output variable Y; the estimate, denoted by ŷ, will be
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp
Multiple linear regression: estimating the coefficients

The prediction for Y based on the ith observation is now ŷi = β̂0 + β̂1 xi1 + ... + β̂p xip.

Then ei = yi − ŷi is still the ith residual, and β̂0, β̂1, β̂2, ..., β̂p are still chosen so as to minimize the residual sum of squares (RSS), which is still given by
RSS = e1² + e2² + ... + en², or equivalently
RSS = (y1 − β̂0 − β̂1 x11 − ... − β̂p x1p)² + ... + (yn − β̂0 − β̂1 xn1 − ... − β̂p xnp)²,
that is, RSS = Σᵢ (yi − β̂0 − β̂1 xi1 − ... − β̂p xip)².
Multiple linear regression: hypothesis test

Null hypothesis, H0: There is no relationship between Y and X1, X2, ..., Xp.
Alternative hypothesis, Ha: There exists a relationship between Y and X1, X2, ..., Xp.

Mathematically:
Null hypothesis, H0: β1 = β2 = ... = βp = 0
Alternative hypothesis, Ha: At least one βj ≠ 0

To test the hypothesis, we compute the F-statistic, given by:
F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ]

A large F-statistic provides evidence against the null hypothesis H0.
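In R, the F-statistic is printed on the last line of summary(); it can also be pulled out directly. The sketch below assumes the Advertising data are in a data frame called ad, as in Advertising.R:

```r
# F-statistic for the overall regression: value, numerator df (p),
# and denominator df (n - p - 1)
fitAll <- lm(Sales ~ TV + Radio + Newspaper, data = ad)
summary(fitAll)$fstatistic
```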
Advertising Example: Understanding the regression output

The following is the output we got when we ran a linear regression between Sales and TV budget.

[Annotated summary() output:]
- Call: shows the regression we just ran
- Residuals: the min, max, first and third quartiles, and median of all residuals
- Estimate: the estimates of the regression coefficients
- Std. Error: the standard error of each estimated coefficient
- t value: the statistic from which the p-value is estimated
Advertising Example: Understanding the regression output

- Residual standard error: this reports the standard error of the residuals, that is, their sample standard deviation.
- R²: R² is a measure of the model's accuracy. Bigger is better.
- F-statistic: the F-statistic tells you whether the model is significant or insignificant. The model is significant if any of the coefficients are nonzero. Conventionally, a p-value of less than 0.05 indicates that the model is likely significant (one or more βi are nonzero).
- Most people look at the R² statistic first. The statistician wisely starts with the F-statistic (or its p-value), for if the model is not significant then nothing else matters.
- How about adjusted R²? See the next few slides.
Example: Advertising

- Let us plot Sales against TV budget.
- Let us run a linear regression between Sales and TV budget.
- Plot Sales against Newspaper budget.
- Run a linear regression between Sales and Newspaper budget.
Advertising Example: Model comparison / selection

Let us run 5 different regressions: (1) with TV only, (2) with Radio only, (3) with Newspaper only, (4) with TV and Radio, (5) with all three.

Model | TV | Radio | Newspaper | R²      | Adjusted R²
  1   | ✓  |       |           | 0.6119  | 0.6099
  2   |    | ✓     |           | 0.332   | 0.3287
  3   |    |       | ✓         | 0.05212 | 0.04733
  4   | ✓  | ✓     |           | 0.8972  | 0.8962
  5   | ✓  | ✓     | ✓         | 0.8972  | 0.8956

QUESTION: Which model would you use to predict future sales?
Adjusted R² & Overfitting

- R² will keep growing if we keep adding more input variables to our regression. The model's ability to fit the training data cannot become worse when we use an extra input variable.
- However, if we keep adding more and more input variables to our model, we will end up modeling the noise in our training data; this is called overfitting, and the model will not be useful for prediction purposes.
- In contrast to R², adjusted R² eventually starts decreasing as we keep adding input variables, because it adjusts the original R² for the number of input variables in the model. It increases only if the new input variable improves the model significantly.
Overfitting Problem

[Figure illustrating the overfitting problem.]
Making predictions

Suppose we decided to include only TV and Radio advertising in our regression model.

Question: What is the predicted Sales in a city where the TV advertising budget is $100K and the radio budget is $50K?

predict(lm(Sales~TV+Radio, data=ad), data.frame(TV=100, Radio=50))
Interactions

What if the effect of $1 spent on radio advertising depended on how much we spend on TV advertising? For example, it might be that if we spend more on TV, then the effect of $1 spent on radio increases.

More generally, consider multiple linear regression with two variables:
Y ≈ β0 + β1 X1 + β2 X2

We interpreted β1 as the average effect on Y of a one-unit increase in X1, holding all other predictors (in this case, X2) constant. When interactions are present, for example when X1's effect depends on X2, we can no longer say "β1 is the average effect on Y of a one-unit increase in X1, holding all other predictors constant," because the effect of a one-unit increase in X1 now depends on the value of X2.

Therefore, we include an interaction term:
Y ≈ β0 + β1 X1 + β2 X2 + β3 X1 X2
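In an R formula, an interaction term is written with a colon. The sketch below assumes the Advertising data frame is called ad, as in Advertising.R:

```r
# TV:Radio adds the product term beta3 * TV * Radio;
# TV * Radio is shorthand for TV + Radio + TV:Radio
fitInteract <- lm(Sales ~ TV + Radio + TV:Radio, data = ad)
summary(fitInteract)
```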
Nonlinear relationships

Suppose we are modeling the effect of a car's horsepower on its mpg. The mpg could change not just in proportion to horsepower, but also in proportion to the square of horsepower. In that case, we could set up our regression in the following way:

mpg ≈ β0 + β1 × horsepower + β2 × horsepower²

This is no longer linear regression; this becomes a quadratic or polynomial regression (more on this later).
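In an R formula, a squared term must be wrapped in I() so that ^ is treated as arithmetic rather than formula syntax. A sketch for the horsepower illustration, assuming a data frame (such as Auto from the ISLR package) with mpg and horsepower columns:

```r
# Quadratic regression: mpg on horsepower and horsepower squared
fitQuad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fitQuad)
```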
Example: Toyota Used Car Prices

TASK 1: Import the data file ToyotaCorolla.csv and see what the variables are.

TASK 2: Notice that FuelType is not a quantitative variable. It takes on one of three values: CNG, Diesel, or Petrol.
We need to do a bit of data management. We will create indicator variables, CNGFuel (which will be 1 if the fuel type is CNG and 0 otherwise) and DieselFuel (which will be 1 if the fuel type is Diesel and 0 otherwise). If both CNGFuel and DieselFuel are 0, we will know that the car's fuel type is Petrol.
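TASK 2 might be sketched as follows; the data frame name toyota is illustrative:

```r
# Indicator (dummy) variables for the categorical FuelType column
toyota <- read.csv("ToyotaCorolla.csv")
toyota$CNGFuel    <- ifelse(toyota$FuelType == "CNG", 1, 0)
toyota$DieselFuel <- ifelse(toyota$FuelType == "Diesel", 1, 0)
```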
TASK 3: Create plots to see what variables might be influencing price. For example, plot (i) Price versus Weight, (ii) Price versus KM, (iii) Price versus Automatic, etc.
Example: Toyota Used Car Prices

TASK 4: Run a regression including all input variables.

TASK 5: Run a regression after excluding one or more input variables. See how they compare to the regression in TASK 4 to decide which variables to include in your final regression model.

[Blank comparison table to fill in: rows 1 through 5; columns Age, KM, HP, MetColor, Automatic, CC, Doors, Weight, CNGFuel, DieselFuel, R², Adjusted R².]
Example: Toyota Used Car Prices

TASK 6: Introduce quadratic terms. Is it possible that the price depends on the square of Age and the square of KM?

TASK 7: Run regressions with Age and KM as your input variables, to see how including Age² and KM² influences the model fit.

[Blank comparison table to fill in: rows 1 through 5; columns Age, KM, Age², KM², R², Adjusted R².]

TASK 8: Plot the residuals obtained when we run a regression with Age and KM as the only input variables.
Regression

- Regression is one of the most basic, but also most useful, prediction methods.
- It allows one to understand which input variables affect the output variable, and the degree of that effect.
- It allows one to ask "what if" questions.
Back to linear regression

What if the response (output variable) is qualitative?
- Eye color: blue, brown, green, ...
- Type of pet you have: dog, cat, fish, ...
- How you get to school: walk, drive, metro, Uber, ...
Binary response

Often the response (output variable) is binary; the response is one of two types:
- Flip a coin: heads or tails
- Pay back a loan or default
- Buy or not buy
- Thumbs up or thumbs down

Prediction takes the form of classification: given the values of the predictors (input variables), which type of response will we get?

In this case, we could introduce a binary variable for the response, e.g.,
Y = 1 if heads, 0 if tails
Classification example: Game of Thrones

Will the character survive at the end of the next season: Yes or No?

[Diagram: Training data, say, from the first three seasons, with predictors related to a character's survival → Model: the effect of each predictor on survival at the end of the next season → Test data: did they survive at the end of Season 4?]
Classification example: Credit card fraud

Predict, in real time, whether a credit card transaction is fraudulent or not.

For each credit card account, we have data about:
- The account holder (gender, address, age, etc.) [predictors]
- Transaction data (what was bought, for how much, where, when, etc.) [predictors]
- Whether the transaction was fraudulent or not [response]

We derive a model that relates the predictors to the response. Each transaction, as it happens, is then classified as fraudulent or not.

CNN Politics in 2014: Obama's credit card was declined at a fancy restaurant. It was used in a GOP ad for Valentine's Day!
Classification example: Credit card default

Default data set in ISLR. For a number of individuals, we have data about:
- Their income [predictor]
- Their credit card balance [predictor]
- Whether they were a student or not [predictor]
- Whether they defaulted on their credit card debt or not [response]

We use the training data to model the relationship between defaults and the predictors (income, balance, and student or not). We can then predict, for a new individual, whether he will default on his credit card debt.
Logistic regression: main concept

Let Y be the binary response: we know Y is going to be either 0 or 1. In logistic regression, instead of modeling Y directly as a function of the predictors, we model the probability that Y = 1. For example:

- In the credit card default example: Let Y = 1 if an individual defaults on his credit card debt and 0 otherwise. Instead of modeling whether the individual will default or not, logistic regression models the probability of default as a function of the predictors (income, balance, student or not).
- In the GoT example: Let Y = 1 if a GoT character will die at the end of Season 7, and 0 otherwise. Logistic regression then estimates the probability that the character will die at the end of Season 7, as a function of the predictors.
- In the credit card fraud example: Let Y = 1 if a credit card transaction is fraudulent and 0 otherwise. Logistic regression models the probability that a transaction is fraudulent, as a function of the predictors.
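In R, this modeling of P(Y = 1) is what glm() with family = binomial does. A sketch for the credit card default example, assuming the Default data set from the ISLR package with columns default, income, balance, and student:

```r
library(ISLR)
# Model the probability of default as a function of the predictors
logitDefault <- glm(default ~ income + balance + student,
                    data = Default, family = binomial)
predict(logitDefault, type = "response")[1:5]  # first few fitted probabilities
```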
Logistic regression: main concept

Once we have a model of how the predictors influence the probability that Y = 1, we can decide how to convert that into a prediction about whether Y will be 1 or 0. For example:

- In the credit card default example: If the probability that Y = 1 is greater than 50%, then we predict that the individual will default.