# R Data Hack

## Description

**Problem #1. (15 pts)**

Using the Carseats data set from the ISLR package, do the following:

1.1. Get familiar with the data set. How many variables does the data set have? What is the class of the variable Urban? (3 pts)

1.2. Add a new column, Sales.ratio, to the data set. It is the ratio of Sales to Population in each region. Call the new data set Carseats_new. (4 pts)

1.3. Get a subset of Carseats_new that contains the data for Urban regions only and call it Carseats_urban. (4 pts)

1.4. List the records for the top 10 Urban regions ranked by Sales.ratio. (4 pts)
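The steps in Problem 1 can be sketched on a small stand-in data frame. This is only an illustration under stated assumptions: `toy` and its values are hypothetical and carry just three of the Carseats columns, so the same calls would need to be pointed at the real `Carseats` data (and at the top 10 rather than the top 3).

```r
# Toy stand-in for ISLR::Carseats (hypothetical values, three columns only).
toy <- data.frame(
  Sales      = c(9.5, 11.2, 10.1, 7.4, 4.2, 10.8),
  Population = c(276, 260, 269, 466, 340, 501),
  Urban      = factor(c("Yes", "Yes", "No", "Yes", "No", "Yes"))
)

# 1.1: inspect the structure, the number of variables, and a column's class
str(toy)
n_vars <- ncol(toy)
urban_class <- class(toy$Urban)

# 1.2: add the ratio column to a new copy of the data
toy_new <- transform(toy, Sales.ratio = Sales / Population)

# 1.3: keep the urban rows only
toy_urban <- subset(toy_new, Urban == "Yes")

# 1.4: sort by Sales.ratio, largest first, and keep the first rows
#      (3 here because the toy data is tiny; 10 on the real data)
ranked <- toy_urban[order(toy_urban$Sales.ratio, decreasing = TRUE), ]
top_urban <- head(ranked, 3)
print(top_urban)
```

`transform()` and `subset()` are base R; an equivalent `dplyr` pipeline would work just as well.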

**Problem #2. (12 pts)**

Download the csv file R_Data_Hack_Problem2 from Canvas under **Module R Data Hack**, import it, and name the imported data testdata. Based on a 98% confidence interval, can you reject the hypothesis that the mean Fertility Rate in 1960 is greater than in 2013? Explain your rationale. (6 pts) Is it appropriate to use a t-test for this problem, and why? (2 pts) Can you assume equal variance for the two groups? (2 pts) Clearly state your H0 and H1. (2 pts)
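The mechanics of the variance check and the 98% interval can be sketched as follows. The data here is simulated and the column names `FR1960` and `FR2013` are hypothetical, since the layout of the csv file is not given. The choice between a paired and a two-sample test, and the direction of the alternative, are left open on purpose; this sketch uses a two-sided Welch test only to show the calls.

```r
set.seed(1)

# Hypothetical stand-in for testdata; real column names may differ.
testdata <- data.frame(
  FR1960 = rnorm(50, mean = 5.5, sd = 1.8),
  FR2013 = rnorm(50, mean = 2.9, sd = 1.4)
)

# F test comparing the two variances (informs the equal-variance question)
vt <- var.test(testdata$FR1960, testdata$FR2013)
print(vt$p.value)

# Two-sample t-test without assuming equal variances (Welch),
# reporting a 98% confidence interval for the difference in means
tt <- t.test(testdata$FR1960, testdata$FR2013,
             alternative = "two.sided",
             var.equal = FALSE,
             conf.level = 0.98)
ci <- tt$conf.int
print(ci)
```

On the real file, `testdata <- read.csv(...)` would replace the simulated data frame, and `var.equal`, `paired`, and `alternative` should be set to match the hypotheses you state.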

**Problem #3. (10 pts)**

The data set “R_Data_Hack_Problem3” under **Module R Data Hack** includes 10 subgroups and each subgroup has 20 measurements. Use the data set to do the following:

3.1. Set up X-Bar and R charts for this process. Is the process in statistical control? (5 pts)

3.2. If the specification limits are [1.5, 3], what can you say about process capability? (5 pts)
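As a rough sketch of what sits behind the two charts and the capability indices, the limits can be computed by hand in base R. The measurements below are simulated (hypothetical), arranged as 10 rows of 20 values to match the description above, and the constants are the standard control chart table values for subgroup size n = 20. A package such as `qcc` would normally draw the charts; that is an alternative, not something this sketch depends on.

```r
set.seed(42)

# Hypothetical data: 10 subgroups (rows) x 20 measurements (columns).
x <- matrix(rnorm(200, mean = 2.2, sd = 0.25), nrow = 10, ncol = 20)

# Control chart constants for subgroup size n = 20 (standard SPC tables)
A2 <- 0.180
D3 <- 0.415
D4 <- 1.585
d2 <- 3.735

xbar    <- rowMeans(x)                               # subgroup means
R       <- apply(x, 1, function(r) max(r) - min(r))  # subgroup ranges
xbarbar <- mean(xbar)                                # grand mean
Rbar    <- mean(R)                                   # average range

# 3.1: centre lines and control limits
xbar_limits <- c(LCL = xbarbar - A2 * Rbar, CL = xbarbar, UCL = xbarbar + A2 * Rbar)
r_limits    <- c(LCL = D3 * Rbar, CL = Rbar, UCL = D4 * Rbar)

# Simplest control check: are all points inside their limits?
in_control <- all(xbar > xbar_limits["LCL"], xbar < xbar_limits["UCL"],
                  R > r_limits["LCL"], R < r_limits["UCL"])

# 3.2: capability against the specification limits [1.5, 3]
LSL <- 1.5
USL <- 3
sigma_hat <- Rbar / d2                               # within-subgroup sigma estimate
Cp  <- (USL - LSL) / (6 * sigma_hat)
Cpk <- min(USL - xbarbar, xbarbar - LSL) / (3 * sigma_hat)

print(xbar_limits)
print(r_limits)
print(c(in_control = in_control, Cp = Cp, Cpk = Cpk))
```

The check above only looks for points outside the limits; run rules and trends would need a separate look. Capability indices are only meaningful once the process is judged to be in control.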

**Problem #4. (13 pts)**

The data set “R_Data_Hack_Problem4” under **Module R Data Hack** contains **daily sales data** for a superstore, starting on **Jan 15, 2023**. Use the data set to do the following:

4.1. Add time stamps to the data to create a daily time series starting on Jan 15, 2023, and name it myts. (2 pts)

4.2. Use the Average method and the Drift method to forecast demand for the 7 days following the last date in the data set. (3 pts)

4.3. Plot the predicted sales from the two models and add a legend. (3 pts)

4.4. Compute the error measures of the two forecast models using the 80/20 rule (that is, use 80% of the data to train the forecast models and the remaining 20% as the test set to check accuracy) and compare them. Which method looks more promising? Explain your rationale. (5 pts)
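The four steps of Problem 4 can be sketched in base R as below. The sales values are simulated (hypothetical), and `mean_fc`, `drift_fc`, and `rmse` are illustrative helper names, not functions from any package. The `forecast` package offers `meanf()` and `rwf(..., drift = TRUE)` for the same two methods; the hand-written versions here just make the formulas visible.

```r
set.seed(7)

# Hypothetical daily sales; real values come from R_Data_Hack_Problem4.
sales <- 200 + 0.5 * (1:100) + rnorm(100, sd = 10)

# 4.1: daily series whose first observation is day 15 of 2023 (Jan 15)
myts  <- ts(sales, start = c(2023, 15), frequency = 365)
dates <- seq(as.Date("2023-01-15"), by = "day", length.out = length(sales))

# Average method: every future value equals the historical mean
mean_fc <- function(y, h) rep(mean(y), h)

# Drift method: extend the line through the first and last observations
drift_fc <- function(y, h) {
  n <- length(y)
  y[n] + (1:h) * (y[n] - y[1]) / (n - 1)
}

# 4.2: forecasts for the 7 days after the last observation
h <- 7
fc_mean  <- mean_fc(sales, h)
fc_drift <- drift_fc(sales, h)
future   <- seq(max(dates) + 1, by = "day", length.out = h)

# 4.3: history plus both forecasts, with a legend
plot(dates, sales, type = "l",
     xlim = range(c(dates, future)),
     ylim = range(c(sales, fc_mean, fc_drift)),
     xlab = "Date", ylab = "Sales")
lines(future, fc_mean, col = "blue")
lines(future, fc_drift, col = "red")
legend("topleft", legend = c("Average", "Drift"),
       col = c("blue", "red"), lty = 1)

# 4.4: 80/20 split, then compare test-set errors
n_train <- floor(0.8 * length(sales))
train   <- sales[1:n_train]
test    <- sales[(n_train + 1):length(sales)]
rmse    <- function(actual, pred) sqrt(mean((actual - pred)^2))
acc <- c(Average = rmse(test, mean_fc(train, length(test))),
         Drift   = rmse(test, drift_fc(train, length(test))))
print(acc)
```

RMSE is only one possible error measure; MAE or MAPE can be computed the same way from `test` and the two forecast vectors.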

