Princeton University project management courseware: Use of the best-subsets approach to model building, Project A and Project B

Project_A

Use of the best-subsets approach to model building

Consider the file UNIV&COL.xls, which contains data about universities and colleges: the type of term, the location, the type of school, the average total SAT score, the TOEFL score for applicants who are not native English speakers (less than 550, or at least 550), room and board expenses, annual total cost, and average indebtedness at graduation. The objective of this project is to find out whether there are any relationships among the variables, using regression analysis techniques. You are to write a report on your findings after analyzing the data set. The following is a minimum guideline for what you should analyze. To earn a grade better than a C, you need to do more in-depth analysis. For example, you may have to use tools such as confidence interval estimates or one- or two-sample tests on the data to improve the quality of your report.

a) State your statistical objective for this data set.

b) Perform exploratory data analysis on this data set, for example numerical summary measures and/or box-and-whisker plots.
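
As one possible starting point for the numerical measures in (b), the sketch below computes the five-number summary that a box-and-whisker plot displays, together with the common 1.5 × IQR outlier fences. The SAT values are made up for illustration and do not come from UNIV&COL.xls, and `five_number_summary` is a hypothetical helper name.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical average SAT scores (assumption: not taken from UNIV&COL.xls).
sat_scores = [1000, 1050, 1100, 1150, 1200, 1250, 1300]

low, q1, median, q3, high = five_number_summary(sat_scores)
iqr = q3 - q1

# Common box-plot convention: flag points more than 1.5 * IQR beyond the box.
outliers = [x for x in sat_scores if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("mean:", statistics.mean(sat_scores))
print("standard deviation:", statistics.stdev(sat_scores))
print("five-number summary:", (low, q1, median, q3, high))
print("outliers:", outliers)
```

Different packages use slightly different quartile conventions, so the quartiles reported by a spreadsheet may differ a little from these.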

c) Construct scatter diagrams for pairs of variables. Describe any relationships you see. Do the variables appear to have some association (linear or non-linear)?
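
A numerical companion to the scatter diagrams in (c) is the Pearson correlation coefficient. The sketch below is a minimal illustration with made-up numbers, and `pearson_r` is a hypothetical helper name. The second example is a reminder that a coefficient near zero only rules out a linear association, which is why the scatter diagram itself still has to be inspected.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (assumption): average SAT score and annual total cost.
sat = [950, 1000, 1050, 1100, 1150, 1200, 1250, 1300]
cost = [18000, 19500, 19000, 22000, 23500, 23000, 26000, 27500]
print("SAT vs. cost:", round(pearson_r(sat, cost), 3))

# A perfect but non-linear (quadratic) relationship has zero linear correlation.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
print("x vs. x^2:", pearson_r(x, y))
```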

d) Does the linear model appear to hold for any pair of variables? You may want to run some tests to substantiate why or why not.
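
One test that could be used in (d) is the t test for the slope of a simple linear regression, with null hypothesis that the slope is zero. The sketch below computes the fitted line, the t statistic, and r² from first principles. The data are made up for illustration and `simple_regression` is a hypothetical helper name.

```python
import math

def simple_regression(x, y):
    """Least-squares line y = b0 + b1 * x with the t statistic for H0: b1 = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return {
        "intercept": intercept,
        "slope": slope,
        "t_stat": slope / std_err,
        "r_squared": 1 - sse / syy,
    }

# Hypothetical data (assumption): average SAT score and annual total cost.
sat = [950, 1000, 1050, 1100, 1150, 1200, 1250, 1300]
cost = [18000, 19500, 19000, 22000, 23500, 23000, 26000, 27500]

fit = simple_regression(sat, cost)
for name, value in fit.items():
    print(name, round(value, 4))
# Compare |t_stat| with the critical value of the t distribution with
# n - 2 degrees of freedom, taken from a table or statistical software.
```

A significant slope alone does not establish that the linear model holds; a residual plot is the usual check for curvature or unequal variance.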

e) Apply the best-subsets approach to model building to see if there is any variable that shouldn’t be used for this model.
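
The best-subsets approach in (e) fits every combination of candidate predictors and compares the fits with criteria such as adjusted R² and Mallows' Cp. The sketch below is one minimal way to do this with NumPy, not a definitive implementation. The data are synthetic, and the column names (`sat`, `room_board`, `noise`) and helper names are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def fit_sse(X, y):
    """Fit OLS with an intercept and return the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def best_subsets(X, y, names):
    """Evaluate every non-empty subset of predictors, best adjusted R^2 first."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = fit_sse(X, y) / (n - k - 1)  # error variance from the full model
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            sse = fit_sse(X[:, list(cols)], y)
            p = size + 1  # number of parameters, including the intercept
            rows.append({
                "predictors": tuple(names[c] for c in cols),
                "adj_r2": 1 - (sse / (n - p)) / (sst / (n - 1)),
                "cp": sse / mse_full - (n - 2 * p),  # Mallows' Cp
            })
    return sorted(rows, key=lambda row: row["adj_r2"], reverse=True)

# Synthetic data (assumption): cost depends on sat and room_board, not on noise.
rng = np.random.default_rng(0)
n = 40
sat = rng.normal(1100, 100, n)
room_board = rng.normal(8000, 1000, n)
noise = rng.normal(0, 1, n)
cost = 5000 + 20 * sat + 1.5 * room_board + rng.normal(0, 500, n)

X = np.column_stack([sat, room_board, noise])
results = best_subsets(X, cost, ["sat", "room_board", "noise"])
for row in results:
    print(row["predictors"], round(row["adj_r2"], 4), round(row["cp"], 2))
```

Subsets whose Cp is close to or below the number of parameters (predictors plus one) are the usual candidates. A variable that appears in none of the good subsets is a candidate for removal.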

f) You observe that some universities on the east coast require higher SAT or TOEFL scores for admission. If you add one more variable to the data set, a dummy variable for location (east or west of the Mississippi River), will this qualitative variable give you a more meaningful (better) output for this model? Or is there any other new variable that you think can improve your analysis? A new variable can be created from the given data, or you can add one from external data such as the US News & World Report college rankings.
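
To make the dummy variable in (f) concrete, the sketch below codes location as 1 for east and 0 for west and regresses the SAT score on it. With the dummy as the only predictor, the intercept is the mean for the west group and the dummy coefficient is the east-minus-west difference in means. The records are made up, and the `region` labels are assumed to have been assigned by hand.

```python
import numpy as np

# Hypothetical records (assumption): (side of the Mississippi River, average SAT).
schools = [
    ("east", 1200), ("east", 1250), ("east", 1300),
    ("west", 1100), ("west", 1150), ("west", 1200),
]

# Dummy variable: 1 for east, 0 for west (west is the baseline category).
east = np.array([1.0 if region == "east" else 0.0 for region, _ in schools])
sat = np.array([score for _, score in schools], dtype=float)

# Ordinary least squares with an intercept and the dummy as the only predictor.
design = np.column_stack([np.ones(len(sat)), east])
coef, *_ = np.linalg.lstsq(design, sat, rcond=None)
intercept, east_effect = coef

print("mean SAT, west (intercept):", round(float(intercept), 2))
print("east minus west (dummy coefficient):", round(float(east_effect), 2))
```

In the full model the dummy would be added as one more column next to the quantitative predictors, and its t statistic indicates whether location adds explanatory power.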

g) Once you determine which variables are to be used, perform a multiple regression analysis, including a check for collinearity, on this subset of variables.
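
A common collinearity check for (g) is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below is a minimal NumPy version run on synthetic data. The `tuition` column is a hypothetical variable that is not in the data file, added only so that `total_cost` is nearly a sum of other predictors.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on all remaining predictors (with an intercept).
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - (residuals @ residuals) / ((target - target.mean()) ** 2).sum()
        factors.append(float(1 / (1 - r2)))
    return factors

# Synthetic predictors (assumption): total_cost is almost room_board + tuition.
rng = np.random.default_rng(1)
n = 50
room_board = rng.normal(8000, 1000, n)
tuition = rng.normal(20000, 3000, n)
total_cost = room_board + tuition + rng.normal(0, 100, n)
sat = rng.normal(1100, 100, n)

names = ["room_board", "tuition", "total_cost", "sat"]
X = np.column_stack([room_board, tuition, total_cost, sat])
for name, factor in zip(names, vif(X)):
    print(name, round(factor, 2))
```

A frequently quoted rule of thumb treats a VIF above about 5 or 10 as a sign of problematic collinearity. The usual remedy is to drop or combine one of the offending predictors and refit.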

h) Summarize and comment on your results.

Project_B

Use of the best-subsets approach to model building

Consider the file advertising.xls, which contains data for magazine titles: the cost of a full-color page advertisement (page), audience (subscribers), male percentage of subscribers, and household income. The objective of this project is to find out whether there are any relationships among the variables, using regression analysis techniques. You are to write a report on your findings after analyzing the data set. The following is a minimum guideline for what you should analyze. To earn a grade better than a C, you need to do more in-depth analysis. For example, you may have to use tools such as confidence interval estimates or one- or two-sample tests on the data to improve the quality of your report.

a) State your statistical objective for this data set.

b) Perform exploratory data analysis on this data set, for example numerical summary measures and/or box-and-whisker plots.

c) Construct scatter diagrams for pairs of variables. Describe any relationships you see. Do the variables appear to have some association (linear or non-linear)?

d) Does the linear model appear to hold for any pair of variables? You may want to run some tests to substantiate why or why not.

e) Apply the best-subsets approach to model building to see if there is any variable that shouldn’t be used for this model.

f) Consider the male percentage of subscribers as categorical data. For example, code a magazine as a “male magazine” if the percentage is more than 66%, as “gender free” if it is between 33% and 66%, and as a “female magazine” if it is less than 33%. Then introduce dummy variables for these categories. Will this give you a more meaningful (better) output for this model, given that some households use a male name to subscribe to any magazine? Can you introduce any other dummy variables to improve your analysis? A new dummy variable can be created from the given data or from external data.
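
The three categories in (f) need only two dummy variables, because the third category is implied when both dummies are 0. Including all three alongside an intercept would make the predictors perfectly collinear. The sketch below uses the cut-offs from the task, with “gender free” as the baseline. The function names and sample percentages are assumptions made for illustration.

```python
def gender_category(male_pct):
    """Classify a magazine by its male percentage of subscribers."""
    if male_pct > 66:
        return "male magazine"
    if male_pct < 33:
        return "female magazine"
    return "gender free"

def gender_dummies(male_pct):
    """Return (is_male_magazine, is_female_magazine); gender free is (0, 0)."""
    category = gender_category(male_pct)
    return (int(category == "male magazine"), int(category == "female magazine"))

# Hypothetical male percentages (assumption: not taken from advertising.xls).
for pct in [80, 50, 20]:
    print(pct, gender_category(pct), gender_dummies(pct))
```

Each dummy coefficient in the fitted model is then read as the difference from the “gender free” baseline, holding the other predictors fixed.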

g) Once you determine which variables are to be used, perform a multiple regression analysis, including a check for collinearity, on this subset of variables.

h) Summarize and comment on your results.