## Shiny practice items 'Data Science camp', module 1

This app allows you to practice various aspects covered in this module.

Please make sure that you have fully completed the following tasks before continuing with this app:

1. Read the module's learning plan

2. Watch the topic videos and review the lecture notes

3. Work through the SPSS How to guide

The module has the following objectives:

1. Describe the basic idea of data driven decision making

2. Organize data and classify variables into two types

3. Set up and manage a dataset in SPSS

4. Perform descriptive statistical analyses using a dataset

Please proceed with the practice items by clicking on the links in the top bar of this app. It is not necessary to complete these items in order. We encourage you to work together with a classmate!

There are three types of practice items: theory ('TH'), SPSS ('SP') and multiple choice ('MC').

You should complete these practice items before joining the lab meeting corresponding to this module.

MBA program
Peter Ebbes

#### Variable 1

The number of MBA students in a given year at HEC Paris

Quantitative variable; numbers on this scale have a natural meaning; arithmetic such as multiplication makes sense in the context of this construct (number of students)

#### Variable 2

The age of an MBA participant, measured as 1 = younger than 25 yrs, 2 = 25-30 yrs, 3 = 31-35 yrs, 4 = 36-40 yrs, 5 = over 40 yrs

Ordinal scale (categorical variable); the numbers 1,2,3,4,5 reflect order with respect to the underlying construct (age)

#### Variable 3

Whether an MBA student lives on campus or off campus

Nominal scale (categorical variable); the numbers (e.g. 1 = off campus, 2 = on campus) are just labels

#### Variable 4

The overall satisfaction of the students for mathcamp measured on a five point satisfaction scale (1=very unsatisfied, 2=unsatisfied, 3=neither satisfied, nor unsatisfied, 4=satisfied, 5=very satisfied)

Quantitative variable; this can be debated. This is an attitude rating scale. It seems reasonable to compute an average satisfaction, hence, some arithmetic for this variable makes sense

#### Variable 5

How many times a student went to the gym in the past week, measured as the exact number (0, 1, 2, 3, etc. times)

Quantitative variable; same comment as variable 1

#### Variable 6

How many times a student went to the gym in the past week, measured as 'not at all', '1-2' times, '3-4' times, 'more than 4 times'

Ordinal scale (categorical variable); same comment as variable 2

#### Variable 7

The GMAT score of the student applicant

Quantitative variable. It seems reasonable to compute an average GMAT score, hence, some arithmetic for this variable makes sense

#### Variable 1

The age of an MBA participant, measured as 1 = younger than 25 yrs, 2 = 25-30 yrs, 3 = 31-35 yrs, 4 = 36-40 yrs, 5 = over 40 yrs

Because this is a categorical variable, you should run a frequency table, which gives you the counts and proportions for each age category. To visualize the data on this variable, you could create a bar chart (a pie chart would be ok too, but it is not recommended). Note that you should not recommend computing the average age here. Your data are stored as 1, 2, 3, 4, 5. An average has little to no meaning within the context of this age measurement scale (e.g. how would you interpret an average of 2.78?). You should also not recommend a histogram to visualize the data. Histograms are only suitable for quantitative variables.
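The module itself uses SPSS for this step. Purely as an illustration of what a frequency table contains, here is a minimal Python sketch. The age codes in `age_codes` and the helper name `frequency_table` are made up for this example and are not part of the case data.

```python
from collections import Counter

# Hypothetical age-category codes (1..5) for 10 respondents; not real data.
age_codes = [1, 2, 2, 3, 2, 4, 1, 2, 3, 5]

def frequency_table(codes):
    """Return {category: (count, proportion)}, ordered by category code."""
    counts = Counter(codes)
    n = len(codes)
    return {cat: (counts[cat], counts[cat] / n) for cat in sorted(counts)}

table = frequency_table(age_codes)
for cat, (count, prop) in table.items():
    print(f"category {cat}: count={count}, proportion={prop:.2f}")
```

Note that the sketch only counts the codes. It never averages them, which matches the advice above for a categorical variable.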

#### Variable 2

Whether an MBA student lives on campus or off campus

This is also a categorical variable, so you should recommend the same techniques as for variable 1. Here perhaps a pie chart could work because there are only two categories.

#### Variable 3

The overall satisfaction of the students for mathcamp measured on a five point satisfaction scale (1=very unsatisfied, 2=unsatisfied, 3=neither satisfied, nor unsatisfied, 4=satisfied, 5=very satisfied)

Treating this variable as a quantitative variable will make your life easier. So, you could recommend descriptive statistics for quantitative variables, for instance the mean and standard deviation to capture central tendency and dispersion. Here, a histogram would visualize the responses. You could also propose a boxplot, but that graph will probably not come out nicely, as this variable only has 5 distinct values.

#### Variable 4

How many times a student went to the gym in the past week, measured as the exact number (0, 1, 2, 3, etc. times)

This variable is a quantitative variable, so you should recommend descriptive statistics for a quantitative variable. As for the previous question, the mean and standard deviation would be useful. Here, I would also investigate the minimum and maximum. It is not unthinkable that at least one respondent puts down a very high number (out, liar!)! Maybe compare the median with the mean to check whether the data is long-tailed (to the right). It could be that most students do not go to the gym a lot but a few students go very often (of course, we all like to think we go a lot!). A histogram would also be very important to construct here! The boxplot may or may not come out nicely, depending on how many unique values you get here and the lumpiness of the data.
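To see why the minimum, maximum and median are worth checking, here is a small Python sketch (the module itself uses SPSS). The values in `gym_visits` are invented for illustration, with one very high answer included on purpose.

```python
import statistics

# Hypothetical weekly gym visits for 10 students; 21 is a deliberately extreme answer.
gym_visits = [0, 0, 1, 1, 1, 2, 2, 3, 4, 21]

summary = {
    "min": min(gym_visits),
    "max": max(gym_visits),
    "mean": statistics.mean(gym_visits),
    "median": statistics.median(gym_visits),
    "sd": statistics.stdev(gym_visits),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# A mean well above the median hints at a long tail to the right.
print("mean > median:", summary["mean"] > summary["median"])
```

With these made-up numbers the mean is 3.5 while the median is 1.5: the single extreme value pulls the mean up, and the median barely notices it.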

#### Variable 5

The GMAT score of the student when (s)he applied for the program

This is a quantitative variable, so you should summarize this variable using central tendency measures (e.g. mean and/or median) and dispersion measures (e.g. standard deviation). You could also think of the 5 number summary (which 5 numbers would you need to include?). Both a histogram and a boxplot could be useful to visualize the distribution of GMAT scores.
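If you want to check your answer about the 5 number summary, the Python sketch below computes one for a handful of invented GMAT scores. The list `gmat` and the helper `five_number_summary` are assumptions made for this example. Software packages use slightly different quartile rules, so SPSS may report somewhat different quartiles on the same data.

```python
import statistics

# Hypothetical GMAT scores for 9 applicants (illustration only, not real data).
gmat = [600, 620, 640, 650, 660, 680, 700, 710, 730]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

summary5 = five_number_summary(gmat)
print(summary5)
```

These are the same five numbers a boxplot draws: the ends of the whiskers (when there are no outliers), the edges of the box, and the line inside the box.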

#### Purpose

##### Practice interpreting descriptive statistics for a quantitative variable. Let's do it!

An important task of applied statistics and data science is to visualize the data. For a quantitative variable, we could create a boxplot. A professor of a large undergrad class (1000 students!) created a boxplot of the final exam scores that (s)he graded on a 0-100 scale. Carefully examine the boxplot and then answer the questions below.

#### Question 1

Just by eye-balling this graph, what would be the 5 number summary of the test scores for the final exam?

#### Question 2

Another way to visualize the values of a quantitative variable is to make a histogram. Grab a piece of paper, and sketch out the histogram. What do you think the histogram of these test scores would look like?

#### Question 3

Looking at the test scores, would you say that the mean test score is higher or lower than the median test score? Why?

#### Question 4

We found that the 75th percentile is (rounded) 94. What is the interpretation of this number?

#### Question 5

Arguably, one could say that the final exam in this example was fairly easy [[ you may think: are you kidding me? How could there be such a thing as an easy exam?! ]]. The exam scores are a bit squeezed towards the upper range with an average test score that is quite generous. Ideally, exam scores follow a normal distribution, like IQ for example. Grab a piece of paper and sketch out what the histogram and boxplot would look like if the test scores followed a normal distribution with an average of 75.

#### Question 1

In this module's topic/video we analyzed the 'date of primary card issued' for American Express users. We saw that many users joined before 2001. They have already been American Express customers for a long time. There are (relatively) fewer customers with a more recent tenure. You can find this frequency table in the lecture slides.

Suppose you are a manager at Discover. Construct the same table as we did for American Express users. Compare the two tables. What do you conclude?

#### Question 2

Your manager asks you whether Men and Women differ in which card they use as their primary card. Do Men and Women have the same card brand preference? Use the case data and SPSS to investigate this question. What did you find?

#### Question 3

Similarly, your manager also wants to know if there is a difference between the tenure of Men and Women. Or, to put it differently, whether Men and Women have had their primary card equally long at the time of data collection. Use the case data and SPSS to investigate this question. What did you find?

#### Question 1

In this module's topic we learned about descriptive statistics for quantitative variables within the context of the American Express mini case. We summarized the data on the amount spent on groceries and retail across all card holders. You are now being asked to summarize the spending on groceries and retail for the year 2007.

What are the central tendency summaries for grocery and retail spending in 2007?

#### Question 2

Following up on the previous question, compute the measures for dispersion. What did you find?

#### Question 3

Use data visualization to describe the pattern of the monthly number of grocery items that the panel members purchase across 2007-2008. What would you conclude?

#### Purpose

##### Test your knowledge about the subjects of this module. Let's do it!

1. You collect daily data for exactly one working week (5 days) for 50 faculty members of HEC Paris on how much they spend every day in the HEC campus restaurant. Each faculty member agreed to participate and you identify him/her with a random identifier. You also collect their gender. Which of the following options best describes how to organize the data in a data table?

2. Consider the following cross tab. What is the correct interpretation of the number 48.1 in the cell [Visa, 2001-2005]?

3. What is the scale level of the variable GMAT score of a student applicant?

4. For which of the following variables can we compute a standard deviation to describe its dispersion?

5. For which of the following variables would you recommend constructing a bar chart?