Shiny practice items 'Data Science camp', module 1

This app allows you to practice various aspects covered in this module.

Please make sure that you have fully completed the following tasks before continuing with this app:

1. Read the module's learning plan

2. Watch the topic videos and review the lecture notes

3. Work through the SPSS How to guide



The module has the following objectives:

1. Describe the basic idea of data-driven decision making

2. Organize data and classify variables into two types

3. Set up and manage a dataset in SPSS

4. Perform descriptive statistical analyses using a dataset



Please proceed with the practice items by clicking on the links in the top bar of this app. It is not necessary to complete these items in order. We encourage you to work together with a classmate!

There are three types of practice items: theory ('TH'), SPSS ('SP') and multiple choice ('MC').

These practice items will help you come prepared to the lab and help you perform better on the quizzes.

Thus, you should complete these practice items before joining the lab meeting corresponding to this module.

MBA program
Peter Ebbes


Topic

Measurement scales: categorical or quantitative variables


Lecture

topic 2, Organizing data and variable types


Purpose

Practice identifying the correct measurement scale for a variable. Let's do it!


Question

What is the scale level / measurement level of the following variables?


Variable 1

The number of MBA students in a given year at HEC Paris


Quantitative variable; the numbers on this scale have a natural meaning, and arithmetic such as multiplication makes sense for this construct (e.g., twice as many students)


Variable 2

The age of an MBA participant, measured as 1 = younger than 25 yrs, 2 = 25-30 yrs, 3 = 31-35 yrs, 4 = 36-40 yrs, 5 = over 40 yrs


Ordinal scale (categorical variable); the numbers 1,2,3,4,5 reflect order with respect to the underlying construct (age)


Variable 3

Whether an MBA student lives on campus or off campus


Nominal scale (categorical variable); the numbers (e.g., 1 = off campus, 2 = on campus) are just labels


Variable 4

The overall satisfaction of the students with mathcamp, measured on a five-point satisfaction scale (1 = very unsatisfied, 2 = unsatisfied, 3 = neither satisfied nor unsatisfied, 4 = satisfied, 5 = very satisfied)


Quantitative variable, although this can be debated. This is an attitude rating scale, and strictly speaking its values are ordered categories. It seems reasonable to compute an average satisfaction, so some arithmetic for this variable makes sense


Variable 5

How many times a student went to the gym in the past week, measured as the exact number (0, 1, 2, 3, etc. times)


Quantitative variable; same comment as variable 1


Variable 6

How many times a student went to the gym in the past week, measured as 'not at all', '1-2' times, '3-4' times, 'more than 4 times'


Ordinal scale (categorical variable); same comment as variable 2


Variable 7

The GMAT score of the student applicant


Quantitative variable. It seems reasonable to compute an average GMAT score, so some arithmetic for this variable makes sense


Topic

Descriptive statistics for categorical and quantitative variables


Lecture

topic 5, Descriptive statistics: categorical variable

topic 7, Descriptive statistics: quantitative variable


Purpose

Practice identifying the correct descriptive statistical technique for variables in a dataset. Let's do it!


Question

Suppose you collected data on MBA students at HEC Paris. You organized the data in a data table. Now you need to summarize the data using descriptive statistical techniques. Which techniques would you recommend for each one of the following variables in your data table? Recommend one numerical and one graphical technique.


Variable 1

The age of an MBA participant, measured as 1 = younger than 25 yrs, 2 = 25-30 yrs, 3 = 31-35 yrs, 4 = 36-40 yrs, 5 = over 40 yrs


Because this is a categorical variable, you should run a frequency table, which gives you the counts and proportions for each age category. To visualize the data on this variable, you could create a bar chart (a pie chart would be OK too, but it is not recommended). Note that you should not recommend computing the average age here. Your data is stored as 1, 2, 3, 4, 5, and an average has little to no meaning within the context of this age measurement scale (e.g., how would you interpret an average of 2.78?). You should also not recommend a histogram to visualize the data. Histograms are only suitable for quantitative variables.
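
This module uses SPSS, but the idea of a frequency table can also be sketched in a few lines of Python. The example below is only an illustration: the coded responses, the labels dictionary and the function name frequency_table are made up for this sketch and do not come from the case data.

```python
from collections import Counter

# Hypothetical coded responses (1-5) for the age-category variable; not real data.
age_codes = [2, 3, 2, 1, 4, 2, 3, 5, 2, 3]

# Value labels, following the coding scheme described in the question.
labels = {1: "under 25", 2: "25-30", 3: "31-35", 4: "36-40", 5: "over 40"}


def frequency_table(codes):
    """Return {code: (count, proportion)} with categories in sorted order."""
    counts = Counter(codes)
    n = len(codes)
    return {code: (counts[code], counts[code] / n) for code in sorted(counts)}


table = frequency_table(age_codes)
for code, (count, proportion) in table.items():
    # One row per category: label, count, percentage of all respondents.
    print(f"{labels[code]:>8}  {count:>3}  {proportion:.0%}")
```

Note that the sketch only counts how often each code occurs. It never adds or averages the codes, which is exactly why a frequency table suits a categorical variable.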


Variable 2

Whether an MBA student lives on campus or off campus


This is also a categorical variable, so you should recommend the same techniques as for variable 1. Here perhaps a pie chart could work because there are only two categories.


Variable 3

The overall satisfaction of the students with mathcamp, measured on a five-point satisfaction scale (1 = very unsatisfied, 2 = unsatisfied, 3 = neither satisfied nor unsatisfied, 4 = satisfied, 5 = very satisfied)


Treating this variable as a quantitative variable will make your life easier. So, you could recommend descriptive statistics for quantitative variables, for instance the mean and standard deviation to capture central tendency and dispersion. Here, a histogram would visualize the responses. You could also propose a boxplot, but that graph will probably not come out nicely, as this variable only has 5 distinct values.


Variable 4

How many times a student went to the gym in the past week, measured as the exact number (0, 1, 2, 3, etc. times)


This variable is a quantitative variable, so you should recommend descriptive statistics for a quantitative variable. As for the previous variable, the mean and standard deviation would be useful. Here, I would also investigate the minimum and maximum. It is not unthinkable that at least one respondent puts down a very high number (out, liar!)! Also compare the median with the mean to check whether the data has a long tail to the right: in that case the mean is pulled above the median. It could be that most students do not go to the gym a lot but a few students go very often (of course, we all like to think we go a lot!). A histogram would also be very important to construct here! The boxplot may or may not come out nicely, depending on how many unique values you get here and the lumpiness of the data.
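
To see how a single very high answer affects these summaries, here is a small Python sketch using only the standard library. The gym counts are invented for illustration (they are not from any dataset in this module), and the dictionary name summary is just a convenience for this example.

```python
import statistics

# Hypothetical weekly gym visits for 10 students; the last value is an
# implausibly high answer of the kind discussed above. Not real data.
gym_visits = [0, 0, 1, 1, 1, 2, 2, 3, 4, 21]

summary = {
    "mean": statistics.mean(gym_visits),      # central tendency, sensitive to extremes
    "median": statistics.median(gym_visits),  # central tendency, robust to extremes
    "std": statistics.stdev(gym_visits),      # sample standard deviation
    "min": min(gym_visits),
    "max": max(gym_visits),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# A mean clearly above the median suggests a long right tail.
print("right-skewed?", summary["mean"] > summary["median"])
```

In this made-up sample, one extreme value is enough to pull the mean well above the median, which is why checking the minimum, the maximum and a histogram is worthwhile before reporting an average.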


Variable 5

The GMAT score of the student when (s)he applied for the program


This is a quantitative variable, so you should summarize this variable using central tendency measures (e.g. mean and/or median) and dispersion measures (e.g. standard deviation). You could also think of the five-number summary (which 5 numbers would you need to include?). Both a histogram and a boxplot could be useful to visualize the distribution of the GMAT scores.
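
If you want to check your answer about the five-number summary, the Python sketch below computes one for a handful of invented GMAT scores. It is only an illustration under stated assumptions: the scores and the function name five_number_summary are hypothetical, and the quartiles use the "inclusive" method of Python's statistics.quantiles, so SPSS may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical GMAT scores, invented for illustration only.
gmat = [560, 600, 620, 640, 650, 660, 680, 700, 720, 740, 760]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


summary = five_number_summary(gmat)
print(summary)
```

These are the same five numbers that a boxplot draws: the two ends of the whiskers (when there are no outliers), the edges of the box, and the line inside the box.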


Topic

Practice descriptive statistics for a quantitative variable


Lecture

topic 7, Descriptive statistics: quantitative variable.


Purpose

Practice interpreting descriptive statistics for a quantitative variable. Let's do it!


An important task of applied statistics and data science is to visualize the data. For a quantitative variable, we could create a boxplot. A professor of a large undergrad class (1000 students!) created a boxplot of the final exam scores that (s)he graded on a 0-100 scale. Carefully examine the boxplot and then answer the questions below.


Question 1

Just by eyeballing this graph, what would be the five-number summary of the test scores for the final exam?



Question 2

Another way to visualize the values of a quantitative variable is to make a histogram. Grab a piece of paper, and sketch out the histogram. What do you think the histogram of these test scores would look like?



Question 3

Looking at the test scores, would you say that the mean test score is higher or lower than the median test score? Why?



Question 4

We found that the 75th percentile is (rounded) 94. What is the interpretation of this number?



Question 5

Arguably, one could say that the final exam in this example was fairly easy (you may think: are you kidding me? How could there be such a thing as an easy exam?!). The exam scores are a bit squeezed towards the upper range, with an average test score that is quite generous. Ideally, exam scores follow a normal distribution, like IQ for example. Grab a piece of paper and sketch out what the histogram and boxplot would look like if the test scores followed a normal distribution with an average of 75.



Topic

Practice working with SPSS


Lecture

topic 5, Descriptive statistics: categorical variable.

topic 6, Descriptive statistics: two categorical variables.

This activity is only useful if you have reviewed the corresponding SPSS How to guide.


Purpose

Follow the managerial/research questions to further practice your SPSS skills. Let's do it!


Question 1

In this module's topic/video we analyzed the 'date of primary card issued' for American Express users. We saw that many users joined before 2001. They have been American Express customers for a long time. There are (relatively) fewer customers with a more recent tenure. You can find this frequency table in the lecture slides.

Suppose you are a manager at Discover. Construct the same table as we did for American Express users. Compare the two tables. What do you conclude?



Question 2

Your manager asks you whether Men and Women differ in the card they use as their primary card. Do Men and Women have the same card brand preference? Use the case data and SPSS to investigate this question. What did you find?



Question 3

Similarly, your manager also wants to know if there is a difference between the tenure of Men and Women. Put differently, had Men and Women held their primary card for equally long at the time of data collection? Use the case data and SPSS to investigate this question. What did you find?



Topic

Practice working with SPSS


Lecture

topic 7, Descriptive statistics: quantitative variable.

This activity is only useful if you have reviewed the corresponding SPSS How to guide.


Purpose

Follow the managerial/research questions to further practice your SPSS skills. Let's do it!


Question 1

In this module's topic we learned about descriptive statistics for quantitative variables within the context of the American Express mini case. We summarized the data on the amount spent on groceries and retail across all card holders. You are now being asked to summarize the spending on groceries and retail for the year 2007.

What are the central tendency summaries for grocery and retail spending in 2007?



Question 2

Following up on the previous question, compute the measures for dispersion. What did you find?



Question 3

Use data visualization to describe the distribution of the monthly number of grocery items that the panel members purchase across 2007-2008. What would you conclude?



Topic

Five multiple choice practice questions


Lecture

Module 1, all topics


Purpose

Test your knowledge about the subjects of this module. Let's do it!


1. You collect daily data for exactly one working week (5 days) for 50 faculty members of HEC Paris on how much they spend every day in the HEC campus restaurant. Each faculty member agreed to participate and you identify him/her with a random identifier. You also collect their gender. Which of the following options best describes how to organize the data in a data table?


2. Consider the following cross tab. What is the correct interpretation of the number 48.1 in the cell [Visa, 2001-2005]?


3. What is the scale level of the variable GMAT score of a student applicant?


4. For which of the following variables can we compute a standard deviation to describe its dispersion?


5. For which of the following variables would you recommend constructing a bar chart?