How to do it:

Submission: Submit the GitHub link to your assignment on Canvas.


This assignment works with the IMDB Top 1000 data. Find out more information about this data at this link. Import the data and answer the following questions.

  1. List the names of all the columns in the data.

  2. Which movies have the highest earnings (Gross)?

  3. What is the lowest rating (IMDB_Rating)? List five movies that have this lowest rating.

  4. Which year has the most movies released in the list? What is the total money earned in that year?

  5. What is the average money earned per movie?

  6. Calculate the average number of votes by year. Calculate the average number of votes for movies that have an IMDB rating greater than 9.

  7. Calculate the average Meta score of movies released in 2020 whose number of votes is in the third quartile.

  8. (Optional - Challenging). The current Runtime variable is not numeric. Use the str_remove function to remove "min" from the variable, then use as.numeric to convert the variable to numeric. Calculate the average running time in the 2010s. Calculate the correlation between running time and rating (add use="complete.obs" in the cor function to ignore the missing values).

  9. We can use select_if to select columns satisfying a condition and summarise_if to do calculations on columns satisfying a condition. Try the following to understand these functions.

# Assumes df already holds the imported data
library(dplyr)

# Select only character columns
df %>% select_if(is.character)

# Calculate the mean of all numeric columns, ignoring missing values
df %>% summarise_if(is.numeric, mean, na.rm=TRUE)
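
The functions named in questions 8 and 9 can be tried on a small made-up data frame before applying them to the real data. The sketch below is only an illustration: the data frame toy, its column names, and its values are hypothetical and are not taken from the IMDB data set. It assumes the dplyr and stringr packages are installed.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, NOT the IMDB data; names and values are made up
toy <- tibble(
  title   = c("A", "B", "C", "D"),
  runtime = c("90 min", "120 min", NA, "150 min"),
  rating  = c(7.0, 8.0, 7.5, 9.0),
  votes   = c(100, 200, 300, NA)
)

# Strip the " min" suffix, then convert the remaining text to a number
toy <- toy %>%
  mutate(runtime_num = as.numeric(str_remove(runtime, " min")))

# Correlation computed only on rows where both values are present
cor(toy$runtime_num, toy$rating, use = "complete.obs")

# Keep only the character columns (title and runtime)
toy %>% select_if(is.character)

# Mean of every numeric column, ignoring missing values
toy %>% summarise_if(is.numeric, mean, na.rm = TRUE)
```

Missing values stay missing after the conversion, which is why cor needs use = "complete.obs" and mean needs na.rm = TRUE. In recent dplyr releases select_if and summarise_if are marked as superseded in favour of where() and across(), but they still work as shown here.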

  10. Implement the following functions or combos. Draw a comment or summary from each calculation. The code in this question should be different from the code used in other questions.