Katie's Super Awesome Data Science Jeopardy

Python Fundamentals	Research Design	Statistics	Supervised Learning
100 float (the printed output is <class 'float'>) What will be the output of the following command? print(type(1/3))	100 Calling people on landlines Give an example of a survey technique that is vulnerable to selection bias.	100 1) Number of children people have 2) Number of cars people own. In a histogram, most values pile up on the left and a long tail stretches to the right. What is an example of a distribution that is commonly skewed right? How is this distribution represented in a histogram?	100 Machine learning is the training of a model from data so that it generalizes a decision, as judged against a performance measure. What is machine learning?
200 'blue' colors = ['red', 'white', 'blue', 'green'] What will be the value of colors[2]?	200 In regression, you measure the distance between the predicted and actual values. In classification, you can't measure a distance because the outcomes are categorical. Instead, you measure the percentage of values that were predicted correctly. What is the difference between evaluating a regression model and a classification model? Why is there a difference?	200 Mean > median, because the long right tail pulls the mean up more than the median. If a distribution is skewed right, which of the following describes the relationship between the distribution's mean and median, and why? 1) Mean < median, 2) Mean > median, 3) Mean = median	200 When y is directly proportional to x, so that y is zero whenever x is zero. E.g., y = 2x, where x is the number of sandwiches and y is the number of slices of bread. Under what circumstances would a linear regression have an intercept of zero and a non-zero coefficient? What's an example?
300 df[(df["name"].notnull()) & (df["nationality"] == "UK") & (df["age"] < 30)] Select all rows in df where nationality is UK, age is less than 30, and name is NOT NULL. Columns: name, nationality, age. Rows: (Jason, USA, 42), (Molly, USA, 52), (NaN, France, 36), (NaN, UK, 24), (NaN, UK, 70)	300 If you "train" the model on the first 2/3 of your dataset vs. the last 2/3, your scores will differ because you are using different data to train and test. K-fold cross-validation ensures that ALL of the data is used in both a testing and a training capacity. What is the "problem" with splitting data into training and testing sets that k-fold cross-validation helps alleviate?	300 Absolutely nothing! If 10 is added to every value of a data set, what happens to the standard deviation?	300 Normally distributed residuals - histogram. No collinearity - correlation matrix. Residuals have equal variance - residual plot. Observations are independent - rigorous data collection. What are the assumptions required for a linear regression model? How do we check for them?
400 18 18 x = 12; y = 18; x = y; y = x; print(x, y) What will be the output?	400 When there are very large errors: squaring the values exaggerates them, so they inflate RMSE and mean squared error more than they inflate mean absolute error. Under what circumstances would the root mean squared error (RMSE) or the mean squared error be higher than the mean absolute error?	400 The data is likely non-linear and we should not be using OLS. What conclusions can we draw from this residual plot? https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/05model_check/residual_fitted_alpha_plot.gif	400 R2 will typically INCREASE. What is the effect on R2 in an OLS model if outliers are removed?
500 Inside a function, set total = 0 before the loop, then run: for element in x: total += element; print(total) Write a function that loops through the following list and prints the cumulative sum for each element. x = [10, 25, 70, 100, 500]	500 For a male, the odds of being admitted are 5.44 times as large as the odds for a female being admitted. How would we interpret the following logistic regression equation? https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/	500 Holding the other variable constant, the price is predicted to increase by 1767.292 when the foreign variable goes up by one and to decrease by 294.1955 when mpg goes up by one; it is predicted to be 11905.42 when both mpg and foreign are zero. How would we interpret the coefficients of the following table? http://dss.princeton.edu/online_help/analysis/interpreting_regression.htm	500 100% (with k = 1). Every point would be the "nearest neighbor" of itself. If we used our training data as testing data in KNN, what would be our accuracy rate and why?
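
For the Python Fundamentals 500 clue, here is one possible way to write the cumulative-sum function. This is only a sketch: the name cumulative_sums is made up for illustration, and returning a list (rather than only printing) is a choice made so the result can be checked against itertools.accumulate from the standard library.

```python
from itertools import accumulate


def cumulative_sums(values):
    """Return the running total after each element of values."""
    totals = []
    running = 0  # initialized once, outside the loop, so it carries over
    for value in values:
        running += value
        totals.append(running)
    return totals


x = [10, 25, 70, 100, 500]
for total in cumulative_sums(x):
    print(total)

# Cross-check against the standard library's running-sum helper.
assert cumulative_sums(x) == list(accumulate(x))
```

The key point is that the running total is set to zero once, before the loop; resetting it on every pass would just print each element unchanged.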

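Two of the answers on the board (Statistics 300 and Research Design 400) can be sanity-checked with a few lines of standard-library Python. This is a rough sketch; the sample values and the helper names mae and rmse are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
shifted = [v + 10 for v in data]

# Adding a constant moves the mean but leaves the spread unchanged.
assert math.isclose(statistics.pstdev(data), statistics.pstdev(shifted))


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [2, -2, 2, -2]    # every error has the same size
one_large_error = [0, 0, 0, 8]  # same MAE, but one big miss

print(mae(even_errors), rmse(even_errors))
print(mae(one_large_error), rmse(one_large_error))
```

Both error lists have the same mean absolute error, but the list with a single large miss has the higher RMSE, because squaring weights that miss more heavily.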
