Python Fundamentals | Research Design | Statistics | Supervised Learning |
---|---|---|---|
Float
"What will be the output of the following command?
print(type(1/3)) |
1) Calling people on landlines
Give an example of a survey technique that is vulnerable to selection bias.
|
1) Number of children people have
2) Number of cars people own
What is an example of a distribution that is commonly skewed left? How is this distribution represented in a histogram?
|
Machine Learning is the training of a model from data that generalizes a decision against a performance measure.
What is machine learning?
|
Blue
colors = [‘red’, ‘white’, ‘blue’, ‘green’]
What will be the value of colors[2]? |
In regression, you are measuring the distance between the predicted and actual values. For classification, you can't measure the distance because the outcomes are categorical. Instead, you are measuring the percentage of values that were predicted correctly.
What is the difference between evaluating a regression and classification model? Why is there a difference?
|
Mean > median
If a distribution is skewed right, which of the following describes the relationship between the distribution's mean and median and why? 1) Mean < median, 2) Mean > median, 3) Mean = median
|
When the x-variable is perfectly X times y, and X does not exist without y. E.g., y = 2x, when x equals number of sandwiches and y equals number of slices of bread.
Under what circumstances would the intercept in a linear regression equal zero and contain a non-zero coefficient? What's an example?
|
df[(df["first_name"].notnull()) & (df["nationality"] == "UK") & (df["age") < 30)]
Select all rows in df where nationality is UK, age is less than 30, and name is NOT NULL.
name nationality age Jason USA 42 Molly USA 52 NaN France 36 NaN UK 24 NaN UK 70 |
If you "train" the model on the first 2/3 of your dataset vs. the second 2/3, your scores will be different because you are using different data to test and train. K-folds ensures that we use ALL of the data in both a testing and training capacity.
What is the "problem" with splitting data into training and testing sets that k-folds cross validation helps alleviate?
|
Absolutely nothing!
If 10 is added to every value of a data set, what happens to the standard deviation?
|
Normal distribution - histogram. No collinearity - correlation matrix. Residuals have equal variance - residual plot. Observations are independent - rigorous data collection.
What are the assumptions required for a linear regression model? How do we check for them?
|
18, 18
x = 12
y = 18 x = y y = x print(x,y) What will be the output? |
If you have very large errors, they will be exacerbated by RMSE or mean squared error because you are squaring the values
Under what circumstances would the root mean squared error (RMSE) or the mean squared error be higher than the mean absolute error?
|
Data is likely non-linear and we should not be using OLS
What conclusions can we draw from this residual plot?
https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/05model_check/residual_fitted_alpha_plot.gif |
R2 will INCREASE.
What is the effect on R2 in an OLS model if outliers are removed?
|
for element in x:
i = 0 w = element + i print (w)
Write a function that loops through the following list and prints the cumulative sum for each element.
x = [10, 25, 70, 100, 500] |
Thus, for a male, the odds of being admitted are 5.44 times larger than the odds for a female being admitted.
How would we interpret the following logarithmic regression equation?
https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/ |
The price is predicted to increase 1767.292 when the foreign variable goes up by one, decrease by 294.1955 when mpg goes up by one, and is predicted to be 11905.42 when both mpg and foreign are zero.
How would we interpret the coefficients of the following table?
http://dss.princeton.edu/online_help/analysis/interpreting_regression.htm |
100%. Every point would be the "nearest neighbor" of itself.
If we used our training data as testing data in KNN, what would be our accuracy rate and why?
|