Strike Three - A Titanic Cockup
Just too many things are broken in this exercise (Coding Exercise 3: Encoding Categorical Data for Machine Learning) to NOT frustrate people. Do they have any statistics on how many people got it right the first time? It's just too much of a leap to make without ChatGPT's help.
Identify categorical features.
Really? How are you supposed to know to ignore Name, Ticket, Cabin, etc.? And why the hell should Pclass be categorical when it's numeric? Can't you do a better job of explaining?
Define the problem - what are we setting out to do? If that were clear, we'd know that we have no need for Name, Ticket, etc. And yes, because Pclass *is* numeric but the numbers are meaningless (as you would get if you just assigned natural numbers to names), you have to encode using one-hot. Can't you have a few notes urging people to think along these lines? Crystal, did you actually go through this exercise?
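To make the Pclass point concrete, here is a minimal sketch of one-hot encoding a numeric class label so that no model reads 3 as "three times" 1. The `toy` DataFrame and its values are made up for illustration; only the column name Pclass comes from the exercise.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3]})

# Treat the class numbers as labels rather than magnitudes:
# one 0/1 indicator column per distinct class.
pclass_onehot = pd.get_dummies(toy["Pclass"], prefix="Pclass", dtype=int)
print(pclass_onehot)
```

Each row ends up with exactly one indicator set, which is the whole idea of one-hot encoding.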
During the lesson, we used X as the input to the ColumnTransformer since we had just one column to hit. Now, we want to call out columns by name, so we can't do that, but have to use the dataset. Why not spend a few words on this during the lesson?
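A few words on this could look like the sketch below: scikit-learn's ColumnTransformer accepts column names only when it is handed a pandas DataFrame, so the DataFrame goes in rather than a NumPy array sliced out of it. The `toy` data and the choice of columns are assumptions for illustration, not the course's solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "female", "male"],
    "Age": [38.0, 22.0, 26.0, 35.0],
})

# Name the columns to encode; everything else passes through unchanged.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["Pclass", "Sex"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Fit on the DataFrame itself so the string column names can be resolved.
enc_X = ct.fit_transform(toy)
print(enc_X.shape)
```

Here the encoded columns come first (three for Pclass, two for Sex), followed by the passed-through Age column.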
And, having talked about real features, where the h are you picking the useful stuff from the dataset for X? In the solution, all you do is assign the output of the transformer. But, think now about what columns are actually useful! ID? Fare? Ticket?
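Picking the useful columns could be as simple as the sketch below. Which columns count as useful is a judgment call; the `drop_cols` list is a hypothetical choice, not the exercise's answer, and the two rows reuse only the values visible in the error output further down, with an illustrative Sex column and Survived column added.

```python
import pandas as pd

# Two illustrative rows; Sex and Survived are filled in for the sketch.
toy = pd.DataFrame({
    "PassengerId": [2, 4],
    "Survived": [1, 1],
    "Pclass": [1, 1],
    "Name": [
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Sex": ["female", "female"],
    "Ticket": ["PC 17599", "113803"],
    "Cabin": ["C85", "C123"],
})

# Hypothetical choice: identifier-like columns carry no general signal.
drop_cols = ["PassengerId", "Name", "Ticket", "Cabin"]

# Features are whatever is left once the identifiers and the target are removed.
X = toy.drop(columns=drop_cols + ["Survived"])
y = toy["Survived"].values
print(list(X.columns))
```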
And, biggest of all, y is "Survived" - which is already 0,1 - why the h do you need to do LabelEncoding on this?
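A quick check of that claim, using a made-up 0/1 vector in place of the real Survived column: LabelEncoder maps the sorted unique values to 0, 1, ..., so labels that are already 0 and 1 come back unchanged.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the Survived column, which is already 0/1.
y = np.array([0, 1, 1, 0, 1])

# LabelEncoder assigns 0..n-1 to the sorted unique labels.
y_encoded = LabelEncoder().fit_transform(y)

# True when encoding changed nothing.
print(np.array_equal(y, y_encoded))
```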
Extraordinarily frustrating. Well, you get what you pay for. If it's free through Gale Presents Udemy, what do you expect?
Whether I do
# Print the updated matrix of features and the dependent variable vector
print("Updated matrix of features:\n", enc_X)
print("UPdated dependent variable vector:\n", y)
Or
print(enc_X)
print(y)
at the end, I get the same:
Arrays are not equal
(shapes (183, 11), (183, 17) mismatch)
x: array([[2, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', ...,
71.2833, 'C85', 'C'],
[4, 1, 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', ..., 53.1,...
y: array([[1.0, 0.0, 1.0, ..., 'PC 17599', 71.2833, 'C85'],
[1.0, 0.0, 0.0, ..., '113803', 53.1, 'C123'],
[0.0, 1.0, 0.0, ..., '17463', 51.8625, 'E46'],...
Why should y be something weird like that? My y is only 0's and 1's from the Survived column, taken with dataset.iloc[:,1].values.
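For what it's worth, the message has the look of numpy.testing.assert_array_equal output. If that is what the grader runs (an assumption, since the grader's code isn't shown), then x and y in the message are just labels for the two arrays being compared, presumably the submitted array and the expected one, and not the exercise's X and y. A minimal sketch with made-up arrays that triggers the same kind of shape-mismatch message:

```python
import numpy as np

# Made-up arrays with different column counts, standing in for a
# submitted result and an expected result.
submitted = np.zeros((2, 3))
expected = np.zeros((2, 5))

try:
    np.testing.assert_array_equal(submitted, expected)
    message = ""
except AssertionError as err:
    # The message reports both shapes and a preview of each array.
    message = str(err)

print(message)
```

Read that way, the 11 versus 17 in the original message says the submitted array has fewer columns than the grader expected.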
Like I said, strike 3.