Kirill's ML from A to Z

Posts

Showing posts from December, 2024

Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers

December 20, 2024

Wow - giving it away for free in the hope that it helps build a better world: https://core.ac.uk/reader/334586725

Chapters: Machine Learning; Machine Learning and Knowledge Discovery; Support Vector Machines for Classification; Support Vector Regression; Hidden Markov Model; Bioinspired Computing: Swarm Intelligence; Deep Neural Networks; Cortical Algorithms; Deep Learning; Multiobjective Optimization; Machine Learning in Action: Examples

Your Handy ML Reference

December 20, 2024

Support Vector Machines for Classification - Classifying data points - Applications: Image recognition
Support Vector Regression - Predicting continuous values - Applications: Stock price forecasting
Hidden Markov Model - Modeling sequences - Applications: Speech recognition
Bioinspired Computing: Swarm Intelligence - Distributed optimization - Applications: Robotics
Deep Neural Networks - Multi-layer perceptron - Applications: Language translation
Cortical Algorithms ...

AttributeError: 'Series' object has no attribute 'reshape' - When You Want to Concatenate Two Columns for Looking at LR Model Performance (Error)

December 19, 2024

y_test is not an ndarray. It's a pandas.core.series.Series object. What did that to me? It's coming out of train_test_split (sklearn.model_selection), which was given a y that is a Series object and politely returned the same type. What happened? When you created your y from the input data, did you leave out the .values? I did. I had left it out intentionally when creating X, so that I could use a cute snippet to automatically (without visual inspection - you know me, I'm Mr. Automation) find the non-numeric columns to subject to one-hot encoding. Then, typing stuff manually (a good reason to use a template and make edits), I did the same when creating y. If you have y = dataset.iloc[:, -1].values, you get an ndarray and all is well. Be warned :) Why do we care? Because, to concatenate two vectors as two columns side by side, you need to use reshape, which an ndarray has and a Series does not: cmp_matrix = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1)
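Here is a minimal sketch of the fix, assuming y_test is still a Series and y_pred is an ndarray. The values are made up for illustration, and to_numpy() is used as one way of getting the ndarray (the .values attribute would do the same job here).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real test labels and model predictions.
y_test = pd.Series([10.0, 20.0, 30.0], name="target")  # Series, as train_test_split hands back
y_pred = np.array([11.0, 19.5, 29.0])                  # ndarray, as predict() hands back

# Convert the Series to an ndarray first; only then is reshape available.
y_test_arr = y_test.to_numpy()

# Two column vectors side by side: predictions on the left, actual values on the right.
cmp_matrix = np.concatenate(
    (y_pred.reshape(-1, 1), y_test_arr.reshape(-1, 1)),
    axis=1,
)
print(cmp_matrix)
```

Passing -1 to reshape lets NumPy work out the number of rows, which saves typing len(...) twice.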

Backward Elimination in Linear Regression Model Building

December 18, 2024

Backward Elimination

1. Select a significance level to stay in the model (e.g. SL = 0.05).
2. Fit the full model with all possible predictors.
3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise go to FIN.
4. Remove the predictor.
5. Fit the model without this variable, then go back to step 3.
FIN - you're done.

Congratulations - you've applied LR to build an ML model! If you're using Scikit-Learn, note that LinearRegression fits every predictor you give it and does not report P-values, so it will not do BE for you; if you want to see how BE is done, check out HdP's videos on DropBox
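The loop above can be sketched roughly as follows. This is not the code from the videos: it is a bare-bones illustration that computes ordinary least squares P-values with NumPy and SciPy, and the function names, the synthetic data and the column names are all made up for the example.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided P-values of OLS coefficients. X must already contain an intercept column."""
    n, k = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Repeatedly drop the predictor with the highest P-value while that P-value exceeds sl."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = ols_p_values(design, y)[1:]  # ignore the intercept's P-value
        worst = int(np.argmax(p))
        if p[worst] <= sl:
            break                        # FIN: every remaining predictor is significant
        keep.pop(worst)                  # remove the predictor, then refit
    return [names[i] for i in keep]


# Synthetic data: y depends on x0 and x1 only; the third column is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X, y, ["x0", "x1", "noise"])
print(selected)
```

The two real predictors should always survive. The noise column is usually dropped, but at SL = 0.05 it will occasionally sneak through by chance.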

The Dummy Variable Trap

December 18, 2024

Watching this lecture, I felt like I was seeing a case of the right hand not knowing what the left hand was doing, since Hadelin de Ponteves had already pointed out the need to transform a "name" or "state" feature using one-hot encoding with ColumnTransformer. KE is doing the same thing and calling it the creation of dummy variables. Cool stuff - always drop one of the "dummy variables" you generate using one-hot encoding, because the dropped column is fully determined by the others and keeping it makes the columns perfectly collinear. chatG: Tools like pandas.get_dummies (with drop_first=True) and OneHotEncoder (with drop='first') in sklearn can automatically exclude one dummy variable.
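A small sketch of both options, using a made-up "state" column. The data is hypothetical; drop_first and drop='first' are the actual parameter names in pandas and scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with three distinct values.
df = pd.DataFrame({"state": ["New York", "California", "Florida", "California"]})

# pandas: drop_first=True leaves out the first category (alphabetical order).
dummies = pd.get_dummies(df["state"], drop_first=True)
print(list(dummies.columns))

# scikit-learn: drop="first" does the same inside the encoder.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(df[["state"]]).toarray()
print(encoded.shape)
```

Three categories become two columns either way; a row of all zeros stands for the dropped category.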

When Can You Safely Use Linear Regression?

December 18, 2024

According to Kirill, only when you have:

Linear relationship
Homoscedasticity (equal variance)
Multivariate normality (a bimodal distribution would be a disqualifier)
Independence - lack of autocorrelation (a stock price depends on its past values)
Lack of multicollinearity - independent variables should not influence each other
Lack of outliers
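As one illustration, the multicollinearity item can be eyeballed with a correlation matrix. This is only a rough screen on synthetic data, and the 0.9 threshold is an arbitrary choice for the example rather than anything from the course.

```python
import numpy as np

# Synthetic features: x2 is nearly a rescaled copy of x1, while x3 is independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the columns.
corr = np.corrcoef(X, rowvar=False)

# Flag pairs of features whose absolute correlation exceeds the (arbitrary) threshold.
threshold = 0.9
n_features = X.shape[1]
suspect_pairs = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > threshold
]
print(suspect_pairs)
```

A flagged pair means one of the two columns is a candidate for removal before fitting.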

What these Courses Don't Teach You

December 18, 2024

Given some data, sure, you can follow what they tell you about cleaning it and using it to predict values based on new inputs. But how are you supposed to generate the data in the first place? A friend who used to work at Amazon said you need to invest in generating the data. Maybe chatG can suggest ways to generate data based on the problem you're trying to solve. Which course can teach you to do something like what Google DeepMind did - train an AI to play a game by playing against itself (a copy of itself)? Tough?

Reflection - ML Flow Up to Scaling

December 17, 2024

Data Preparation Steps: Preparing Data for Machine Learning in Python

This guide outlines the critical steps for preparing data in machine learning pipelines.

Collect resources: Import the necessary Python modules such as pandas, numpy, and sklearn. For readability, it is a good idea to import a module just before it is used.

Read in dataset: Use pd.read_csv and pay attention to the separator used, if it isn't commas. Follow convention: use X (capital) for the feature columns and y for the output vector. If feature columns are not contiguous, use: X = dataset.iloc[:, [0] + list(range(2, dataset.shape[1]))].

Clean data: Use the SimpleImputer class from the sklearn.impute module to substitute the column's mean for missing values in the feature vector: ...
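A compact sketch of the read-and-clean steps described so far. The CSV content, the semicolon separator and the column names are invented for the example, and the target is assumed to sit in column 1 so that the non-contiguous selection above makes sense.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical semicolon-separated file with two missing values.
csv_text = (
    "age;bought;salary;years\n"
    "25;0;50000;2\n"
    ";1;64000;5\n"
    "35;0;;7\n"
)
dataset = pd.read_csv(io.StringIO(csv_text), sep=";")

# Features are column 0 plus columns 2 onward; column 1 is the output vector.
X = dataset.iloc[:, [0] + list(range(2, dataset.shape[1]))].values
y = dataset.iloc[:, 1].values

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)
print(X)
```

The missing age becomes 30 (the mean of 25 and 35) and the missing salary becomes 57000 (the mean of 50000 and 64000).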

Strike Three - A Titanic Cockup

December 16, 2024

Just too many things are broken in this exercise (Coding Exercise 3: Encoding Categorical Data for Machine Learning) to NOT frustrate people. Do they have any statistics on how many people got it right the first time? It's just too much of a leap to make without chatG's help. Identify categorical features. Really? How are you supposed to know to ignore Name, Ticket, Cabin, etc.? And why the hell should Pclass be categorical when it's numeric? Can't you do a better job of explaining? Define the problem - what are we setting out to do? If that were clear, we'd know that we have no need for Name, Ticket, etc. And yes, because Pclass *is* numeric but the numbers are meaningless (as you would get if you just assigned natural numbers to names), you have to encode it using one-hot. Can't you have a few notes urging people to think along these lines? Crystal, did you actually go through this exercise? During the lesson, we used X as the input to the ColumnTransforme...
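To make the Pclass point concrete, here is a toy sketch. It is not the exercise's solution: the four rows are invented, and the choice of which columns to drop or encode is an assumption for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up slice in the spirit of the Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Name": ["A", "B", "C", "D"],
})

# Name identifies a passenger rather than describing one, so it is left out.
X = df.drop(columns=["Name"])

# Pclass is stored as numbers but is really a label, so it is one-hot encoded like Sex.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["Pclass", "Sex"])],
    remainder="passthrough",  # Age goes through unchanged
    sparse_threshold=0,       # always return a dense array
)
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)
```

Three Pclass columns, two Sex columns and Age give six columns in total.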

Finding Non Numeric Columns

December 16, 2024

Deja vu - I had done something like this when Dataquest was open for a few days. chatG offers three ways:

Look for columns whose dtype is 'object' or 'category' - you like?
Look for columns excluding np.number: dataset.select_dtypes(exclude=[np.number]).columns
Collect columns for which pd.api.types.is_numeric_dtype is False: non_numeric_columns = [col for col in dataset.columns if not is_numeric_dtype(dataset[col])]
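All three can be tried side by side on a toy frame. The data and column names are made up. Depending on the pandas version, text columns may carry a dedicated string dtype rather than 'object', in which case the first approach can miss them, so treat it as the least robust of the three.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical dataset mixing numeric and text columns.
dataset = pd.DataFrame({
    "age": [25, 32],
    "country": ["France", "Spain"],
    "salary": [50000.0, 64000.0],
    "purchased": ["No", "Yes"],
})

# 1. Include only 'object' or 'category' columns.
by_include = list(dataset.select_dtypes(include=["object", "category"]).columns)

# 2. Exclude everything numeric.
by_exclude = list(dataset.select_dtypes(exclude=[np.number]).columns)

# 3. Test each column individually.
by_check = [col for col in dataset.columns if not is_numeric_dtype(dataset[col])]

print(by_include, by_exclude, by_check)
```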

No, no, n, N, nO --> 0 and Y, y, Yes, yEs, yeS --> 1 Before LabelEncoder

December 16, 2024

You'd think they'd have more user-friendly stuff, but nope. Scikit-Learn's LabelEncoder can comfortably do Y,N,N,N,Y,Y --> 1,0,0,0,1,1, and the same for Yes, No, but it will struggle with anything more stressful, which a human would take in her stride: mixed spellings such as no, N and nO each become a separate class. What *can* one do? Transform the input data column, that's what - to make it edible for LabelEncoder. If you're still a pd.Series, just do: y = y.str.replace(r'^[nNyY].*', lambda m: m.group(0)[0].upper(), regex=True) And that's it - you're then cleared to do le = LabelEncoder() y = le.fit_transform(y) You, gentle reader, will easily be able to extend this to the case of Male, Female 😊
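Put together as a runnable sketch, with a made-up column that contains every spelling from the title:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical messy yes/no column.
y = pd.Series(["No", "no", "n", "N", "nO", "Y", "y", "Yes", "yEs", "yeS"])

# Collapse every value starting with n/N or y/Y to its upper-cased first letter.
y = y.str.replace(r"^[nNyY].*", lambda m: m.group(0)[0].upper(), regex=True)

# Only two distinct labels remain, "N" and "Y", which encode to 0 and 1.
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(list(le.classes_), list(y_encoded))
```

LabelEncoder numbers the classes in sorted order, which is why N ends up as 0 and Y as 1.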

All the Things that Can Go Wrong

December 16, 2024

When I import numpy, I get: <frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject. chatG says to upgrade numpy. I ask how I can do that from within a notebook: !pip install --upgrade numpy. That gets me stuck in [*] forever. I exit, and when I fire up Jupyter again, I find numpy can't be imported. So I decide to install it using Anaconda. But Anaconda's Navigator never gets interactive - I'm stuck in "Loading applications". Worse, there's no way to kill it - Task Mgr doesn't show anything, and there's no way to get rid of the green circular splotch on the screen. I reboot and try the Anaconda prompt (after having to reboot again because of the green splotch): conda update numpy gets me that. But now, when I import pandas, I get an error about np.bool. When I try conda update pandas, I get "All requested packages already installed". Now, trying ...

Strike Two - No Way to Check Code Output When Coding in the Integrated Code Editor

December 16, 2024

Nice, a video shows you the way and now (good), the exercise has a little extra. But there's no way to run your code and look at the output. All you get is a pass/fail. Am I supposed to be happy about this? Is this a Udemy limitation?

Feature Scaling and Encoding

December 14, 2024

Why do you need F/S? Simple - you don't want some features dominating your model on account of the larger numbers attached to their physical units. Large and small don't mean anything across features measured in different units. Therefore, you rescale each input feature. Standardization subtracts the mean and divides by the standard deviation, so the feature ends up with an average of 0 and most values within about +/- 3 sigma. Normalization maps the feature's minimum and maximum to 0 and 1. What about encoding? That's to take care of non-numerical values - names, names of classes, etc. You just assign numerical labels (1, 2, 3, etc.).
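A quick sketch of the two scalings on made-up age and salary columns. StandardScaler and MinMaxScaler are scikit-learn's classes for standardization and normalization; the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age in years, salary in dollars.
X = np.array([
    [25.0, 50000.0],
    [35.0, 64000.0],
    [45.0, 120000.0],
    [55.0, 58000.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Normalization: each column is mapped onto the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_norm.min(axis=0), X_norm.max(axis=0))
```

After either transform, salary no longer outweighs age just because its raw numbers are thousands of times larger.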

Five Ways to Use chatGPT to Up Your Python Game

December 13, 2024

https://www.youtube.com/watch?v=Bw7pAYv6iaM

1. End-goal-based library suggestions
2. Code debugging (bad code and errors - needing workarounds)
3. Code generation
4. Translation: R <--> Python (impressive - before your very eyes, KE takes a script from U of Cinci's page on MT cars, converts it to Python, runs it on Colab and gets a plot)
5. Article summarization (you need to keep up with progress in the field and chatG can help)

Strike One - Moving On Without Taking Care of the Late Adopters

December 13, 2024

This is the appetizer logistic regression demo - Part 3, Section 14. What's broken: In your Colab, each time, you get ValueError: 'salmon' is not a valid color value. (You'd think that after all these years, and so many questions about this from the early adopters, they would have fixed a "resource" that folks are going to be downloading, but... they're obviously busy blazing new trails.) In the ZIP file you download, the code has just red and green, not salmon and dodgerblue. Then, when you do change from salmon and dodgerblue on Colab, you get "session crashed because of using all available RAM". Come on, guys. Give us a break. At least, on an undoctored Jupyter install, the downloaded notebook runs without any issues. So there is some sort of "currency". And this was referred to me by Crystal Taggart as "hands down, the best ML course."