Machine Learning

Algorithm

Definition

An algorithm is a step-by-step set of instructions or rules used to solve a problem or perform a task.

Machine Learning

Definition

Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed.

Labeled Data

Definition

Labeled data is data that has been tagged or annotated with the correct output or classification.

Unlabeled Data

Definition

Unlabeled data is data that has not been tagged or annotated with the correct output or classification.

Machine Learning Types

Clustering

Definition

Clustering is the process of grouping similar data points together based on their features. It is an unsupervised learning method, because it works on unlabeled data.

Classification

Definition

Classification is the process of assigning data points to predefined categories. It is a supervised learning method, because it learns from labeled data.

Regression

Definition

Regression is the process of predicting continuous values from input data. It is a supervised learning method, because it learns from labeled data.

Reinforcement Learning

Definition

Reinforcement learning is a method of machine learning where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties.

Clustering Algorithms

Regression Algorithms

Classification Algorithms

Simple linear regression

Simple linear regression is a statistical method that models the relationship between a dependent variable $y$ and a single independent variable $x$ by fitting a linear equation to observed data. The equation is typically written as follows, where $m$ is the slope and $b$ is the intercept:

$$ y = mx + b $$
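As a minimal sketch of how $m$ and $b$ can be estimated, the code below assumes an ordinary least-squares fit, which the text does not name explicitly. The function name and the sample data are illustrative only.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (m, b) minimising the sum of squared residuals of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx  # assumes the x values are not all identical
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Illustrative data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_simple_linear_regression(xs, ys)
```

On noisy data the same function returns the line with the smallest total squared vertical distance to the points.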

Multiple linear regression

Multiple linear regression is a statistical technique that models the relationship between a dependent variable $y$ and two or more independent variables $x_1, \dots, x_n$ by fitting a linear equation to observed data. The equation is typically written as follows, where $b_0$ is the intercept and $b_1, \dots, b_n$ are the coefficients of the independent variables:

$$ y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n $$

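There also exists a matrix representation of the multiple linear regression equation. Here $n$ is the number of observations, $p$ is the number of independent variables, $a_{i,j}$ is the value of variable $j$ in observation $i$, and $\theta_0, \dots, \theta_p$ are the coefficients (the $b_j$ of the scalar form). The column of ones multiplies the intercept $\theta_0$:

$$ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & a_{1,1} & \cdots & a_{1,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & a_{n,1} & \cdots & a_{n,p} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_p \end{bmatrix} $$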
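Read as the product of the design matrix (the matrix with the leading column of ones) and the coefficient vector $\theta$, the matrix form can be fitted by least squares. The sketch below assumes the normal equations $(X^T X)\theta = X^T y$ as the fitting method, which the text does not state. The helper names and the data are hypothetical, and it uses only the standard library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_multiple_linear_regression(features, targets):
    """Return [theta_0, ..., theta_p]; assumes linearly independent columns."""
    # Design matrix: a leading 1 in every row carries the intercept theta_0.
    design = [[1.0] + list(row) for row in features]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Illustrative data generated from y = 1 + 2*x1 + 3*x2 with no noise.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]
theta = fit_multiple_linear_regression(features, targets)
```

In practice a numerical library would usually replace the hand-written solver; it is spelled out here only to keep the example self-contained.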

Logistic regression

Definition

Logistic regression is a statistical method used for binary classification problems. It models the probability of a binary outcome based on one or more independent variables.

Sigmoid Function

The logistic regression model uses the sigmoid function to map any real value into a probability between 0 and 1:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
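A short sketch of the sigmoid and of how a logistic regression model might use it. The weights, bias, and the function name `predict_probability` are illustrative assumptions, not part of the original notes.

```python
import math


def sigmoid(z):
    """Map any real z to a value between 0 and 1."""
    # Two algebraically equal branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    # z is a linear combination of the independent variables.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model with two features.
probability = predict_probability([1.0, -1.0], 0.0, [2.0, 2.0])
```

A common convention is to predict the positive class when the probability is at least 0.5, which corresponds to $z \ge 0$.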

Correlation matrix

Definition

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
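The sketch below builds such a table, assuming the Pearson correlation coefficient (the notes do not say which coefficient is meant). The variable names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # assumes neither variable is constant


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between variables a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Illustrative data: "score" rises with "hours", "errors" falls with it.
data = {
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [2.0, 4.0, 6.0, 8.0],
    "errors": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(data)
```

The matrix is symmetric, and every diagonal cell is 1 because each variable is perfectly correlated with itself.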

Evaluation metrics of Machine Learning

Clustering

Definition

Clustering is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar to each other than to those in other groups.

Dimensionality reduction

Definition

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.

Dimensionality Reduction Algorithms

K-means clustering

Definition

K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into k distinct, non-overlapping clusters based on the similarity of data points. It uses unlabeled data to identify patterns.

K-Means Algorithm

  1. Select the number of clusters $k$ and randomly initialize $k$ centroids.

  2. Assign each data point to the nearest centroid, forming $k$ clusters.

  3. Update the centroids by calculating the mean of all data points in each cluster.

  4. Repeat steps 2 and 3 until convergence, that is, until the cluster assignments no longer change.
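The four steps above can be sketched as follows. This is a minimal version under stated assumptions: initial centroids are drawn from the data points themselves, convergence means the assignments stop changing, and an empty cluster keeps its previous centroid. The `seed` and `max_iterations` parameters are illustrative additions.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster a list of coordinate tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Illustrative data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = k_means(points, k=2)
```

Because the result can depend on the random initialisation, it is common to run the algorithm several times with different seeds and keep the best clustering.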

Remark

K-Means typically uses the Euclidean distance $d$ to measure how close a data point is to a centroid. For two points $(x_1, y_1)$ and $(x_2, y_2)$ in the plane:

$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

K-Means Optimisation with elbow method

The elbow method is a heuristic used to determine the optimal number of clusters $k$ in K-Means clustering. It involves plotting the within-cluster sum of squares (WCSS), also called the sum of squared errors (SSE), against the number of clusters and identifying the elbow point, where the rate of decrease changes sharply. In the formula below, $C_i$ is the $i$-th cluster and $\mu_i$ is its centroid:

$$ \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$
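To make the quantity on the vertical axis of the elbow plot concrete, here is a small sketch that evaluates the double sum for a given partition. It assumes the clusters are already known and non-empty; the function name and the data are illustrative.

```python
def within_cluster_sum_of_squares(clusters):
    """Sum of squared distances from each point to its cluster mean.

    `clusters` is a list of non-empty lists of coordinate tuples.
    """
    total = 0.0
    for members in clusters:
        # mu_i: coordinate-wise mean of the points in this cluster.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((x - mu) ** 2 for x, mu in zip(point, centroid))
    return total


# Illustrative data: the same four points as one cluster, then as two.
one_cluster = [[(0, 0), (0, 2), (10, 0), (10, 2)]]
two_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
wcss_k1 = within_cluster_sum_of_squares(one_cluster)
wcss_k2 = within_cluster_sum_of_squares(two_clusters)
```

Splitting the data into its two natural groups lowers the value considerably. Adding more clusters beyond that yields smaller gains, and this levelling-off is the elbow the method looks for.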

K-nearest neighbors, KNN

Definition

K-nearest neighbors is a non-parametric method used for classification and regression. In both cases, the prediction for a new data point is computed from the $k$ closest training examples in the feature space. It is a supervised method: it relies on labeled data.

K Nearest Neighbors Application

To classify a new data point into one of the two existing categories, we use the KNN algorithm. Based on the spatial proximity of points, KNN assigns to the new point the category that contains the most of its $k$ nearest neighbours.

K Nearest Neighbors Algorithm

  1. Choose the number of neighbors $k$.

  2. Calculate the distance between the new data point and all training data points.

  3. Identify the $k$ nearest neighbors based on the calculated distances.

  4. For classification, assign the most common class among the $k$ neighbors to the new data point. For regression, calculate the mean value of the $k$ neighbors and assign it to the new data point.
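The steps above can be sketched as follows. The example assumes Euclidean distance and a plain majority vote; the function names and the data are illustrative.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Indices of the k training points closest to `query` (steps 2 and 3)."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return order[:k]


def knn_classify(train_points, train_labels, query, k):
    """Step 4, classification: most common label among the k neighbours."""
    neighbours = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbours)
    # On a tie, most_common returns the label that was counted first.
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k):
    """Step 4, regression: mean target value of the k neighbours."""
    neighbours = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in neighbours) / k


# Illustrative training set with two groups of points.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_labels = ["a", "a", "a", "b", "b", "b"]
train_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_classify(train_points, train_labels, (1.5, 1.5), k=3)
predicted_value = knn_regress(train_points, train_values, (1.5, 1.5), k=3)
```

For two-class problems, an odd $k$ avoids tied votes.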

Remark

The most commonly used algorithms for KNN use the following distance metrics:

Elements of a Confusion Matrix

Confusion Matrix

Definition

A confusion matrix is a table used to evaluate the performance of a classification model. It shows the counts of correct and incorrect predictions for each class.
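A minimal sketch of how the counts can be tallied. The layout (rows for actual classes, columns for predicted classes) is an assumption, since the notes do not fix an orientation, and the labels are made up.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] counts samples of actual class labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Illustrative predictions of a hypothetical two-class model.
actual = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]
matrix = confusion_matrix(actual, predicted, ["cat", "dog"])

# Correct predictions sit on the diagonal of the matrix.
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / len(actual)
```

Off-diagonal cells show which classes the model confuses with one another, which a single accuracy figure hides.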

Classification metrics

Regression metrics