
[All About AI] A Guide to Machine Learning Fundamentals

December 11, 2024
AI has revolutionized people’s lives. For those who want to gain a deeper understanding of AI and use the technology, the SK hynix Newsroom has created the All About AI series. This second episode covers an overview of one of the most significant branches of AI, machine learning.

 

What is Machine Learning?

As introduced in the first episode of the All About AI series, machine learning is an approach in which algorithms autonomously learn patterns from data to make predictions. It has become the dominant methodology in AI, especially with the explosive growth of data in recent years. In contrast, traditional AI models required humans to explicitly program rules and logic. While these models worked well for problems with clear, defined rules, such as simple board games, they struggled with complex rules and datasets. For example, consider the task of building a rule-based image recognition application to identify cats: a developer would have to process countless pixels, interpret color data, and hand-craft rules that capture a cat's distinguishing features, which illustrates why such projects are so difficult.


Figure 1. Machine learning models can distinguish dogs from cats by learning from large datasets of labeled images

 

Machine learning involves training a system to discover complex structures and patterns within data, enabling it to learn these patterns independently and make predictions based on new data. For example, imagine that the aforementioned cat-recognition software is instead built using machine learning. In this case, a large collection of photos is gathered to train the algorithm, allowing it to learn how to identify cats on its own.

Types and Characteristics of Machine Learning Algorithms

Machine learning algorithms extract data from underlying probability distributions1 in the real world to train models for tackling specific problems. These algorithms can be broadly categorized into three types, each with distinct characteristics and applications based on the type of problem.

1Probability distribution: A mathematical model that describes how data is distributed, enabling the identification of patterns and structures within the data. It specifies how likely the various outcomes of a given system are under different conditions.

1) Supervised learning: This method involves training a model using inputs paired with correct outputs, or labels, and then analyzing these training pairs to predict outcomes for new data. As an example, imagine developing an AI application to predict a person’s gender from a photo. Here, the photo is the input, and the gender is the label. The model learns the characteristics that distinguish gender and applies this knowledge to predict the gender of individuals in new photos. Supervised learning can be further divided into two types depending on the nature of the labels:

    • Classification: This algorithm assigns data to specific categories using labels. For example, determining whether a photo contains a dog or distinguishing letters in handwritten text are classification tasks. In these scenarios, the data is assigned to a specific category and labeled accordingly.
    • Regression: This technique uses labels for continuous values. For example, predicting house prices based on factors such as size and location, or forecasting temperatures from current weather data, are typical regression problems. In these cases, the goal is to predict numerical values as accurately as possible.
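
As a rough illustration of the difference, the short Python sketch below (assuming the scikit-learn library and small synthetic datasets) trains one model with categorical labels and one with continuous labels.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the labels are discrete categories (e.g. "cat" vs. "dog").
X_cls, y_cls = make_classification(n_samples=200, n_features=4, random_state=0)
classifier = LogisticRegression().fit(X_cls, y_cls)
print("Predicted class:", classifier.predict(X_cls[:1]))  # a category such as 0 or 1

# Regression: the labels are continuous values (e.g. a house price).
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)
regressor = LinearRegression().fit(X_reg, y_reg)
print("Predicted value:", regressor.predict(X_reg[:1]))   # a real number
```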

2) Unsupervised learning: As the name suggests, unsupervised learning differs from supervised learning in that it learns from data without explicit “supervision”, meaning there are no labels provided. Instead, unsupervised learning focuses on understanding and learning the characteristics of the data’s probability distribution. Key techniques include:

      • Clustering: This method groups data with similar characteristics to identify hidden patterns in probability distributions. For example, in a real-world semiconductor manufacturing process, applying a clustering algorithm to images of defective wafers2 revealed that the defects could be categorized into several types based on their underlying causes.
      • Dimensionality reduction: This technique simplifies high-dimensional data—datasets with numerous features or variables—while retaining the most important information. Thus, it aids in data analysis and visualization. A common example is principal component analysis (PCA), which removes unnecessary information such as noise—irrelevant or random variations.

2Wafer: Generally made from silicon, wafers act as the substrate, or foundation, for semiconductor devices.
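
To give a feel for these techniques, the sketch below (a hypothetical example assuming scikit-learn and synthetic, unlabeled data) applies K-means clustering and then PCA-based dimensionality reduction.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Clustering: group unlabeled points that share similar characteristics.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(cluster_labels))

# Dimensionality reduction: project the 5-feature data onto 2 principal components,
# keeping the directions that explain most of the variance and discarding the rest.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Reduced shape:", X_2d.shape)  # (300, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```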

Generative AI, which has gained significant traction recently, generally fits into the unsupervised learning category as it learns probability distributions from data to generate new data. For example, ChatGPT learns the probability distribution of natural language to predict the next word in a given text. However, since generative AI often uses supervised learning techniques during training, there is debate about whether it should be considered purely unsupervised learning.

3) Reinforcement learning: This type of learning focuses on training a model to maximize rewards through interaction with its environment. It is particularly effective for tasks requiring sequential decision-making. Thus, it is widely used in robotics to help robots find optimal paths while avoiding obstacles, as well as in autonomous driving and AI gaming. Recently, reinforcement learning from human feedback (RLHF)3 has garnered significant attention for fine-tuning large language models (LLMs)4 such as ChatGPT, enabling them to generate responses that better align with human preferences.

3Reinforcement learning from human feedback (RLHF): A method in which a model learns from “rewards” given based on human feedback. By incorporating human evaluations of its outputs, the model adjusts its behavior to generate responses that better match human preferences.
4Large language models (LLM): Advanced AI systems trained on vast amounts of text data to understand and generate human-like text based on the context they are given.

Figure 2. A video clip showing an AI playing a brick-breaking game. In this classic example of reinforcement learning, the AI is given the instruction to break more bricks to improve its score and learns to do so on its own.
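
At the heart of reinforcement learning is a reward-driven update of a value estimate. The toy Python sketch below (a hypothetical example, assuming only NumPy) uses tabular Q-learning to teach an agent to walk along a five-state corridor toward a rewarded goal state.

```python
import numpy as np

# Hypothetical environment: states 0..4 in a corridor; the agent starts at state 0
# and receives a reward of +1 for reaching state 4. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:  # an episode ends when the goal state is reached
        # Epsilon-greedy selection: explore occasionally, otherwise take the best-known action.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = max(state - 1, 0) if action == 0 else min(state + 1, n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted best future value.
        best_next = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])
        state = next_state

print(q_table)  # after training, "right" has the higher value in every state
```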

 

Evaluating Machine Learning Performance

The ultimate goal of machine learning is to perform effectively with new, unseen data in real-world situations. In other words, it is important for the model to have generalization capabilities. To this end, it is essential to accurately evaluate and verify the model’s performance, which typically involves the following steps.

1) Choosing Performance Metrics
The choice of performance metrics depends on the type of problem being addressed. In classification problems, common performance metrics include:

    • Accuracy: Refers to the proportion of correct predictions. For example, if a medical test correctly diagnoses 95 out of 100 cases, the accuracy rate is 95%. However, a meaningful accuracy assessment requires a balanced dataset. If 95 out of 100 samples are negative and only five are positive, a model predicting all samples as negative would still achieve 95% accuracy. This high level of accuracy can be misleading, as the model would fail to identify any positive samples.
    • Precision and recall: Precision measures the proportion of actual positives among predicted positives, while recall measures the proportion of correctly predicted positives among actual positives. These metrics have a trade-off relationship, so it is essential to tune the model with the right balance between precision and recall for the task at hand. For example, in medical testing, recall is crucial for detecting as many cases of a condition as possible, while precision might matter more for tasks such as spam filtering. To capture this trade-off in a single number, the F1 score5 is used to evaluate the balance between precision and recall.
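
A short numerical example may help. The sketch below (with made-up predictions, assuming scikit-learn's metrics module) computes accuracy, precision, recall, and the F1 score for a small, imbalanced set of labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical medical-test results: 1 = positive, 0 = negative.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # one false positive, one false negative

print("Accuracy :", accuracy_score(y_true, y_pred))   # 8 correct out of 10 = 0.8
print("Precision:", precision_score(y_true, y_pred))  # 2 true positives / 3 predicted positives
print("Recall   :", recall_score(y_true, y_pred))     # 2 true positives / 3 actual positives
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```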

For regression problems, performance is typically assessed using metrics such as mean squared error (MSE)6, root mean squared error (RMSE)7, and mean absolute error (MAE)8.

5F1 score: The harmonic mean of precision and recall, useful for evaluating a model’s classification performance when class imbalance exists. It ranges from 0 to 1, with higher values indicating better performance.
6Mean squared error (MSE): The average of the squared differences between predicted and actual values.
7Root mean squared error (RMSE): The square root of the MSE, providing error measurement in the same units as the actual values.
8Mean absolute error (MAE): The average of the absolute differences between predicted and actual values.
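
For reference, the short sketch below (with made-up prediction values, assuming scikit-learn and NumPy) computes all three regression metrics on the same set of predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical house-price predictions (arbitrary units).
y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 140, 190, 270])

mse = mean_squared_error(y_true, y_pred)   # average of squared errors
rmse = np.sqrt(mse)                        # same units as the prices themselves
mae = mean_absolute_error(y_true, y_pred)  # average of absolute errors
print(f"MSE={mse:.1f}  RMSE={rmse:.1f}  MAE={mae:.1f}")  # MSE=175.0  RMSE=13.2  MAE=12.5
```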

2) Performance Evaluation Methods


Figure 3. A visual representation of the train-test split method and cross-validation techniques used in machine learning

 

To evaluate machine learning models, data is usually split into training and testing sets. This helps assess how well the model generalizes to new data.

    • Train-test split: As one of the simplest methods, it involves dividing data into a training set and a test set. The model is trained on the training set, and then its predictive performance is evaluated on the test set to gauge generalization performance. Typically, 70–80% of the entire data is used for training, with the remainder reserved for testing.
    • Cross-validation: This method divides the data into K subsets, known as folds. One fold is used for testing, while the other K-1 folds are used to train the model. This process is repeated K times, and the average performance is calculated. Although cross-validation is common in traditional machine learning, it is time-intensive, making the train-test split the preferred method in deep learning.
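
The sketch below (assuming scikit-learn and its built-in Iris dataset) shows both evaluation methods side by side: a single train-test split and 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Train-test split: hold out 30% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# K-fold cross-validation (K=5): train and evaluate five times on different folds,
# then average the scores for a more stable estimate of generalization performance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Cross-validation accuracy:", scores.mean())
```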

3) Understanding Model Performance
The results from these evaluation methods provide critical feedback for improving model performance. However, two common phenomena often arise:

    • Underfitting: This occurs when a model is too simple to learn the underlying patterns in the data, resulting in poor performance on both training and testing sets. For example, consider a regression problem where the actual data follows a quadratic function but the predictive model is a linear function. In this case, the model has low expressivity, meaning it cannot represent complex functions and therefore fails to capture the underlying pattern, which leads to underfitting.
    • Overfitting: This occurs when a model is too complex and learns both the data’s patterns and its noise, performing well on training data but poorly on test or new data. To address overfitting and better assess generalization, techniques like cross-validation can be used. Evaluating the model’s performance across various data splits allows for a more accurate assessment of overfitting and helps in selecting the appropriate level of model complexity.
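
The sketch below illustrates both phenomena on synthetic data (a hypothetical quadratic relationship, assuming scikit-learn and NumPy): a degree-1 model underfits, while a very high-degree polynomial tends to overfit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a quadratic relationship plus noise.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-1, 1, 80)).reshape(-1, 1)
y = 2.0 * X.ravel() ** 2 + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare three model complexities: too simple, appropriate, and overly complex.
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")

# Typically, degree 1 underfits (both errors are high), degree 15 chases the noise
# (low training error but a worse test error), and degree 2 generalizes best.
```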

To build a model with strong generalization ability, it is widely recognized that a balance must be reached between underfitting and overfitting through methods such as regularization9. Interestingly, recent research in deep learning has revealed a double descent10 phenomenon, where increasing the model size after initial overfitting does not actually worsen overfitting but can improve generalization performance. This discovery has sparked significant research interest in the area.

9Regularization: A method to prevent overfitting by limiting model complexity or adding penalty terms.
10Double descent: A phenomenon in which model performance worsens with increasing size up to a point, then improves beyond a certain size. This challenges traditional views on overfitting in deep learning, though it remains theoretically unexplained as it is a newly observed phenomenon in deep learning.

This concludes the All About AI series, which has provided an overview of the basics of AI and machine learning. Stay tuned to the newsroom for the latest updates on AI memory technologies driving innovation in the industry.

 

<Other articles from this series>

[All About AI] The Origins, Evolution & Future of AI