Data Science: A to Z

Jan 4, 2023

A/B Testing: A statistical way of comparing two versions of a product or feature in order to determine which performs better.
Accuracy: The proportion of a model’s predictions that are correct, often reported as a percentage.
Artificial Intelligence (AI): The study and design of intelligent agents, which are systems that can reason, learn, and act autonomously.
Association Rule Learning: A method for discovering relationships between variables in large datasets.
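
To make the A/B Testing entry above more concrete, here is a minimal sketch of one common way to compare two versions: a two-proportion z-test on conversion counts. The function name and the visitor and conversion numbers are made up for illustration, and other tests may suit your data better.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: version A converts 100 of 1000 visitors, version B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone.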

Batch Learning: A type of machine learning where the model is trained on the entire dataset at once, rather than incrementally.
Bias: A systematic error or deviation from the true value in a statistical analysis.
Big Data: Very large datasets that may be too large or complex for traditional data processing techniques.
Boosting: A type of ensemble method that trains weak models sequentially, with each one focusing on the errors of those before it, and combines their predictions to create a strong model.

Classification: A type of machine learning task that involves predicting a class label for a given input.
Clustering: A type of unsupervised learning that involves grouping data into clusters based on similarity.
Correlation: A statistical measure of the strength and direction of the relationship between two variables, most often the linear relationship.
Cross-Validation: A method for evaluating the performance of a model by dividing the data into several subsets, training on all but one of them, and using each subset in turn as the holdout set.
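
As a small illustration of the Correlation entry, the sketch below computes the Pearson correlation coefficient, one common correlation measure, using only the standard library. The function name is chosen for this example.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little linear relationship.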

Decision Tree: A type of machine learning model that uses a tree-like structure to make predictions based on feature values.
Deep Learning: A subfield of machine learning that involves training artificial neural networks with many layers, typically on large datasets.
Dimensionality Reduction: The process of reducing the number of features in a dataset by selecting the most important ones or combining correlated features.
DATAcated: Dedicated to data — see Kate Strachnyi :)

Ensemble Method: A machine learning technique that combines the predictions of multiple models to create a more accurate prediction.
Evaluation Metric: A measure used to evaluate the performance of a model.

Feature: A characteristic or attribute of a data point that can be used to predict a target variable.
Feature Engineering: The process of selecting and creating features to be used in a machine learning model.
Feature Selection: The process of selecting a subset of relevant features to use in a model.

Gradient Descent: An optimization algorithm that finds the values of parameters that minimize a loss function by repeatedly stepping in the direction of the negative gradient.
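
Here is a minimal sketch of gradient descent on a one-parameter function. The function f(x) = (x - 3)^2, the starting point, and the learning rate and step count are all assumptions picked for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        # Move opposite to the gradient, scaled by the learning rate.
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

With these settings the result ends up very close to 3, the true minimum. A learning rate that is too large can make the updates overshoot and diverge instead.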

Hyperparameter: A parameter that is set prior to training a machine learning model and controls the model’s behavior.

Imbalanced Classes: A classification problem where one class significantly outnumbers the other classes.
Imputation: The process of replacing missing or invalid data with a reasonable estimate.
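
One of the simplest forms of imputation is replacing missing values with the mean of the observed ones. The sketch below assumes missing values are represented as None and that at least one value is present; the function name is hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = impute_mean([1.0, None, 3.0])  # -> [1.0, 2.0, 3.0]
```

Mean imputation is easy to apply but shrinks the variance of the column, so median or model-based imputation may be a better fit in some cases.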

K-Fold Cross-Validation: A type of cross-validation where the data is divided into K folds, and the model is trained and evaluated K times, with a different fold used as the holdout set each time.
K-Means Clustering: A type of clustering algorithm that divides the data into K clusters by assigning each point to the nearest cluster centroid and updating the centroids until the assignments stabilize.
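
The fold rotation described under K-Fold Cross-Validation can be sketched as an index generator. This is an illustrative helper, not a library function; it does not shuffle the data, which real projects usually do first (scikit-learn's KFold is a ready-made alternative).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, holdout_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size

folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one holdout set, so every data point is used for evaluation once and for training K - 1 times.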

Label: The target variable in a supervised learning problem.
Learning Rate: A hyperparameter that controls the step size at which the optimization algorithm makes updates to the model parameters.
Loss Function: A measure of how far a model’s predictions are from the ground truth labels; lower values indicate a better fit.

Machine Learning: A subfield of artificial intelligence that involves training models to make predictions or decisions based on data.
Model: A mathematical representation of a process or system that can be used to make predictions or decisions.

Neural Network: A type of machine learning model inspired by the structure and function of the brain, consisting of layers of interconnected nodes.

Overfitting: A phenomenon that occurs when a model is excessively complex and has learned patterns in the training data that do not generalize to unseen data.

Precision: The proportion of a model’s positive predictions that are actually positive, calculated as true positives / (true positives + false positives).
Predictor Variable: A feature used to predict a target variable in a machine learning model.
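
To show how Precision differs from Accuracy, the sketch below computes both from lists of true and predicted labels. The labels are made up, and returning 0.0 when there are no positive predictions is a convention chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # Assumed convention: no positive predictions were made.
    return tp / (tp + fp)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
acc = accuracy(y_true, y_pred)    # 4 of 6 predictions are correct
prec = precision(y_true, y_pred)  # 2 of 3 positive predictions are correct
```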

Regression: A type of machine learning task that involves predicting a continuous target variable.
Regularization: A technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function.
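
As a sketch of the penalty term mentioned above, the example below adds an L2 penalty (the sum of squared weights, scaled by a strength parameter) to a mean squared error loss. L2 is only one kind of regularization, and the names `l2_regularized_loss` and `lam` are chosen for this illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def l2_regularized_loss(y_true, y_pred, weights, lam=0.1):
    """MSE plus an L2 penalty that grows with the size of the weights."""
    penalty = lam * sum(w * w for w in weights)
    return mse(y_true, y_pred) + penalty
```

Because large weights raise the loss, the optimizer is pushed toward simpler models that are less likely to overfit.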

Sample: A subset of a population used to represent the entire population in statistical analysis.
Supervised Learning: A type of machine learning where the model is trained on labeled data and learns to predict the label for new, unseen data.
Support Vector Machine (SVM): A type of machine learning model that classifies data by finding the hyperplane that separates the classes with the largest margin.

Test Set: A subset of a dataset, held out from training, used to evaluate the performance of a machine learning model on unseen data.
Training Set: A subset of a dataset used to train a machine learning model.
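
A simple way to create a training set and a test set is a random split. The sketch below assumes a 20% test fraction and a fixed seed for reproducibility; both values, and the function name, are illustrative choices.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = train_test_split(range(10))
```

The two subsets never overlap, so the test set stays unseen until the model is evaluated.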

Unsupervised Learning: A type of machine learning where the model is not given any labeled training data and must discover patterns and relationships in the data on its own.

Validation Set: A subset of a dataset used to tune the hyperparameters of a machine learning model.
Variable: A characteristic or attribute that can vary or change within a dataset.

Weight: A parameter that determines how strongly an input feature influences the output of a machine learning model.
Weight Initialization: The process of setting the initial values for the weights of a neural network model.
Weight Regularization: A technique used to prevent overfitting in neural network models by adding a penalty term to the loss function based on the magnitude of the weights.
Weight Sharing: A technique used in neural networks to reduce the number of parameters by reusing the same weights across multiple connections in the network, as convolutional layers do.
Word Embedding: A representation of words as dense numeric vectors, where semantically similar words are close together in the vector space.

XGBoost: An open-source machine learning library that implements gradient-boosted decision trees, designed for fast training and strong predictive performance.

YOLO (You Only Look Once): A real-time object detection algorithm that uses a convolutional neural network to predict the bounding boxes and class labels for objects in an image.

Zero-Shot Learning: A type of machine learning where the model is able to classify examples from classes it never saw during training, based on its knowledge of other classes and their relationships.

Written by Kate Strachnyi