Data Science: A to Z
Jan 4, 2023
A
- A/B Testing: A statistical way of comparing two versions of a product or feature in order to determine which performs better.
- Accuracy: The fraction of all predictions that a model gets right, often reported as a percentage.
- Artificial Intelligence (AI): The study and design of intelligent agents, which are systems that can reason, learn, and act autonomously.
- Association Rule Learning: A method for discovering relationships between variables in large datasets.
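As an illustration of the A/B Testing entry above, here is a minimal sketch that compares two conversion rates. It assumes a two-proportion z-test, which is only one of several tests used in practice; the function name `ab_test_p_value` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (an assumed choice of test).

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis that A and B perform the same
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a difference at least this extreme if A and B were equal
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone; the threshold you compare it against (for example 0.05) is a choice made before the experiment.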
B
- Batch Learning: A type of machine learning where the model is trained on the entire dataset at once, rather than incrementally.
- Bias: A systematic error or deviation from the true value in a statistical analysis.
- Big Data: Very large datasets that may be too large or complex for traditional data processing techniques.
- Boosting: A type of ensemble method that trains weak models sequentially, each one focusing on the errors of its predecessors, and combines their predictions to create a strong model.
C
- Classification: A type of machine learning task that involves predicting a class label for a given input.
- Clustering: A type of unsupervised learning that involves grouping data into clusters based on similarity.
- Correlation: A statistical measure of the strength and direction of the relationship between two variables, commonly reported as a coefficient between -1 and 1.
- Cross-Validation: A method for evaluating the performance of a model by dividing the data into several subsets and using each in turn as the holdout set while training on the rest.
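To make the Correlation entry above concrete, here is a small sketch of the Pearson correlation coefficient, one common way to measure correlation. The function name `pearson_r` is hypothetical, and the sketch assumes both inputs have the same length and are not constant.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.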
D
- Decision Tree: A type of machine learning model that uses a tree-like structure to make predictions based on feature values.
- Deep Learning: A subfield of machine learning that involves training artificial neural networks with many layers, typically on large datasets.
- Dimensionality Reduction: The process of reducing the number of features in a dataset by selecting the most important ones or combining correlated features.
- DATAcated: Dedicated to data — see Kate Strachnyi :)
E
- Ensemble Method: A machine learning technique that combines the predictions of multiple models to create a more accurate prediction.
- Evaluation Metric: A measure used to evaluate the performance of a model.
F
- Feature: A characteristic or attribute of a data point that can be used to predict a target variable.
- Feature Engineering: The process of selecting and creating features to be used in a machine learning model.
- Feature Selection: The process of selecting a subset of relevant features to use in a model.
G
- Gradient Descent: An optimization algorithm used to find the values of parameters that minimize a loss function.
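The Gradient Descent entry above can be sketched in a few lines. This is a minimal one-dimensional version under simple assumptions: the loss is f(w) = (w - 3)^2, chosen purely for illustration, and the name `gradient_descent` and its parameters are hypothetical.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = start
    for _ in range(steps):
        # Move in the direction that decreases the loss, scaled by the learning rate
        w -= learning_rate * grad(w)
    return w


# Illustrative loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

With this loss and learning rate, each step shrinks the distance to the minimum by a constant factor, so `w_min` ends up very close to 3. A learning rate that is too large would instead make the updates overshoot.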
H
- Hyperparameter: A parameter that is set prior to training a machine learning model and controls the model’s behavior.
I
- Imbalanced Classes: A classification problem where one class significantly outnumbers the other classes.
- Imputation: The process of replacing missing or invalid data with a reasonable estimate.
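As a sketch of the Imputation entry above, the snippet below fills missing values with the mean of the observed ones. Mean imputation is only one simple strategy, and the choice of `None` as the missing-value marker and the name `impute_mean` are assumptions made for this example.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]
```

For example, `impute_mean([1.0, None, 3.0])` fills the gap with 2.0, the mean of the two observed values.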
K
- K-Fold Cross-Validation: A type of cross-validation where the data is divided into K folds, and the model is trained and evaluated K times, with a different fold used as the holdout set each time.
- K-Means Clustering: A type of clustering algorithm that divides the data into K clusters based on similarity.
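The K-Fold Cross-Validation entry above describes a splitting scheme that can be sketched as an index generator. This is a simplified version that does not shuffle the data first (many libraries offer that as an option); the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, holdout_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size
```

Each sample lands in the holdout set exactly once, so averaging the K evaluation scores uses every data point for both training and evaluation.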
L
- Label: The target variable in a supervised learning problem.
- Learning Rate: A hyperparameter that controls the step size at which the optimization algorithm makes updates to the model parameters.
- Loss Function: A measure of how far a model’s predictions are from the ground truth labels; training aims to minimize it.
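As an example of the Loss Function entry above, here is mean squared error, a loss commonly used for regression. It is one choice among many, and the function name is hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Perfect predictions give a loss of 0, and the loss grows quickly as predictions move away from the labels because the errors are squared.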
M
- Machine Learning: A subfield of artificial intelligence that involves training models to make predictions or decisions based on data.
- Model: A mathematical representation of a process or system that can be used to make predictions or decisions.
N
- Neural Network: A type of machine learning model inspired by the structure and function of the brain, consisting of layers of interconnected nodes.
O
- Overfitting: A phenomenon that occurs when a model is excessively complex and has learned patterns in the training data that do not generalize to unseen data.
P
- Precision: The fraction of a model’s positive predictions that are actually positive: true positives / (true positives + false positives).
- Predictor Variable: A feature used to predict a target variable in a machine learning model.
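The Precision entry above can be computed directly from labels and predictions. In this sketch, returning 0.0 when the model makes no positive predictions is an assumed convention, and the name `precision` and the `positive` parameter are hypothetical.

```python
def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that are truly positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    predicted_pos = true_pos + false_pos
    # Assumed convention: precision is 0.0 when nothing was predicted positive
    return true_pos / predicted_pos if predicted_pos else 0.0
```

Unlike accuracy, precision ignores the examples the model predicted as negative, which makes it useful when false positives are costly.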
R
- Regression: A type of machine learning task that involves predicting a continuous target variable.
- Regularization: A technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function.
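To illustrate the penalty term mentioned in the Regularization entry above, here is a sketch of a mean squared error loss with an L2 (ridge-style) penalty. The L2 penalty is one common choice; the name `l2_regularized_loss` and the strength parameter `lam` are assumptions for this example.

```python
def l2_regularized_loss(y_true, y_pred, weights, lam=0.1):
    """Mean squared error plus an L2 penalty on the model weights."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    # Larger weights are penalized more; lam controls how strong the penalty is
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty
```

Because large weights now increase the loss, the optimizer is pushed toward simpler models that tend to generalize better.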
S
- Sample: A subset of a population used to represent the entire population in statistical analysis.
- Supervised Learning: A type of machine learning where the model is trained on labeled data and learns to predict the label for new, unseen data.
- Support Vector Machine (SVM): A type of machine learning model that classifies data by finding the hyperplane that separates the classes with the largest margin.
T
- Test Set: A subset of a dataset held out from training and used to evaluate the final performance of a machine learning model.
- Training Set: A subset of a dataset used to train a machine learning model.
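The Test Set and Training Set entries above usually come from a single random split of the data. Here is a minimal sketch, assuming a simple shuffled split; `split_train_test` is a hypothetical helper, not a library function, and the 80/20 ratio is only a common default.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping the test rows completely out of training is what makes the test score an honest estimate of performance on unseen data.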
U
- Unsupervised Learning: A type of machine learning where the model is not given any labeled training data and must discover patterns and relationships in the data on its own.
V
- Validation Set: A subset of a dataset used to tune the hyperparameters of a machine learning model.
- Variable: A characteristic or attribute that can vary or change within a dataset.
W
- Weight: A model parameter that determines the strength of the influence of an input feature on the output of the model.
- Weight Initialization: The process of setting the initial values for the weights of a neural network model.
- Weight Regularization: A technique used to prevent overfitting in neural network models by adding a penalty term to the loss function based on the magnitude of the weights.
- Weight Sharing: A technique used in neural networks to reduce the number of parameters and improve model performance. It involves sharing the same weights across multiple connections in the network.
- Word Embedding: A representation of words as dense vectors in a continuous vector space, where semantically similar words are close together.
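To show what "close together" means in the Word Embedding entry above, here is a sketch using cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors below are made-up toy values for illustration, not output from any real embedding model.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy, hand-written vectors (assumed values, for illustration only)
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

royal_similarity = cosine_similarity(embeddings["king"], embeddings["queen"])
fruit_similarity = cosine_similarity(embeddings["king"], embeddings["banana"])
```

With these toy vectors, "king" is far more similar to "queen" than to "banana", which is the kind of structure a trained embedding is meant to capture.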
X
- XGBoost: An open-source library that implements gradient-boosted decision trees, with an emphasis on training speed and predictive performance.
Y
- YOLO (You Only Look Once): A real-time object detection algorithm that uses a convolutional neural network to predict the bounding boxes and class labels for objects in an image.
Z
- Zero-Shot Learning: A type of machine learning where the model is able to classify objects it has never seen before, based on its knowledge of other objects and their relationships.