| Title: | Leveraging Experiment Lines to Data Analytics |
|---|---|
| Description: | The natural increase in the complexity of current research experiments and data demands better tools to enhance productivity in Data Analytics. The package is a framework designed to address the modern challenges in data analytics workflows. The package is inspired by Experiment Line concepts. It aims to provide seamless support for users in developing their data mining workflows by offering a uniform data model and method API. It enables the integration of various data mining activities, including data preprocessing, classification, regression, clustering, and time series prediction. It also offers options for hyper-parameter tuning and supports integration with existing libraries and languages. Overall, the package provides researchers with a comprehensive set of functionalities for data science, promoting ease of use, extensibility, and integration with various tools and libraries. Information on Experiment Line is based on Ogasawara et al. (2009) <doi:10.1007/978-3-642-02279-1_20>. |
| Authors: | Eduardo Ogasawara [aut, ths, cre] (ORCID: <https://orcid.org/0000-0002-0466-0626>), Ana Carolina Sá [aut], Antonio Castro [aut], Caio Santos [aut], Diego Carvalho [ctb], Diego Salles [aut], Eduardo Bezerra [ctb], Esther Pacitti [ctb], Fabio Porto [ctb], Janio Lima [aut], Lucas Tavares [aut], Rafaelli Coutinho [ctb], Rebecca Salles [aut], Vinicius Saidy [aut], CEFET/RJ [cph] |
| Maintainer: | Eduardo Ogasawara <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.747 |
| Built: | 2026-05-20 03:02:36 UTC |
| Source: | https://github.com/cefet-rj-dal/daltoolbox |
Generic to apply the object to data (e.g., predict, transform).
action(obj, ...)
obj |
object: a dal_base object to apply the transformation on the input dataset. |
... |
optional arguments. |
returns the result of applying the object's action to the provided data
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
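The fit/action pattern in the example above can be mimicked in base R. The following is a minimal sketch, not the package's implementation: `minmax_fit` and `minmax_apply` are hypothetical helper names, and the sketch assumes that fitting stores per-column minima and maxima while the action rescales numeric columns to [0, 1].

```r
# Hypothetical helpers illustrating a fit/apply split for min-max scaling.
minmax_fit <- function(data) {
  num <- vapply(data, is.numeric, logical(1))
  list(
    cols = names(data)[num],
    min = vapply(data[num], min, numeric(1)),
    max = vapply(data[num], max, numeric(1))
  )
}

minmax_apply <- function(params, data) {
  for (col in params$cols) {
    span <- params$max[[col]] - params$min[[col]]
    # Constant columns are mapped to 0 to avoid division by zero.
    data[[col]] <- if (span == 0) 0 else (data[[col]] - params$min[[col]]) / span
  }
  data
}

data(iris)
params <- minmax_fit(iris)          # learn the ranges once
scaled <- minmax_apply(params, iris) # reuse them on any compatible data
range(scaled$Sepal.Length)
```

Keeping the learned ranges separate from the rescaling step means the same parameters can be reused on test data without leaking its range into the transformation.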
Default action() implementation that proxies to transform() for transforms.
## S3 method for class 'dal_transform'
action(obj, ...)
obj |
object |
... |
optional arguments |
returns the transformed data
# See ?minmax for an example of transformation
One‑hot encode a factor vector into a matrix of indicator columns.
adjust_class_label(x, valTrue = 1, valFalse = 0)
x |
vector to be categorized |
valTrue |
value to represent true |
valFalse |
value to represent false |
Values are mapped to valTrue/valFalse (default 1/0). The resulting matrix has column names equal to levels(x).
returns an adjusted categorical mapping
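The mapping described above can be illustrated in base R. This is only a sketch under the stated behavior (one indicator column per level, column names equal to `levels(x)`); `one_hot` is a hypothetical helper, not the package's function.

```r
# Hypothetical helper: builds an indicator matrix from a factor.
one_hot <- function(x, valTrue = 1, valFalse = 0) {
  x <- as.factor(x)
  lev <- levels(x)
  # outer() compares every element with every level, giving a logical matrix.
  hit <- outer(as.character(x), lev, "==")
  out <- ifelse(hit, valTrue, valFalse)
  colnames(out) <- lev
  out
}

labels <- factor(c("a", "b", "a", "c"))
m <- one_hot(labels)
m
```

Each row of `m` has exactly one entry equal to `valTrue`, in the column named after that observation's level.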
Coerce an object to data.frame if needed (useful for S3 methods in this package).
adjust_data.frame(data)
data |
dataset |
returns a data.frame
data(iris)
df <- adjust_data.frame(iris)
Convert a vector to a factor with specified internal levels (ilevels) and labels (slevels).
adjust_factor(value, ilevels, slevels)
value |
vector to be converted into factor |
ilevels |
order for categorical values |
slevels |
labels for categorical values |
Numeric vectors are first converted to factors with ilevels as the level order, then relabeled to slevels.
returns an adjusted factor
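The two-step conversion described above (order by `ilevels`, then relabel with `slevels`) corresponds to a common base R idiom. The values below are made up for illustration, and the sketch does not claim to reproduce the package's internals.

```r
# Illustrative values: numeric codes and the labels they should receive.
value <- c(2, 1, 3, 2)
ilevels <- c(1, 2, 3)
slevels <- c("low", "mid", "high")

# Step 1: build the factor with the internal level order.
f <- factor(value, levels = ilevels)
# Step 2: relabel the levels positionally.
levels(f) <- slevels
f
```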
Coerce an object to matrix if needed (useful before algorithms that expect matrices).
adjust_matrix(data)
data |
dataset |
returns an adjusted matrix
data(iris)
mat <- adjust_matrix(iris)
Aggregate data by a grouping attribute using named expressions.
aggregation(group, ...)
group |
grouping column name (string) |
... |
named expressions evaluated per group |
returns an object of class aggregation
data(iris)
agg <- aggregation(
  "Species",
  mean_sepal = mean(Sepal.Length),
  sd_sepal = sd(Sepal.Length),
  n = n()
)
iris_agg <- transform(agg, iris)
iris_agg
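For comparison, the same per-group summary can be computed with base R alone. This is a sketch of the equivalent result, assuming one output row per group; it is not how the package evaluates the named expressions.

```r
data(iris)

# Split the measured column by the grouping attribute.
groups <- split(iris$Sepal.Length, iris$Species)

# Evaluate each summary once per group and assemble one row per group.
iris_agg_base <- data.frame(
  Species = names(groups),
  mean_sepal = vapply(groups, mean, numeric(1)),
  sd_sepal = vapply(groups, stats::sd, numeric(1)),
  n = vapply(groups, length, integer(1)),
  row.names = NULL
)
iris_agg_base
```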
Base class for encoder‑only autoencoders. Intended to be subclassed by concrete implementations that learn a lower‑dimensional latent representation.
autoenc_base_e(input_size, encoding_size)
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to output the encoded representation.
returns an autoenc_base_e object
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_e.md
Base class for autoencoders that both encode and decode. Intended to be subclassed by concrete implementations that learn to compress and reconstruct inputs.
autoenc_base_ed(input_size, encoding_size)
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to perform encode+decode.
returns an autoenc_base_ed object
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_ed.md
Balance class distributions by randomly replicating minority examples or by generating synthetic samples with a local SMOTE implementation.
bal_oversampling(attribute, method = c("smote", "random"), k = 5)
attribute |
target class attribute name |
method |
oversampling strategy: "smote" or "random" |
k |
number of nearest neighbors used by the SMOTE strategy |
returns an object of class bal_oversampling
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique.
data(iris)
iris_imb <- iris[c(1:50, 51:71, 101:111), ]
bal <- bal_oversampling("Species", method = "smote")
iris_bal <- transform(bal, iris_imb)
table(iris_bal$Species)
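The "random" strategy mentioned above can be sketched in base R. The helper `oversample_random` is hypothetical, and the sketch assumes that every class is replicated up to the majority count; the package's actual procedure, and its SMOTE variant in particular, may differ.

```r
# Hypothetical helper: replicate rows of smaller classes until every class
# reaches the majority count.
oversample_random <- function(data, attribute) {
  counts <- table(data[[attribute]])
  counts <- counts[counts > 0]
  target <- max(counts)
  parts <- lapply(names(counts), function(cls) {
    idx <- which(data[[attribute]] == cls)
    extra <- target - length(idx)
    if (extra > 0) {
      # Sampling positions avoids sample()'s special case for length-1 input.
      idx <- c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
    }
    data[idx, , drop = FALSE]
  })
  do.call(rbind, parts)
}

set.seed(1)
iris_imb <- iris[c(1:50, 51:71, 101:111), ]
iris_over <- oversample_random(iris_imb, "Species")
table(iris_over$Species)
```

Random replication only duplicates existing rows, whereas SMOTE interpolates between a minority example and one of its `k` nearest neighbors to create new points.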
Balance class distributions by randomly reducing all classes to the minority count.
bal_subsampling(attribute)
attribute |
target class attribute name |
returns an object of class bal_subsampling
data(iris)
iris_imb <- iris[c(1:50, 51:71, 101:111), ]
bal <- bal_subsampling("Species")
iris_bal <- transform(bal, iris_imb)
table(iris_bal$Species)
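Reducing all classes to the minority count can be sketched in base R as follows. `subsample_random` is a hypothetical helper used only for illustration, assuming sampling without replacement within each class.

```r
# Hypothetical helper: keep a random subset of each class, sized to the
# smallest class.
subsample_random <- function(data, attribute) {
  counts <- table(data[[attribute]])
  counts <- counts[counts > 0]
  target <- min(counts)
  keep <- unlist(lapply(names(counts), function(cls) {
    idx <- which(data[[attribute]] == cls)
    idx[sample.int(length(idx), target)]
  }))
  # Sorting preserves the original row order.
  data[sort(keep), , drop = FALSE]
}

set.seed(1)
iris_imb <- iris[c(1:50, 51:71, 101:111), ]
iris_sub <- subsample_random(iris_imb, "Species")
table(iris_sub$Species)
```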
Housing values in suburbs of Boston.
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: percentage of lower status of the population
medv: Median value of owner-occupied homes in $1000's
data(Boston)
Regression Dataset.
This dataset was obtained from the MASS library.
Creator: Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air, J. Environ. Economics & Management, vol.5, 81-102, 1978.
data(Boston)
head(Boston)
Convert a factor column into dummy variables (one‑hot encoding) using model.matrix without intercept.
Each level becomes a separate binary column.
categ_mapping(attribute)
attribute |
attribute to be categorized. |
This is a light wrapper around stats::model.matrix(~ attr - 1, data) that drops the original column
and returns only the dummy variables.
returns a data frame with binary attributes, one for each possible category.
cm <- categ_mapping("Species")
iris_cm <- transform(cm, iris)

# can also be applied to a single-column data frame
species <- iris[, "Species", drop = FALSE]
iris_cm <- transform(cm, species)
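The underlying `stats::model.matrix` call can be tried directly. This sketch shows only the base R step named above; the column names below are the ones `model.matrix` generates, which the package may or may not rename.

```r
data(iris)

# "- 1" removes the intercept, so every level gets its own indicator column.
dummies <- as.data.frame(stats::model.matrix(~ Species - 1, data = iris))
names(dummies)
head(dummies, 3)
```

Without `- 1`, the first level would be absorbed into the intercept and only two indicator columns would remain.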
Bagging classifier using ipred::bagging.
cla_bagging(attribute, nbagg = 25)
attribute |
target attribute name |
nbagg |
number of bootstrap aggregations |
returns a cla_bagging object
if (requireNamespace("ipred", quietly = TRUE)) {
  data(iris)
  model <- cla_bagging("Species", nbagg = 25)
  model <- fit(model, iris)
  pred <- predict(model, iris)
  eval <- evaluate(model, adjust_class_label(iris$Species), pred)
  eval$metrics
}
Boosting classifier using adabag::boosting.
cla_boosting(attribute, mfinal = 50)
attribute |
target attribute name |
mfinal |
number of boosting iterations |
returns a cla_boosting object
if (requireNamespace("adabag", quietly = TRUE)) {
  data(iris)
  model <- cla_boosting("Species", mfinal = 10)
  model <- fit(model, iris)
  pred <- predict(model, iris)
  eval <- evaluate(model, adjust_class_label(iris$Species), pred)
  eval$metrics
}
Univariate decision tree for classification using recursive partitioning.
This wrapper uses the tree package.
cla_dtree(attribute, slevels)
attribute |
target attribute for model building |
slevels |
the possible values for the target classification |
Decision trees split the feature space by maximizing node purity (e.g., Gini/entropy), yielding a human‑readable set of rules. They are fast and interpretable, and often used as base learners in ensembles.
returns a classification object
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Logistic regression classifier using stats::glm with binomial family.
cla_glm(attribute, positive, features = NULL, threshold = 0.5)
attribute |
target attribute name |
positive |
positive class label |
features |
optional vector of feature names (default: all except attribute) |
threshold |
probability threshold for positive class |
returns a cla_glm object
data(iris)
iris_bin <- iris
iris_bin$IsVersicolor <- factor(ifelse(
  iris_bin$Species == "versicolor",
  "versicolor",
  "not_versicolor"
))
model <- cla_glm("IsVersicolor", positive = "versicolor")
model <- suppressWarnings(fit(model, iris_bin))
pred <- predict(model, iris_bin)
eval <- evaluate(model, adjust_class_label(iris_bin$IsVersicolor), pred)
eval$metrics
Logistic regression with L1 penalty using glmnet::cv.glmnet.
cla_glmnet(attribute, lambda = c("lambda.min", "lambda.1se"))
attribute |
target attribute name (binary) |
lambda |
which lambda to use ("lambda.min" or "lambda.1se") |
returns a cla_glmnet object
if (requireNamespace("glmnet", quietly = TRUE)) {
  data(iris)
  iris_bin <- iris
  iris_bin$IsVersicolor <- factor(ifelse(
    iris_bin$Species == "versicolor",
    "versicolor",
    "not_versicolor"
  ))
  model <- cla_glmnet("IsVersicolor")
  model <- fit(model, iris_bin)
  pred <- predict(model, iris_bin)
  eval <- evaluate(model, adjust_class_label(iris_bin$IsVersicolor), pred)
  eval$metrics
}
Classification by majority vote among the k nearest neighbors. Uses class::knn.
cla_knn(attribute, slevels, k = 1)
attribute |
target attribute for model building. |
slevels |
possible values for the target classification. |
k |
a vector of integers indicating the number of neighbors to be considered. |
KNN is a simple, non‑parametric method. Choice of k trades bias/variance; distance metric is Euclidean by default.
returns a knn object.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Info. Theory.
data(iris)
slevels <- levels(iris$Species)
model <- cla_knn("Species", slevels, k=3)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Trivial classifier that always predicts the most frequent class observed in the training data. Useful as a baseline.
cla_majority(attribute, slevels)
attribute |
target attribute for model building. |
slevels |
possible values for the target classification. |
returns a classification object.
data(iris)
slevels <- levels(iris$Species)
model <- cla_majority("Species", slevels)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
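The baseline idea itself fits in a few lines of base R. The sketch below uses made-up labels and a hypothetical helper name, `majority_fit`; it illustrates the rule rather than the package's code.

```r
# Hypothetical helper: remember the most frequent training label.
majority_fit <- function(y) names(which.max(table(y)))

train_y <- factor(c("a", "b", "b", "c", "b"))
test_y <- factor(c("b", "a", "c"), levels = levels(train_y))

maj <- majority_fit(train_y)
# Every test observation receives the same predicted label.
pred <- factor(rep(maj, length(test_y)), levels = levels(train_y))
baseline_accuracy <- mean(pred == test_y)
baseline_accuracy
```

Any trained classifier should beat this accuracy; if it does not, the features carry little usable signal or the pipeline has a fault.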
Multi-Layer Perceptron classifier using nnet::nnet (single hidden layer).
cla_mlp(attribute, slevels, size = NULL, decay = 0.1, maxit = 1000)
attribute |
target attribute for model building |
slevels |
possible values for the target classification |
size |
number of nodes that will be used in the hidden layer |
decay |
weight decay (L2 regularization strength) |
maxit |
maximum iterations |
Uses softmax output with one‑hot targets from adjust_class_label. size controls hidden units and
decay applies L2 regularization. Features should be scaled.
returns a classification object
Rumelhart, D., Hinton, G., Williams, R. (1986). Learning representations by back‑propagating errors. Bishop, C. M. (1995). Neural Networks for Pattern Recognition.
data(iris)
slevels <- levels(iris$Species)
model <- cla_mlp("Species", slevels, size=3, decay=0.03)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Multiclass classification using nnet::multinom.
cla_multinom(attribute, features = NULL)
attribute |
target attribute name |
features |
optional vector of feature names (default: all except attribute) |
returns a cla_multinom object
data(iris)
model <- cla_multinom("Species")
model <- fit(model, iris)
pred <- predict(model, iris)
eval <- evaluate(model, adjust_class_label(iris$Species), pred)
eval$metrics
Naive Bayes classification using e1071::naiveBayes.
cla_nb(attribute, slevels)
attribute |
target attribute for model building. |
slevels |
possible values for the target classification. |
Assumes conditional independence of features given the class label, enabling fast probabilistic classification.
returns a classification object.
Mitchell, T. (1997). Machine Learning. McGraw‑Hill. (Naive Bayes)
data(iris)
slevels <- levels(iris$Species)
model <- cla_nb("Species", slevels)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Ensemble classifier of decision trees using randomForest::randomForest.
cla_rf(attribute, slevels, nodesize = 5, ntree = 10, mtry = NULL)
attribute |
target attribute for model building |
slevels |
possible values for the target classification |
nodesize |
node size |
ntree |
number of trees |
mtry |
number of attributes randomly sampled as candidates at each split |
Combines many decorrelated trees to reduce variance. Key hyperparameters: ntree, mtry, nodesize.
returns a classification object
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32. Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
data(iris)
slevels <- levels(iris$Species)
model <- cla_rf("Species", slevels, ntree=5)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Classification tree using rpart::rpart.
cla_rpart(attribute)
attribute |
target attribute name |
returns a cla_rpart object
if (requireNamespace("rpart", quietly = TRUE)) {
  data(iris)
  model <- cla_rpart("Species")
  model <- fit(model, iris)
  pred <- predict(model, iris)
  eval <- evaluate(model, adjust_class_label(iris$Species), pred)
  eval$metrics
}
Support Vector Machines (SVM) for classification using e1071::svm.
cla_svm(
  attribute,
  slevels,
  epsilon = 0.1,
  cost = 10,
  kernel = c("radial", "linear", "polynomial", "sigmoid")
)
attribute |
target attribute for model building |
slevels |
possible values for the target classification |
epsilon |
parameter that controls the width of the margin around the separating hyperplane |
cost |
parameter that controls the trade-off between having a wide margin and correctly classifying training data points |
kernel |
the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |
SVMs find a maximum‑margin hyperplane in a transformed feature space defined
by a kernel (linear, radial, polynomial, sigmoid). The cost controls the trade‑off
between margin width and training error; epsilon affects stopping; kernel sets the feature map.
returns a SVM classification object
Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3):273–297. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.
data(iris)
slevels <- levels(iris$Species)
model <- cla_svm("Species", slevels, epsilon = 0.0, cost = 20.000)

# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

model <- fit(model, train)

prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Tune hyperparameters of a base classifier via k‑fold cross‑validation using a chosen metric.
cla_tune(base_model, folds = 10, ranges = NULL, metric = "accuracy")
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
metric |
metric used to optimize |
returns a cla_tune object
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test

# hyperparameter setup
tune <- cla_tune(cla_mlp("Species", levels(iris$Species)),
  ranges = list(size = c(3:5), decay = c(0.1)))

# hyperparameter optimization
model <- fit(tune, train)

# testing optimization
test_prediction <- predict(model, test)
test_predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
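The k-fold selection loop behind this kind of tuning can be sketched with base R only. The example below is an illustration under assumptions, not the package's procedure: it swaps in a `stats::glm` logistic model as the base learner and tunes a probability threshold, and all variable names are hypothetical.

```r
set.seed(42)
data(iris)
d <- iris[, 1:4]
d$y <- as.integer(iris$Species == "versicolor")

folds <- 5
# Assign each row to one fold at random, with near-equal fold sizes.
fold_id <- sample(rep(seq_len(folds), length.out = nrow(d)))
thresholds <- c(0.3, 0.5, 0.7) # candidate hyperparameter values

cv_accuracy <- vapply(thresholds, function(th) {
  acc <- vapply(seq_len(folds), function(f) {
    train <- d[fold_id != f, ]
    test <- d[fold_id == f, ]
    m <- suppressWarnings(
      stats::glm(y ~ ., data = train, family = stats::binomial())
    )
    p <- stats::predict(m, newdata = test, type = "response")
    mean(as.integer(p >= th) == test$y) # accuracy on the held-out fold
  }, numeric(1))
  mean(acc) # average over folds
}, numeric(1))

best <- thresholds[which.max(cv_accuracy)]
best
```

Each candidate is scored only on rows that were held out during fitting, so the selected value is not chosen on the data used to train the model.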
Gradient boosting classifier using xgboost.
cla_xgboost(attribute, params = list(), nrounds = 20)
attribute |
target attribute name |
params |
list of xgboost parameters |
nrounds |
number of boosting rounds |
returns a cla_xgboost object
if (requireNamespace("xgboost", quietly = TRUE)) {
  data(iris)
  # This setup keeps the example fast for checks and documentation builds.
  # A more typical starting point is:
  # model <- cla_xgboost("Species")
  model <- cla_xgboost(
    "Species",
    params = list(max_depth = 1, nthread = 1),
    nrounds = 1
  )
  model <- fit(model, iris)
  pred <- predict(model, iris)
  eval <- evaluate(model, adjust_class_label(iris$Species), pred)
  eval$metrics
}
Ancestor class for classification models providing common fields (target attribute and levels) and evaluation helpers.
classification(attribute, slevels = NULL)
attribute |
target attribute for model building |
slevels |
possible values for the target classification |
returns a classification object
# See ?cla_dtree for a classification example using a decision tree
Tune clustering hyperparameters by evaluating an intrinsic metric over a parameter grid and selecting the elbow (max curvature).
clu_tune(base_model, folds = 10, ranges = NULL)
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
returns a clu_tune object.
Satopaa, V. et al. (2011). Finding a “Kneedle” in a Haystack.
data(iris)
# fit model
model <- clu_tune(cluster_kmeans(k = 2), ranges = list(k = 2:10))
model <- fit(model, iris[,1:4])
model$k
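Elbow selection over a parameter grid can be approximated in base R. The sketch below is an assumption-laden stand-in, not the package's method: it scores each `k` by the total within-cluster SSE from `stats::kmeans` and takes as the elbow the point farthest from the chord joining the first and last points of the normalized curve, a simplification in the spirit of the Kneedle idea.

```r
set.seed(1)
x <- scale(iris[, 1:4])
ks <- 2:10

# Intrinsic metric for each candidate k: total within-cluster SSE.
sse <- vapply(ks, function(k) {
  stats::kmeans(x, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

# Normalize both axes to [0, 1] so distances are comparable.
kn <- (ks - min(ks)) / (max(ks) - min(ks))
sn <- (sse - min(sse)) / (max(sse) - min(sse))

# Perpendicular distance of each point to the chord between the endpoints.
n <- length(ks)
dx <- kn[n] - kn[1]
dy <- sn[n] - sn[1]
dist_to_chord <- abs(dy * (kn - kn[1]) - dx * (sn - sn[1])) / sqrt(dx^2 + dy^2)

elbow_k <- ks[which.max(dist_to_chord)]
elbow_k
```

The selected value depends on the grid limits and on the normalization, so it is best read as a heuristic suggestion rather than a definitive number of clusters.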
Generic for clustering methods
cluster(obj, ...)
obj |
a |
... |
optional arguments |
returns the clustered data
# See ?cluster_kmeans for a clustering example
Fuzzy c-means clustering using e1071::cmeans.
cluster_cmeans(centers = 2, m = 2, iter = 100, dist = "euclidean")
centers |
number of clusters |
m |
fuzziness parameter (m > 1) |
iter |
maximum number of iterations |
dist |
distance method passed to e1071::cmeans |
Produces soft membership for each cluster. The hard assignment is returned by cluster().
Membership degrees are returned in the membership attribute.
returns a fuzzy clustering object.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
data(iris)
model <- cluster_cmeans(centers = 3, m = 2)
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
Density-Based Spatial Clustering of Applications with Noise using dbscan::dbscan.
cluster_dbscan(minPts = 3, eps = NULL)
minPts |
minimum number of points |
eps |
distance value |
Discovers clusters as dense regions separated by sparse areas. Hyperparameters are eps (neighborhood radius)
and minPts (minimum points). If eps is missing, it is estimated from the kNN distance curve elbow.
returns a dbscan object
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
# setup clustering
model <- cluster_dbscan(minPts = 3)

# load dataset
data(iris)

# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)

# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
Model-based clustering using mclust::Mclust.
cluster_gmm(G = NULL, modelNames = NULL)
G |
number of mixture components (clusters). If NULL, |
modelNames |
optional character vector of model names passed to mclust::Mclust |
Fits a Gaussian mixture model and returns the MAP classification.
The fitted model is stored in obj$model. Requires the mclust package.
returns a GMM clustering object.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering. JASA.
if (requireNamespace("mclust", quietly = TRUE)) {
  data(iris)
  model <- cluster_gmm(G = 3)
  model <- fit(model, iris[,1:4])
  clu <- cluster(model, iris[,1:4])
  table(clu)
}
Agglomerative hierarchical clustering using stats::hclust.
cluster_hclust(
  k = 2,
  h = NULL,
  method = c("ward.D2", "ward.D", "single", "complete", "average", "mcquitty", "median", "centroid"),
  dist = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"),
  scale = TRUE
)
k |
number of clusters to cut the tree (default 2) |
h |
height to cut the tree (optional; if provided, overrides k) |
method |
linkage method passed to stats::hclust |
dist |
distance method passed to stats::dist |
scale |
logical; whether to scale data before distance (default TRUE) |
Computes a distance matrix (optionally after scaling) and builds a dendrogram. Clusters are
obtained by cutting the tree with k (number of clusters) or h (height).
returns a hierarchical clustering object.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika.
data(iris)
model <- cluster_hclust(k = 3)
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
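The pipeline described for cluster_hclust (scale, distance matrix, dendrogram, cut) can be sketched in base R. This is an illustration built only on stats functions, not the package's internal code; the cut height of 10 is an arbitrary value chosen for the example.

```r
# Base-R sketch: scale -> distance matrix -> dendrogram -> cut.
x <- scale(iris[, 1:4])                      # standardize columns (scale = TRUE)
d <- stats::dist(x, method = "euclidean")    # pairwise distance matrix
tree <- stats::hclust(d, method = "ward.D2") # agglomerative dendrogram

clu_k <- stats::cutree(tree, k = 3)          # cut by number of clusters
clu_h <- stats::cutree(tree, h = 10)         # cut by height instead (arbitrary value)

table(clu_k)
length(unique(clu_h))
```

Cutting by k fixes the number of groups, while cutting by h lets the number of groups follow from the dendrogram heights.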
k-means clustering using stats::kmeans.
cluster_kmeans(k = 1)
k |
the number of clusters to form. |
Partitions data into k clusters minimizing within‑cluster sum of squares. The intrinsic quality metric returned is the total within‑cluster SSE (lower is better).
returns a k-means object.
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations. Lloyd, S. (1982). Least squares quantization in PCM.
# setup clustering
model <- cluster_kmeans(k=3)
# load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
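The total within-cluster sum of squares mentioned for cluster_kmeans can be reproduced by hand from a stats::kmeans fit. The sketch below uses base R only and assumes Euclidean distance to the assigned centroid; it does not call the package.

```r
set.seed(1)
x <- as.matrix(iris[, 1:4])
km <- stats::kmeans(x, centers = 3, nstart = 10)

# Each row of `centers` is the centroid assigned to the matching row of x.
centers <- km$centers[km$cluster, , drop = FALSE]

# Sum over all points of the squared distance to their own centroid.
wcss <- sum((x - centers)^2)

all.equal(wcss, km$tot.withinss)
```

Lower values indicate tighter clusters, but the value always decreases as k grows, so it is usually compared across several k rather than read in isolation.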
Graph community detection using igraph::cluster_louvain.
cluster_louvain_graph(weights = NULL)
weights |
optional edge weights to pass to |
Accepts an igraph object and returns community memberships.
Requires the igraph package.
returns a Louvain clustering object.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. J. Statistical Mechanics.
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::sample_gnp(n = 20, p = 0.15)
  model <- cluster_louvain_graph()
  model <- fit(model, g)
  clu <- cluster(model, g)
  table(clu)
}
Clustering around representative data points (medoids) using cluster::pam.
cluster_pam(k = 1)
k |
the number of clusters to generate. |
More robust to outliers than k‑means. The intrinsic metric reported is the within‑cluster SSE to medoids.
returns a PAM object.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis.
# setup clustering
model <- cluster_pam(k = 3)
# load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
Base class for clustering algorithms and related evaluation utilities.
clusterer()
The object stores shared state and defaults used by clustering methods. Current algorithms may still differ in how much they use this state, but the goal is to standardize future implementations around:
fit() learning and storing model state
cluster() producing labels from a fitted model
configurable internal/external metrics and selection helpers via cluutils()
returns a clusterer object
# See ?cluster_kmeans for a clustering example
Utility object that groups clustering metrics and model-selection helpers.
cluutils()
The object organizes helpers into two semantic groups:
Metrics
metric_wcss() computes the total within-cluster sum of squares.
metric_silhouette() computes the mean silhouette score from pairwise distances.
metric_entropy() computes external clustering entropy against a reference label.
metric_purity() computes cluster purity against a reference label.
metric_davies_bouldin() computes the Davies-Bouldin index.
metric_calinski_harabasz() computes the Calinski-Harabasz score.
metric_adjusted_rand_index() computes the adjusted Rand index.
metric_noise_points() summarizes the number of noise points in density-based clustering.
metric_loglik() and metric_modularity() expose model-specific quality summaries.
Selectors
selector_best() selects the best hyperparameter value by direct optimization.
selector_elbow() selects the elbow of a metric curve via maximum curvature.
Metric helpers return a standardized list with fields metric, value, goal,
and type. This keeps the contract uniform even when the metrics themselves differ.
returns a cluutils object exposing metric and selector helpers.
utils <- cluutils()
data(iris)
x <- iris[, 1:4]
clu <- stats::kmeans(x, centers = 3)$cluster
utils$metric_wcss(x, clu)
utils$metric_silhouette(x, clu)
utils$metric_entropy(clu, iris$Species)
utils$selector_best(c(0.31, 0.42, 0.39), goal = "maximize")
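To make the two external metrics concrete, here is a base-R sketch of purity and entropy against a reference label. The helper names purity() and cluster_entropy() are hypothetical, and details such as the log base are assumptions; the package's metric_purity() and metric_entropy() may differ.

```r
# Purity: fraction of points that belong to the majority class of their cluster.
purity <- function(clu, ref) {
  tab <- table(clu, ref)
  sum(apply(tab, 1, max)) / sum(tab)
}

# Entropy: size-weighted average of the class entropy inside each cluster (base 2).
cluster_entropy <- function(clu, ref) {
  tab <- table(clu, ref)
  h <- apply(tab, 1, function(row) {
    p <- row[row > 0] / sum(row)
    -sum(p * log2(p))
  })
  sum(rowSums(tab) / sum(tab) * h)
}

clu <- c(1, 1, 1, 2, 2, 2)
ref <- c("a", "a", "b", "b", "b", "b")
purity(clu, ref)           # 5 of 6 points match their cluster's majority class
cluster_entropy(clu, ref)  # only cluster 1 is mixed, so only it contributes
```

Purity is maximized and entropy is minimized, which is the kind of information the goal field of a standardized metric result is meant to carry.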
Minimal abstract base class for all DAL objects. Defines the common generics fit() and action()
used by transforms and learners.
dal_base()
returns a dal_base object
trans <- dal_base()
A collection of small plotting helpers built on ggplot2 used across the package
to quickly visualize vectors, grouped summaries and time series. All functions return a
ggplot2::ggplot object so you can further customize the theme, scales, and annotations.
Conventions adopted:
Input data generally follows the pattern: first column is an index or category (x), remaining columns
are numeric series; in some functions a long format is expected with columns named x, value, variable.
The colors parameter accepts either a single color or a vector mapped to groups/variables.
Transparency is controlled by alpha where provided.
All helpers set a light theme_bw() baseline and place legends at the bottom by default.
ggplot2
Base ancestor for learning tasks (classification, regression, clustering, time series).
Provides common behavior such as proxying action() to the model‑specific operation
(e.g., predict() for predictors, cluster() for clusterers) and an evaluate() generic.
An example of a learner is a decision tree (see cla_dtree).
dal_learner()
returns a learner object
# See ?cla_dtree for a classification example using a decision tree
Base class for data transformations with optional fit()/inverse_transform() support.
dal_transform()
The default transform() calls the underlying action.default(); subclasses should implement
transform.className and optionally inverse_transform.className.
returns a dal_transform object
# See ?minmax or ?zscore for examples
Base class for hyperparameter optimization that stores a base model, a fold count, and a parameter grid. Specializations (classification/regression/clustering) implement the evaluation logic.
dal_tune(base_model, folds = 10, ranges)
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Ranges are expanded via expand.grid, and selection is delegated to select_hyper() which can be
overridden by subclasses to implement custom criteria.
returns a dal_tune object
# See ?cla_tune for classification tuning
# See ?reg_tune for regression tuning
# See ?ts_tune for time series tuning
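The grid expansion described for dal_tune can be illustrated with expand.grid directly. In this sketch the parameter names and the score values are made up for illustration, and taking the minimum score is only one possible selection rule; select_hyper() in the package may apply a different criterion.

```r
# A list of hyperparameter ranges, as passed in `ranges`.
ranges <- list(k = c(2, 3, 4), scale = c(TRUE, FALSE))

# Every combination becomes one row (one candidate configuration).
grid <- expand.grid(ranges, stringsAsFactors = FALSE)
nrow(grid)

# Hypothetical scores, one per grid row (in practice, e.g., mean error across folds).
grid$score <- c(0.30, 0.22, 0.25, 0.41, 0.35, 0.38)

# Select the configuration with the lowest score.
best <- grid[which.min(grid$score), ]
best
```

The grid grows multiplicatively with each added range, so the fold count and the number of values per range jointly determine the tuning cost.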
Base class for sampling strategies that provide train/test splitting and k‑fold partitioning.
Two standard implementations are sample_random() and sample_stratified().
data_sample()
returns an object of class data_sample
# using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
  tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Generic for pattern discovery.
discover(obj, ...)
obj |
a |
... |
optional arguments |
discovered patterns
Principal Component Analysis (PCA) for unsupervised dimensionality reduction. Transforms correlated variables into orthogonal principal components ordered by explained variance.
dt_pca(attribute = NULL, components = NULL)
attribute |
target attribute for model building |
components |
number of components for PCA |
Fits PCA on (optionally) the numeric predictors only (excluding attribute when provided),
removes constant columns, and selects the number of components by an elbow rule (minimum curvature)
unless components is set explicitly. New data are projected with the same centering
and scaling learned during fit().
returns an object of class dt_pca
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.
mypca <- dt_pca("Species")
# Automatically fitting number of components
mypca <- fit(mypca, iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
# Manual establishment of number of components
mypca <- dt_pca("Species", 3)
mypca <- fit(mypca, datasets::iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
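The statement that new data are projected with the centering and scaling learned during fit() can be checked with stats::prcomp. This is a base-R sketch of the idea, not the dt_pca implementation; the 100/50 row split and the choice of two components are arbitrary.

```r
x <- as.matrix(iris[, 1:4])
train <- x[1:100, ]
new <- x[101:150, ]

# Learn centering, scaling, and rotation from the training rows only.
pca <- stats::prcomp(train, center = TRUE, scale. = TRUE)

# Cumulative proportion of explained variance, the curve an elbow rule inspects.
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cumvar

# Project unseen rows using the stored training parameters.
proj <- predict(pca, newdata = new)[, 1:2]

# The same projection written out by hand.
manual <- scale(new, center = pca$center, scale = pca$scale) %*% pca$rotation[, 1:2]
all.equal(unname(proj), unname(manual))
```

Reusing the training parameters keeps train and test data in the same coordinate system, which is why the projection must not be refit on the new rows.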
Evaluate learner performance. The actual evaluation varies according to the type of learner (clustering, classification, regression, time series regression).
evaluate(obj, ...)
obj |
object |
... |
optional arguments |
returns the evaluation
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)
model <- fit(model, iris)
prediction <- predict(model, iris)
predictand <- adjust_class_label(iris[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Create new features from existing columns using named expressions.
feature_generation(...)
... |
named expressions that compute new features |
returns an object of class feature_generation
data(iris)
gen <- feature_generation(
  Sepal.Area = Sepal.Length * Sepal.Width,
  Petal.Area = Petal.Length * Petal.Width,
  Sepal.Ratio = Sepal.Length / Sepal.Width
)
iris_feat <- transform(gen, iris)
head(iris_feat)
Remove highly correlated numeric features based on a correlation cutoff.
feature_selection_corr(cutoff = 0.9, features = NULL, keep = NULL)
cutoff |
correlation cutoff in [0, 1] above which one feature is removed |
features |
optional vector of feature names to consider (default: all numeric columns) |
keep |
optional vector of columns that should always be kept in |
Uses caret::findCorrelation on the correlation matrix computed from numeric columns.
returns an object of class feature_selection_corr
data(iris)
fs <- feature_selection_corr(cutoff = 0.9)
fs <- fit(fs, iris)
iris_fs <- transform(fs, iris)
fs$selected
names(iris_fs)
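The idea behind a correlation cutoff can be sketched without caret. The helper drop_correlated() below is hypothetical and uses a simple greedy rule (from the most correlated pair, drop the feature with the larger mean absolute correlation); it is a simplification and is not guaranteed to match caret::findCorrelation in every case.

```r
drop_correlated <- function(data, cutoff = 0.9) {
  # Keep numeric columns only and work on absolute correlations.
  num <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  cm <- abs(stats::cor(num))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    # Locate the most correlated remaining pair.
    idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    pair <- colnames(cm)[idx]
    # Drop the member of the pair that is more correlated with everything else.
    worst <- pair[which.max(rowMeans(cm[pair, , drop = FALSE]))]
    dropped <- c(dropped, worst)
    keep <- setdiff(colnames(cm), worst)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

removed <- drop_correlated(iris, cutoff = 0.9)
removed
setdiff(names(iris), removed)
```

On iris only the two petal measurements exceed the 0.9 cutoff, so a single feature is removed and the remaining columns are kept.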
Selects numeric predictors using forward stepwise subset search.
feature_selection_fss(attribute, features = NULL)
attribute |
target attribute name |
features |
optional vector of feature names (default: all columns except |
Uses leaps::regsubsets and keeps the subset with the highest adjusted R-squared.
The target attribute must be numeric.
returns an object of class feature_selection_fss
if (requireNamespace("leaps", quietly = TRUE)) {
  data(Boston)
  fs <- feature_selection_fss("medv")
  fs <- fit(fs, Boston)
  fs$selected
  boston_fs <- transform(fs, Boston)
  names(boston_fs)
}
Rank and select features using information gain with optional discretization.
feature_selection_info_gain(
  attribute,
  features = NULL,
  top = NULL,
  cutoff = 0,
  bins = 3
)
attribute |
target attribute name |
features |
optional vector of feature names (default: all columns except |
top |
optional number of top features to keep |
cutoff |
minimum information gain to keep a feature (default: 0) |
bins |
number of quantile bins for numeric features |
Numeric predictors are discretized by quantile bins before computing entropy-based information gain.
returns an object of class feature_selection_info_gain
data(iris)
fg <- feature_generation(
  IsVersicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")
)
iris_bin <- transform(fg, iris)
iris_bin$IsVersicolor <- factor(iris_bin$IsVersicolor)
fs <- feature_selection_info_gain("IsVersicolor", top = 2)
fs <- fit(fs, iris_bin)
fs$selected
iris_fs <- transform(fs, iris_bin)
names(iris_fs)
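Entropy-based information gain with quantile binning can be sketched in base R as follows. The helpers entropy() and info_gain() are hypothetical names, and choices such as base-2 logarithms and closed lower bins are assumptions that may differ from the package.

```r
# Shannon entropy (base 2) of a label vector.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of feature x about label y; numeric x is cut into quantile bins.
# Assumes x is not constant, so at least two distinct breakpoints exist.
info_gain <- function(x, y, bins = 3) {
  if (is.numeric(x)) {
    br <- unique(stats::quantile(x, probs = seq(0, 1, length.out = bins + 1)))
    x <- cut(x, breaks = br, include.lowest = TRUE)
  }
  groups <- split(y, x, drop = TRUE)
  cond <- sum(vapply(groups, function(g) length(g) / length(y) * entropy(g), numeric(1)))
  entropy(y) - cond
}

gains <- vapply(iris[, 1:4], info_gain, numeric(1), y = iris$Species)
sort(gains, decreasing = TRUE)
```

Ranking by gain and then applying top or cutoff gives the two selection modes described by the parameters.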
Selects predictors using L1-regularized regression.
feature_selection_lasso(attribute, features = NULL)
attribute |
target attribute name |
features |
optional vector of feature names (default: all numeric columns except |
Fits a lasso path with glmnet and keeps predictors with non-zero coefficients at lambda.min.
The target attribute must be numeric.
returns an object of class feature_selection_lasso
if (requireNamespace("glmnet", quietly = TRUE)) {
  data(Boston)
  fs <- feature_selection_lasso("medv")
  fs <- fit(fs, Boston)
  fs$selected
  boston_fs <- transform(fs, Boston)
  names(boston_fs)
}
Rank and select features using a simplified RELIEF algorithm.
feature_selection_relief(
  attribute,
  features = NULL,
  top = NULL,
  cutoff = NULL,
  m = 50
)
attribute |
target attribute name |
features |
optional vector of feature names (default: all columns except |
top |
optional number of top features to keep |
cutoff |
optional minimum RELIEF weight to keep a feature |
m |
number of sampled instances for RELIEF updates |
For each sampled instance, the algorithm compares nearest hit/miss neighbors and updates feature weights.
returns an object of class feature_selection_relief
data(iris)
fg <- feature_generation(
  IsVersicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")
)
iris_bin <- transform(fg, iris)
iris_bin$IsVersicolor <- factor(iris_bin$IsVersicolor)
fs <- feature_selection_relief("IsVersicolor", top = 2, m = 50)
fs <- fit(fs, iris_bin)
fs$selected
transform(fs, iris_bin) |> names()
Select features using stepwise search over generalized linear models.
feature_selection_stepwise(
  attribute,
  features = NULL,
  direction = "forward",
  family = stats::binomial,
  trace = 0
)
attribute |
target attribute name |
features |
optional vector of feature names (default: all columns except |
direction |
stepwise direction: "forward", "backward", or "both" |
family |
glm family passed to |
trace |
level of tracing from |
Supports forward, backward, and both directions via stats::step.
With the default binomial family, the target should represent a binary outcome.
returns an object of class feature_selection_stepwise
data(Boston)
fs <- feature_selection_stepwise("medv", direction = "forward", family = stats::gaussian)
fs <- fit(fs, Boston)
fs$selected
transform(fs, Boston) |> names()
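Forward stepwise search with stats::step can be shown on its own. The sketch below uses the built-in mtcars data, a Gaussian family, and an arbitrary candidate set; it illustrates the underlying call and is not the wrapper's internal code.

```r
# Start from the intercept-only model.
null_model <- stats::glm(mpg ~ 1, data = mtcars, family = stats::gaussian)

# Candidate predictors the search is allowed to add (upper scope).
scope <- mpg ~ wt + hp + qsec + drat

# Add one term at a time while AIC keeps improving.
fit_fwd <- stats::step(null_model, scope = scope, direction = "forward", trace = 0)

# Names of the predictors that survived the search.
selected <- attr(stats::terms(fit_fwd), "term.labels")
selected
```

With direction = "backward" the search would instead start from the full model and remove terms, and "both" allows additions and removals at each step.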
Generic to train/adjust an object using provided data and optional parameters.
fit(obj, ...)
obj |
object |
... |
optional arguments. |
returns an object after fitting
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
Computes a smoothing spline over a sequence and returns the location/value of maximum curvature, often used as an "elbow" detector.
fit_curvature_max()
returns an object of class fit_curvature_max, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the maximum curvature is reached.
y: The value of the sequence at the position where the maximum curvature occurs.
yfit: The value of the maximum curvature.
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = -log(x), variable = "log")
myfit <- fit_curvature_max()
res <- transform(myfit, dat$value)
head(res)
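The three returned fields can be related to a smoothing-spline computation in base R. This sketch assumes curvature is approximated by the second derivative of the spline and uses the default smoothing settings of stats::smooth.spline; the package may use different settings.

```r
y <- -log(seq(from = 1, to = 10, by = 0.5))
x <- seq_along(y)                       # positions 1..n in the sequence

# Fit a smoothing spline and evaluate its second derivative at each position.
sm <- stats::smooth.spline(x, y)
curv <- stats::predict(sm, x = x, deriv = 2)$y

# Position of maximum curvature, the series value there, and the curvature itself.
i <- which.max(curv)
res <- data.frame(x = i, y = y[i], yfit = curv[i])
res
```

Replacing which.max with which.min gives the minimum-curvature counterpart.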
Computes a smoothing spline over a sequence and returns the location/value of minimum curvature, complementary to maximum curvature and useful in elbow detection.
fit_curvature_min()
Returns an object of class fit_curvature_min, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the minimum curvature is reached.
y: The value of the sequence at the position where the minimum curvature occurs.
yfit: The value of the minimum curvature.
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = log(x), variable = "log")
myfit <- fit_curvature_min()
res <- transform(myfit, dat$value)
head(res)
Tunes the hyperparameters of a machine learning model for classification
## S3 method for class 'cla_tune'
fit(obj, data, ...)
obj |
an object containing the model and tuning configuration |
data |
the dataset used for training and evaluation |
... |
optional arguments |
a fitted obj
Fits a DBSCAN clustering model by setting the eps parameter.
If eps is not provided, it is estimated based on the k-nearest neighbor distances.
It wraps the dbscan package.
## S3 method for class 'cluster_dbscan'
fit(obj, data, ...)
obj |
an object containing the DBSCAN model configuration, including |
data |
the dataset to use for fitting the model |
... |
optional arguments |
returns a fitted obj with the eps parameter set
Create a categorical hierarchy from a numeric attribute using cut points.
hierarchy_cut(attribute, breaks, labels = NULL, new_attribute = NULL)
attribute |
numeric attribute to discretize |
breaks |
numeric breakpoints for |
labels |
optional labels for the cut intervals |
new_attribute |
name of the new attribute (default: "attribute.Level") |
returns an object of class hierarchy_cut
data(iris)
hc <- hierarchy_cut(
  "Sepal.Length",
  breaks = c(-Inf, 5.5, 6.5, Inf),
  labels = c("low", "medium", "high")
)
iris_h <- transform(hc, iris)
table(iris_h$Sepal.Length.Level)
Base class for supervised imputers that learn one target column from a set of source columns.
imputation_predictive(target, sources = NULL, method = c("median", "mean"))
target |
target column to impute |
sources |
optional vector of predictor column names |
method |
initial imputation method for numeric source columns: "median" or "mean" |
The target column is the attribute to be imputed. The source columns are the predictors used
to estimate missing target values. If sources = NULL, all supported columns except the target are used.
Missing values in source columns can be pre-imputed by a simpler method before fitting the predictive model.
returns an object of class imputation_predictive
data(iris)
imp <- imputation_predictive(
  "Sepal.Length",
  sources = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
class(imp)
Impute missing values in mixed datasets using simple statistics.
imputation_simple(method = c("median", "mean"), cols = NULL)
method |
imputation method for numeric columns: "median" or "mean" |
cols |
optional vector of column names to impute (default: all supported columns) |
Numeric columns are imputed with the mean or median. Factor, character, logical, and ordered columns are imputed with the mode (most frequent observed value). This class is intended as a low-complexity baseline for preprocessing workflows. The default recommendation of median for numeric variables follows standard data preprocessing guidance because it is less sensitive to outliers than the mean, while mode imputation is the usual baseline for categorical attributes.
returns an object of class imputation_simple
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques.
Little, R. J. A., Rubin, D. B. (2019). Statistical Analysis with Missing Data.
data(iris)
iris_na <- iris
iris_na$Sepal.Length[c(2, 10, 25)] <- NA
iris_na$Species[c(3, 15)] <- NA
imp <- imputation_simple(method = "median")
imp <- fit(imp, iris_na)
iris_imp <- transform(imp, iris_na)
summary(iris_imp$Sepal.Length)
table(iris_imp$Species, useNA = "ifany")
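The median/mean and mode rules can be written as a few lines of base R. The helper impute_simple() is a hypothetical name and handles only numeric, factor, and character columns; ties for the mode are broken by taking the first most frequent value, which is an assumption.

```r
impute_simple <- function(data, method = c("median", "mean")) {
  method <- match.arg(method)
  stat <- if (method == "median") stats::median else mean
  for (col in names(data)) {
    v <- data[[col]]
    if (!anyNA(v)) next
    if (is.numeric(v)) {
      # Numeric: fill with the chosen statistic of the observed values.
      v[is.na(v)] <- stat(v, na.rm = TRUE)
    } else {
      # Factor/character: fill with the most frequent observed value (mode).
      counts <- table(v)
      v[is.na(v)] <- names(counts)[which.max(counts)]
    }
    data[[col]] <- v
  }
  data
}

iris_na <- iris
iris_na$Sepal.Length[c(2, 10, 25)] <- NA
iris_na$Species[c(3, 15)] <- NA

iris_imp <- impute_simple(iris_na, method = "median")
sum(is.na(iris_imp))
iris_imp$Sepal.Length[2] == stats::median(iris_na$Sepal.Length, na.rm = TRUE)
```

In a train/test workflow the statistics would be computed once on the training data and reused, which is the role of the separate fit() and transform() steps.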
Impute one target column from a set of source columns using a decision tree.
imputation_tree(target, sources = NULL, method = c("median", "mean"))
target |
target column to impute |
sources |
optional vector of predictor column names (default: all supported columns except |
method |
initial imputation method for numeric source columns: "median" or "mean" |
The method fits a tree with the observed values of the target column and uses the
source columns as predictors. If source columns contain missing values, they are first
completed with imputation_simple() so the tree can be trained and applied. The learned
model imputes only the target column; source columns are preserved in the returned data.
returns an object of class imputation_tree
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and Regression Trees. Wadsworth.
van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.
data(iris)
iris_na <- iris
iris_na$Sepal.Length[c(2, 10, 25)] <- NA
imp <- imputation_tree("Sepal.Length")
imp <- fit(imp, iris_na)
iris_imp <- transform(imp, iris_na)
summary(iris_imp$Sepal.Length)
sum(is.na(iris_imp$Sepal.Length))
Optional inverse operation for a transformation; defaults to identity.
inverse_transform(obj, ...)
obj |
a dal_transform object. |
... |
optional arguments. |
dataset inverse transformed.
# See ?minmax for an example of transformation
Split a dataset into k folds using a sampling strategy.
k_fold(obj, data, k)
obj |
an object representing the sampling method |
data |
dataset to be partitioned |
k |
number of folds |
returns a list of k data frames
# using random sampling
sample <- sample_random()
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
  tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
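Random k-fold partitioning, and how the folds are typically consumed, can be sketched in base R. This illustrates the general technique with an arbitrary seed and does not reproduce how sample_random() assigns rows.

```r
set.seed(42)
k <- 4

# Assign each row to one of k folds at random, keeping fold sizes balanced.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))
folds <- split(iris, fold_id)
vapply(folds, nrow, integer(1))

# One cross-validation round: fold 1 is held out, the rest form the training set.
test <- folds[[1]]
train <- do.call(rbind, folds[-1])
c(train = nrow(train), test = nrow(test))
```

Repeating the last step with each fold held out in turn yields the k evaluation rounds of cross-validation.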
Linearly scales numeric columns to the [0,1] range per column.
minmax()
For each numeric column j, computes (x - min_j) / (max_j - min_j). Constant columns map to 0.
returns an object of class minmax
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Normalization section)
data(iris)
head(iris)
trans <- minmax()
trans <- fit(trans, iris)
tiris <- transform(trans, iris)
head(tiris)
itiris <- inverse_transform(trans, tiris)
head(itiris)
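The per-column formula and its inverse can be verified on a single vector. The helper minmax_scale() is a hypothetical name used only for this sketch.

```r
minmax_scale <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  if (hi == lo) return(rep(0, length(x)))  # constant column maps to 0
  (x - lo) / (hi - lo)
}

x <- iris$Sepal.Length
scaled <- minmax_scale(x)
range(scaled)

# Inverse: multiply by the original range and add back the minimum.
restored <- scaled * (max(x) - min(x)) + min(x)
all.equal(restored, x)

minmax_scale(c(5, 5, 5))
```

The inverse needs the original minimum and maximum, which is why they are learned in fit() and kept with the transformation object.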
Remove rows (or elements) that contain missing values.
na_removal()
For data frames or matrices, removes rows with any NA. For vectors, removes NA values.
returns an object of class na_removal
data(iris)
iris.na <- iris
iris.na$Sepal.Length[2] <- NA
obj <- na_removal()
iris.clean <- transform(obj, iris.na)
nrow(iris.clean)
Removes outliers from numeric columns using Tukey's boxplot rule: values below Q1 - alpha·IQR or above Q3 + alpha·IQR are flagged as outliers.
outliers_boxplot(alpha = 1.5)
alpha |
boxplot outlier threshold (default 1.5, but can be 3.0 to remove extreme values) |
The default alpha=1.5 corresponds to the standard boxplot whiskers; alpha=3 is used for extreme outliers.
returns an outlier object
Tukey, J. W. (1977). Exploratory Data Analysis. Addison‑Wesley.
# code for outlier removal
out_obj <- outliers_boxplot() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
# inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_boxplot <- iris[idx,]
iris.outliers_boxplot
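Tukey's rule for one column can be written directly from the definition. The helper tukey_outliers() is a hypothetical name, and the quartiles use R's default quantile type, which is an assumption about how the boundaries are computed.

```r
tukey_outliers <- function(x, alpha = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  # Flag values below Q1 - alpha * IQR or above Q3 + alpha * IQR.
  x < q[1] - alpha * iqr | x > q[2] + alpha * iqr
}

x <- iris$Sepal.Width
flag <- tukey_outliers(x, alpha = 1.5)
x[flag]                              # values flagged with the standard whiskers
sum(tukey_outliers(x, alpha = 3))    # stricter rule for extreme outliers only
```

Raising alpha widens the accepted interval, so fewer points are flagged.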
Removes outliers from numeric columns using the 3‑sigma rule under a Gaussian assumption: values outside mean ± alpha·sd are flagged as outliers.
outliers_gaussian(alpha = 3)
alpha |
gaussian threshold (default 3) |
returns an outlier object
Pukelsheim, F. (1994). The Three Sigma Rule. The American Statistician 48(2):88–91.
# code for outlier removal
out_obj <- outliers_gaussian() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
# inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_gaussian <- iris[idx,]
iris.outliers_gaussian
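The mean ± alpha·sd rule for one column looks like this in base R. The helper gaussian_outliers() is a hypothetical name; it uses the sample standard deviation from stats::sd.

```r
gaussian_outliers <- function(x, alpha = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  # Flag values farther than alpha standard deviations from the mean.
  x < m - alpha * s | x > m + alpha * s
}

x <- iris$Sepal.Width
flag <- gaussian_outliers(x, alpha = 3)
which(flag)
x[flag]
```

Because the mean and standard deviation are themselves affected by extreme values, this rule tends to flag fewer points than the boxplot rule on the same column.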
Frequent itemsets and association rules using arules::apriori.
pat_apriori(
  target = c("rules", "frequent itemsets"),
  supp = 0.5,
  conf = 0.9,
  minlen = 2,
  maxlen = 10,
  lhs = NULL,
  rhs = NULL,
  include = NULL,
  exclude = NULL,
  quality_filter = NULL,
  control = NULL
)
target |
mining target: |
supp |
minimum support threshold |
conf |
minimum confidence threshold for rules |
minlen |
minimum pattern length |
maxlen |
maximum pattern length |
lhs |
optional vector of items constrained to the left-hand side of rules |
rhs |
optional vector of items constrained to the right-hand side of rules |
include |
optional vector of items allowed in the discovered patterns |
exclude |
optional vector of items forbidden in the discovered patterns |
quality_filter |
optional quality filter created with |
control |
list of control parameters |
returns a pat_apriori object
if (requireNamespace("arules", quietly = TRUE)) {
  data("AdultUCI", package = "arules")
  trans <- suppressWarnings(methods::as(as.data.frame(AdultUCI), "transactions"))
  utils <- patutils()
  pm <- pat_apriori(
    target = "rules",
    supp = 0.2,
    conf = 0.85,
    minlen = 2,
    maxlen = 3,
    rhs = c("native-country=United-States"),
    quality_filter = utils$quality_min(confidence = 0.9, lift = 1.03),
    control = list(verbose = FALSE)
  )
  pm <- fit(pm, trans)
  rules <- suppressWarnings(discover(pm, trans))
  eval <- evaluate(pm, rules)
  eval$metrics
}
Sequential pattern mining using arulesSequences::cspade.
pat_cspade(
  support = 0.4,
  maxsize = NULL,
  maxlen = NULL,
  mingap = NULL,
  maxgap = NULL,
  quality_filter = NULL,
  control = list(verbose = TRUE)
)
support |
minimum support threshold |
maxsize |
maximum number of items per event |
maxlen |
maximum number of events per sequence |
mingap |
minimum gap between successive events |
maxgap |
maximum gap between successive events |
quality_filter |
optional quality filter created with patutils() helpers such as quality_min() |
control |
list of control parameters |
returns a pat_cspade object
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  x <- arulesSequences::read_baskets(
    con = system.file("misc", "zaki.txt", package = "arulesSequences"),
    info = c("sequenceID", "eventID", "SIZE")
  )
  utils <- patutils()
  pm <- pat_cspade(
    support = 0.4,
    maxlen = 3,
    quality_filter = utils$quality_min(support = 0.5)
  )
  pm <- fit(pm, x)
  seqs <- discover(pm, x)
  eval <- evaluate(pm, seqs)
  eval$metrics
}
Frequent itemsets using arules::eclat.
pat_eclat(
  supp = 0.5,
  minlen = 1,
  maxlen = 3,
  include = NULL,
  exclude = NULL,
  quality_filter = NULL,
  control = NULL
)
supp |
minimum support threshold |
minlen |
minimum itemset length |
maxlen |
maximum itemset length |
include |
optional vector of items allowed in the discovered itemsets |
exclude |
optional vector of items forbidden in the discovered itemsets |
quality_filter |
optional quality filter created with patutils() helpers such as quality_min() |
control |
list of control parameters |
returns a pat_eclat object
if (requireNamespace("arules", quietly = TRUE)) {
  data("AdultUCI", package = "arules")
  trans <- suppressWarnings(methods::as(as.data.frame(AdultUCI), "transactions"))
  utils <- patutils()
  pm <- pat_eclat(
    supp = 0.2,
    maxlen = 3,
    include = c("sex=Male", "income=small", "marital-status=Married-civ-spouse", "race=White"),
    exclude = c("income=small"),
    quality_filter = utils$quality_min(support = 0.4),
    control = list(verbose = FALSE)
  )
  pm <- fit(pm, trans)
  itemsets <- discover(pm, trans)
  eval <- evaluate(pm, itemsets)
  eval$metrics
}
Base class for frequent pattern and sequence mining.
pattern_miner()
Pattern miners follow a lightweight Experiment Line:
fit() validates the mining input and stores a schema signature
discover() runs the mining algorithm on data compatible with that schema
evaluate() summarizes pattern quality and filtering effects
Different miners may normalize their inputs differently (for example, item transactions versus sequence transactions), but the base contract remains the same.
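The three-step contract described above can be pictured with a toy miner written in base R. This is only a sketch: `toy_miner`, `toy_fit`, `toy_discover`, and `toy_evaluate` are hypothetical names invented for illustration and are not part of the package, and the "schema" here is assumed to be simply the set of items seen at fit time.

```r
# Hypothetical toy miner; none of these names belong to the package.
toy_miner <- function(min_count = 2) {
  structure(list(min_count = min_count, schema = NULL), class = "toy_miner")
}

# fit: validate the input and remember which items were seen (the "schema")
toy_fit <- function(obj, transactions) {
  stopifnot(is.list(transactions), length(transactions) > 0)
  obj$schema <- sort(unique(unlist(transactions)))
  obj
}

# discover: count single items on data compatible with the stored schema
toy_discover <- function(obj, transactions) {
  stopifnot(all(unlist(transactions) %in% obj$schema))
  counts <- table(unlist(lapply(transactions, unique)))
  counts[counts >= obj$min_count]
}

# evaluate: summarize the retained patterns
toy_evaluate <- function(obj, patterns, n) {
  list(metrics = data.frame(
    count = length(patterns),
    mean_support = mean(as.numeric(patterns)) / n
  ))
}

tx <- list(c("a", "b"), c("a", "c"), c("a", "b", "c"), c("b"))
m <- toy_fit(toy_miner(min_count = 3), tx)
found <- toy_discover(m, tx)
res <- toy_evaluate(m, found, length(tx))
res$metrics
```

Keeping validation in the fit step means the discover step can reject data whose items were never seen, which is one plausible reading of "data compatible with that schema".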
returns a pattern_miner object
Utility object that groups filtering helpers and evaluation metrics for pattern mining.
patutils()
The object groups two kinds of helpers:
quality-filter builders such as quality_min() and quality_max()
descriptive metrics for discovered patterns such as pattern count, mean support, mean confidence, mean lift, mean length, and retained ratio after filtering
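To make the metrics listed above concrete, the following base R sketch computes support, confidence, and lift by hand for a single rule on a tiny, made-up transaction list. The data and the `support()` helper are illustrative assumptions and do not reproduce how patutils() computes its summaries.

```r
# Toy transactions (hypothetical data, not from the package)
tx <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk")
)

# Fraction of transactions that contain every item in `items`
support <- function(items, tx) {
  mean(vapply(tx, function(t) all(items %in% t), logical(1)))
}

# Rule: {bread} => {milk}
supp_rule <- support(c("bread", "milk"), tx)    # support of the whole rule
conf_rule <- supp_rule / support("bread", tx)   # confidence = supp(rule) / supp(lhs)
lift_rule <- conf_rule / support("milk", tx)    # lift = confidence / supp(rhs)

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift below 1, as in this toy case, indicates that the left-hand side makes the right-hand side less likely than its baseline frequency, which is the kind of rule a filter such as quality_min(lift = ...) would drop.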
returns a patutils object
utils <- patutils()
utils$quality_min(confidence = 0.8, lift = 1.1)
Draw a simple bar chart from a two‑column data.frame: first column as categories (x), second as values.
plot_bar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
data |
two‑column data.frame: category in the first column, numeric values in the second |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single value) |
alpha |
bar transparency (0–1) |
If colors is provided, a constant fill is used; otherwise ggplot2's default palette applies.
alpha controls bar transparency. The first column is coerced to factor when needed.
returns a ggplot2::ggplot graphic
# summarizing iris dataset
data <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarize(Sepal.Length = mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_bar(data, colors = "blue")
plot(grf)
Boxplots for each numeric column of a data.frame.
plot_boxplot(data, label_x = "", label_y = "", colors = NULL, barwidth = 0.25)
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for boxes |
barwidth |
width of the box (numeric) |
The data is melted to long format and a box is drawn per original column. If colors is provided,
a constant fill is applied to all boxes. Use barwidth to control box width.
returns a ggplot2::ggplot graphic
grf <- plot_boxplot(iris, colors = "white")
plot(grf)
Boxplots of a numeric column grouped by a class label.
plot_boxplot_class(
  data,
  class_label,
  label_x = "",
  label_y = "",
  colors = NULL
)
data |
data.frame with a grouping column and one numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for the boxes |
Expects a data.frame with the grouping column named in class_label and one numeric column.
The function melts to long format and draws per‑group distributions.
returns a ggplot2::ggplot graphic
grf <- plot_boxplot_class(
  iris |> dplyr::select(Sepal.Width, Species),
  class_label = "Species",
  colors = c("red", "green", "blue")
)
plot(grf)
Correlation heatmap with optional labels and triangle filtering.
plot_correlation(
  df,
  vars = NULL,
  method = c("pearson", "spearman", "kendall"),
  use = "pairwise.complete.obs",
  triangle = c("full", "upper", "lower"),
  reorder = c("none", "hclust", "alphabetical"),
  digits = 2,
  label_size = 3,
  tile_color = "white",
  show_diag = TRUE,
  title = NULL
)
df |
data.frame with numeric columns |
vars |
optional vector of column names to include |
method |
correlation method: "pearson", "spearman", or "kendall" |
use |
handling of missing values for the correlation computation (e.g., "pairwise.complete.obs") |
triangle |
which triangle to show: "full", "upper", or "lower" |
reorder |
reordering strategy: "none", "hclust", or "alphabetical" |
digits |
number of digits for labels |
label_size |
size of label text |
tile_color |
border color for tiles |
show_diag |
whether to show the diagonal |
title |
optional plot title |
Computes a correlation matrix from numeric columns (or vars) and renders a ggplot2
heatmap with values annotated. Supports reordering by hierarchical clustering or alphabetically.
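The data preparation behind such a heatmap can be sketched in base R. This is an illustration under assumptions, not the function's internal code: in particular, using `1 - correlation` as the clustering distance is a common choice that is assumed here, and the long table `long` is a hypothetical intermediate.

```r
num <- iris[, 1:4]

# Correlation matrix with pairwise handling of missing values
cm <- stats::cor(num, method = "pearson", use = "pairwise.complete.obs")

# Reorder variables by hierarchical clustering (assumed distance: 1 - correlation)
ord <- stats::hclust(stats::as.dist(1 - cm))$order
cm <- cm[ord, ord]

# Keep only the lower triangle (diagonal included) as a long table,
# which is the shape a tile-based heatmap layer would consume
keep <- lower.tri(cm, diag = TRUE)
long <- data.frame(
  row = rownames(cm)[row(cm)[keep]],
  col = colnames(cm)[col(cm)[keep]],
  r = round(cm[keep], 2)
)
head(long)
```

With four variables the lower triangle plus diagonal has ten cells; dropping the diagonal (as show_diag = FALSE suggests) would leave six.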
returns a ggplot2::ggplot graphic
data(iris)
grf <- plot_correlation(iris[,1:4])
plot(grf)
Dendrogram plot for an hclust or dendrogram object using ggplot2.
plot_dendrogram(hc, labels = TRUE, label_size = 3, title = NULL)
hc |
an object of class hclust or dendrogram |
labels |
logical; whether to draw leaf labels |
label_size |
label text size |
title |
optional plot title |
Converts a dendrogram into line segments and renders it with ggplot2.
returns a ggplot2::ggplot graphic
data(iris)
hc <- hclust(dist(scale(iris[,1:4])), method = "ward.D2")
grf <- plot_dendrogram(hc)
plot(grf)
Kernel density plot for one or multiple numeric columns.
plot_density(
  data,
  label_x = "",
  label_y = "",
  colors = NULL,
  bin = NULL,
  alpha = 0.25
)
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single column) or vector for groups |
bin |
optional bin width passed to geom_density |
alpha |
fill transparency (0–1) |
If data has multiple numeric columns, densities are overlaid and filled by column (group).
When a single column is provided, colors (if set) is used as a constant fill.
The bin argument is passed to geom_density(binwidth=...).
returns a ggplot2::ggplot graphic
grf <- plot_density(iris |> dplyr::select(Sepal.Width), colors = "blue")
plot(grf)
Kernel density plot grouped by a class label.
plot_density_class(
  data,
  class_label,
  label_x = "",
  label_y = "",
  colors = NULL,
  bin = NULL,
  alpha = 0.5
)
data |
data.frame with class label and a numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fills per class |
bin |
optional bin width passed to geom_density |
alpha |
fill transparency (0–1) |
Expects data with a grouping column named in class_label and one numeric column. Each group is
filled with a distinct color (if provided).
returns a ggplot2::ggplot graphic
grf <- plot_density_class(
  iris |> dplyr::select(Sepal.Width, Species),
  class_label = "Species",
  colors = c("red", "green", "blue")
)
plot(grf)
Grouped (side‑by‑side) bar chart for multiple series per category.
plot_groupedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Expects a data.frame where the first column is the category (x) and the remaining columns are
numeric series. Bars are grouped by series. Provide colors with length equal to the number of series to set fills.
returns a ggplot2::ggplot graphic
# summarizing iris dataset
data <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarize(Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width))
head(data)
# plotting data
grf <- plot_groupedbar(data, colors = c("blue", "red"))
plot(grf)
Histogram for a numeric column using ggplot2.
plot_hist(data, label_x = "", label_y = "", color = "white", alpha = 0.25)
data |
data.frame with one numeric column (first column is used if multiple) |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
fill color |
alpha |
transparency level (0–1) |
If multiple columns are provided, only the first is used. Breaks are computed via graphics::hist to
mirror base R binning. color controls the fill; alpha the transparency.
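As a rough illustration of what "breaks computed via graphics::hist" means, base R can return the bin edges without drawing anything. The snippet below is a sketch of that idea only; how the function forwards the breaks to ggplot2 is not shown in this page, so the closing comment is an assumption.

```r
x <- iris$Sepal.Width

# plot = FALSE returns the histogram object (breaks, counts, ...) without drawing
h <- graphics::hist(x, plot = FALSE)
h$breaks
h$counts

# Assumption: the same edges could be handed to a ggplot2 histogram layer,
# e.g. ggplot2::geom_histogram(breaks = h$breaks)
```

Reusing base R's breaks keeps the bin edges at "pretty" values, so the ggplot2 output lines up with what hist() would have drawn.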
returns a ggplot2::ggplot graphic
grf <- plot_hist(iris |> dplyr::select(Sepal.Width), color = c("blue"))
plot(grf)
Lollipop chart (stick + circle + value label) per category.
plot_lollipop(
  data,
  label_x = "",
  label_y = "",
  colors = NULL,
  color_text = "black",
  size_text = 3,
  size_ball = 8,
  alpha_ball = 0.2,
  min_value = 0,
  max_value_gap = 1
)
data |
data.frame with category and numeric values |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
stick/circle color |
color_text |
color of the text inside the circle |
size_text |
text size |
size_ball |
circle size |
alpha_ball |
circle transparency (0–1) |
min_value |
minimum baseline for the stick |
max_value_gap |
gap from value to stick end |
Expects a data.frame with category in the first column and numeric values in subsequent columns.
Circles are drawn at values, with vertical segments extending from min_value to value - max_value_gap.
returns a ggplot2::ggplot graphic
# summarizing iris dataset
data <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarize(Sepal.Length = mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_lollipop(data, colors = "blue", max_value_gap = 0.2)
plot(grf)
Scatter matrix using GGally::ggpairs with optional class coloring.
plot_pair(data, cnames, title = NULL, clabel = NULL, colors = NULL)
data |
data.frame |
cnames |
column names to include |
title |
optional title |
clabel |
optional class label column name |
colors |
optional vector of colors for classes |
returns a ggplot2::ggplot graphic
data(iris)
grf <- plot_pair(iris, cnames = colnames(iris)[1:4], title = "Iris")
print(grf)
Scatter matrix with class coloring and manual palette application.
plot_pair_adv(data, cnames, title = NULL, clabel = NULL, colors = NULL)
data |
data.frame |
cnames |
column names to include |
title |
optional title |
clabel |
optional class label column name |
colors |
optional vector of colors for classes |
returns a ggplot2::ggplot graphic
data(iris)
grf <- plot_pair_adv(iris, cnames = colnames(iris)[1:4], title = "Iris")
print(grf)
Parallel coordinates plot using GGally::ggparcoord.
plot_parallel(data, columns, group, colors = NULL, title = NULL)
data |
data.frame |
columns |
numeric columns to include (indices or names) |
group |
grouping column (index or name) |
colors |
optional vector of colors for groups |
title |
optional title |
returns a ggplot2::ggplot graphic
data(iris)
grf <- plot_parallel(iris, columns = 1:4, group = 5)
plot(grf)
Pie chart from a two‑column data.frame (category, value) using polar coordinates.
plot_pieplot(
  data,
  label_x = "",
  label_y = "",
  colors = NULL,
  textcolor = "white",
  bordercolor = "black"
)
data |
two‑column data.frame with category and value |
label_x |
x‑axis label (unused in pie, kept for symmetry) |
label_y |
y‑axis label (unused in pie) |
colors |
vector of slice fills |
textcolor |
label text color |
bordercolor |
slice border color |
Slices are sized by the second (numeric) column. Text and border colors can be customized.
returns a ggplot2::ggplot graphic
# summarizing iris dataset
data <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarize(Sepal.Length = mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_pieplot(data, colors = c("red", "green", "blue"))
plot(grf)
Pixel-oriented visualization of a numeric matrix or data.frame.
plot_pixel(
  data,
  colors = NULL,
  title = NULL,
  label_x = "sample",
  label_y = "Attributes"
)
data |
numeric matrix or data.frame |
colors |
optional vector of colors for the fill gradient |
title |
optional plot title |
label_x |
x-axis label |
label_y |
y-axis label |
Renders a heatmap-like plot where each cell is a pixel. Useful for multivariate inspection.
returns a ggplot2::ggplot graphic
data(iris)
grf <- plot_pixel(as.matrix(iris[,1:4]), title = "Iris")
plot(grf)
Dot chart for multiple series across categories (points only).
plot_points(data, label_x = "", label_y = "", colors = NULL)
data |
data.frame with category + one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color vector for series |
Expects a data.frame with category in the first column and one or more numeric series.
Points are colored by series (legend shows original column names). Supply colors to override the palette.
returns a ggplot2::ggplot graphic
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin = sin(x), cosine = cos(x) + 5)
head(data)
grf <- plot_points(data, colors = c("red", "green"))
plot(grf)
Radar (spider) chart for a single profile of variables using radial axes.
plot_radar(data, label_x = "", label_y = "", colors = NULL)
data |
two‑column data.frame: variable name and value |
label_x |
x‑axis label (unused; variable names are shown around the circle) |
label_y |
y‑axis label |
colors |
line/fill color for the polygon |
Expects a two‑column data.frame with variable names in the first column and numeric values in the second.
The graphic is built as an n-sided polygon, where n is the number of variables, so at least three
variables are required. The function already sets the drawing limits for the full polygon; adding
ylim() or other Cartesian clipping after the fact can hide part of the radar.
returns a ggplot2::ggplot graphic
data <- data.frame(name = "Petal.Length", value = mean(iris$Petal.Length))
data <- rbind(data, data.frame(name = "Petal.Width", value = mean(iris$Petal.Width)))
data <- rbind(data, data.frame(name = "Sepal.Length", value = mean(iris$Sepal.Length)))
data <- rbind(data, data.frame(name = "Sepal.Width", value = mean(iris$Sepal.Width)))
grf <- plot_radar(data, colors = "red")
plot(grf)
Scatter plot from a long data.frame with columns named x, value, and variable.
plot_scatter(data, label_x = "", label_y = "", colors = NULL)
data |
long data.frame with columns x, value, and variable |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color(s); for numeric variable, used to build a gradient color scale |
Colors are mapped to variable. If variable is numeric, a gradient color scale is used when colors is provided.
returns a ggplot2::ggplot graphic
grf <- plot_scatter(
  iris |> dplyr::select(x = Sepal.Length, value = Sepal.Width, variable = Species),
  label_x = "Sepal.Length",
  label_y = "Sepal.Width",
  colors = c("red", "green", "blue")
)
plot(grf)
Line plot for one or more series over a common x index.
plot_series(data, label_x = "", label_y = "", colors = NULL)
data |
data.frame with x in the first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of colors for series |
Expects a data.frame where the first column is the x index and remaining columns are numeric series.
Points and lines are drawn per series; supply colors to override the palette.
returns a ggplot2::ggplot graphic
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin = sin(x))
head(data)
grf <- plot_series(data, colors = c("red"))
plot(grf)
Stacked bar chart for multiple series per category.
plot_stackedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Expects a data.frame with category in the first column and series in remaining columns.
Bars are stacked within each category. Provide colors (one per series) to control fills.
returns a ggplot2::ggplot graphic
# summarizing iris dataset
data <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarize(Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width))
# plotting data
grf <- plot_stackedbar(data, colors = c("blue", "red"))
plot(grf)
Simple time series plot with points and a line.
plot_ts(x = NULL, y, label_x = "", label_y = "", color = "black")
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric series |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
color for the series |
If x is NULL, an integer index 1:n is used. The color applies to both points and line.
returns a ggplot2::ggplot graphic
x <- seq(0, 10, 0.25)
y <- sin(x)
grf <- plot_ts(x = x, y = y, color = c("red"))
plot(grf)
Plot original series plus dashed lines for in‑sample adjustment and optional out‑of‑sample predictions.
plot_ts_pred(
  x = NULL,
  y,
  yadj,
  ypred = NULL,
  label_x = "",
  label_y = "",
  color = "black",
  color_adjust = "blue",
  color_prediction = "green"
)
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric time series |
yadj |
fitted/adjusted values for the training window |
ypred |
optional predicted values after the training window |
label_x |
x‑axis title |
label_y |
y‑axis title |
color |
color for the original series |
color_adjust |
color for the adjusted values (dashed) |
color_prediction |
color for the predictions (dashed) |
yadj length defines the training segment; ypred (if provided) is appended after yadj.
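The way the two segments line up along the series can be sketched with plain index arithmetic. This is an illustrative reading of the sentence above, not the function's source; `idx_adj` and `idx_pred` are hypothetical names.

```r
set.seed(1)
x <- seq(0, 10, 0.25)                      # 41 time points
y <- sin(x) + rnorm(length(x), 0, 0.1)     # observed series
yadj <- sin(x[1:35])                       # in-sample fitted values
ypred <- sin(x[36:41])                     # out-of-sample predictions

n_train <- length(yadj)
idx_adj <- seq_len(n_train)                # positions covered by yadj
idx_pred <- n_train + seq_along(ypred)     # positions covered by ypred

range(idx_adj)
range(idx_pred)
```

Under this reading, length(yadj) + length(ypred) should not exceed length(y); otherwise the dashed prediction line would run past the observed series.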
returns a ggplot2::ggplot graphic
x <- base::seq(0, 10, 0.25)
yvalues <- sin(x) + rnorm(41, 0, 0.1)
adjust <- sin(x[1:35])
prediction <- sin(x[36:41])
grf <- plot_ts_pred(y = yvalues, yadj = adjust, ypred = prediction)
plot(grf)
Ancestor class for supervised predictors (classification and regression).
Provides a default fit() to record feature names and proxies action() to predict().
An example predictor is a decision tree classifier (cla_dtree).
predictor()
returns a predictor object
# See ?cla_dtree for a classification example using a decision tree
Regression tree using recursive partitioning via the tree package.
reg_dtree(attribute)
attribute |
name of the target attribute to model |
Splits are chosen to reduce squared error within nodes; result is an interpretable set of piecewise constants.
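A single split chosen by squared-error reduction can be sketched in a few lines of base R. This is a didactic sketch on made-up data for one numeric predictor; `sse()` and `best_split()` are hypothetical helpers and do not reflect how the tree package searches for splits internally.

```r
# Sum of squared errors around the node mean
sse <- function(v) sum((v - mean(v))^2)

# Best single split of y on x: try midpoints between sorted unique x values
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  cost <- vapply(
    cuts,
    function(cut) sse(y[x < cut]) + sse(y[x >= cut]),
    numeric(1)
  )
  list(cut = cuts[which.min(cost)], sse = min(cost))
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
bs <- best_split(x, y)
bs
```

Each side of the chosen cut is then predicted by its mean, which is where the "piecewise constants" come from; a full tree repeats the search recursively inside each side.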
returns a decision tree regression object
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
data(Boston)
model <- reg_dtree("medv")
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
KNN regression using FNN::knn.reg, predicting by averaging the targets of the k nearest neighbors.
reg_knn(attribute, k)
attribute |
name of the target attribute to model |
k |
number of nearest neighbors (k) |
Non‑parametric approach suitable for local smoothing. Sensitive to feature scaling; consider normalization beforehand.
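The averaging rule and the effect of scaling can be sketched by hand in base R. This is an illustration only: `knn_predict()` is a hypothetical helper, mtcars is used as stand-in data, and z-score scaling with training statistics is an assumed normalization choice rather than what the package applies.

```r
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Scale with training statistics so no feature dominates the distance
  ctr <- colMeans(train_x)
  sds <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = ctr, scale = sds)
  nw <- scale(new_x, center = ctr, scale = sds)
  apply(nw, 1, function(p) {
    # Euclidean distance from the query point to every training row
    d <- sqrt(rowSums((tr - matrix(p, nrow(tr), length(p), byrow = TRUE))^2))
    # Prediction: mean target of the k closest training rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

train_x <- as.matrix(mtcars[1:25, c("wt", "hp")])
train_y <- mtcars$mpg[1:25]
new_x <- as.matrix(mtcars[26:32, c("wt", "hp")])

pred <- knn_predict(train_x, train_y, new_x, k = 3)
pred
```

Without the scaling step, hp (values in the hundreds) would dominate wt (values below 6) in the distance, which is the sensitivity the note above warns about.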
returns a knn regression object
Altman, N. (1992). An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression.
data(Boston)
model <- reg_knn("medv", k = 3)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Linear regression using stats::lm.
reg_lm(formula = NULL, attribute = NULL, features = NULL)
formula |
optional regression formula (e.g., y ~ x1 + x2). |
attribute |
target attribute name (used when formula is NULL) |
features |
optional vector of feature names (used when formula is NULL) |
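One way to picture the attribute/features alternative to a formula is to assemble the formula from names with stats::reformulate. The helper below is a hypothetical sketch, not the package's internal code; in particular, falling back to all remaining columns (`.`) when features is NULL is an assumption.

```r
# Hypothetical helper: build a formula from a target name and feature names
make_formula <- function(attribute, features = NULL) {
  if (is.null(features)) features <- "."   # assumed default: all other columns
  stats::reformulate(features, response = attribute)
}

f <- make_formula("mpg", c("wt", "hp"))
f

model <- stats::lm(f, data = mtcars)
coef(model)
```

Passing a formula directly remains the more flexible route, since it allows terms such as poly(lstat, 2, raw = TRUE) that a plain vector of feature names cannot express.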
returns a reg_lm object
if (requireNamespace("MASS", quietly = TRUE)) {
  data(Boston, package = "MASS")
  # Simple linear regression
  model_simple <- reg_lm(formula = medv ~ lstat)
  model_simple <- fit(model_simple, Boston)
  pred_simple <- predict(model_simple, Boston)
  eval_simple <- evaluate(model_simple, Boston$medv, pred_simple)
  eval_simple$metrics
  # Polynomial regression (degree 2)
  model_poly <- reg_lm(formula = medv ~ poly(lstat, 2, raw = TRUE))
  model_poly <- fit(model_poly, Boston)
  pred_poly <- predict(model_poly, Boston)
  eval_poly <- evaluate(model_poly, Boston$medv, pred_poly)
  eval_poly$metrics
  # Multiple regression
  model_multi <- reg_lm(formula = medv ~ lstat + rm + ptratio)
  model_multi <- fit(model_multi, Boston)
  pred_multi <- predict(model_multi, Boston)
  eval_multi <- evaluate(model_multi, Boston$medv, pred_multi)
  eval_multi$metrics
}
Multi-Layer Perceptron regression using nnet::nnet (single hidden layer).
reg_mlp(attribute, size = NULL, decay = 0.05, maxit = 1000)
attribute |
name of the target attribute to model |
size |
number of neurons in the hidden layer |
decay |
weight decay (L2 regularization strength) |
maxit |
number of maximum iterations for training |
Feedforward neural network with size hidden units and L2 regularization controlled by decay.
Data should be scaled for stable training.
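A base R sketch of min-max scaling, fitted on the training rows and reused on the test rows, shows one way to meet the scaling advice above. `minmax_fit()` and `minmax_apply()` are hypothetical helpers written for this illustration; they are separate from the package's own minmax() transform, and mtcars stands in for real data.

```r
# Learn per-column minimum and maximum on the training rows only
minmax_fit <- function(df) {
  list(min = vapply(df, min, numeric(1)), max = vapply(df, max, numeric(1)))
}

# Rescale each column with the stored parameters
minmax_apply <- function(df, p) {
  as.data.frame(Map(function(col, lo, hi) (col - lo) / (hi - lo), df, p$min, p$max))
}

set.seed(1)
idx <- sample(nrow(mtcars), 24)
train <- mtcars[idx, c("mpg", "wt", "hp")]
test <- mtcars[-idx, c("mpg", "wt", "hp")]

p <- minmax_fit(train)
train_s <- minmax_apply(train, p)   # every column spans exactly [0, 1]
test_s <- minmax_apply(test, p)     # may fall slightly outside [0, 1]
range(train_s$wt)
```

Reusing the training parameters on the test rows avoids leaking test information into the model; predictions for a scaled target then need to be mapped back with the same minimum and maximum.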
returns an object of class reg_mlp
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
data(Boston)
model <- reg_mlp("medv", size = 5, decay = 0.54)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Regression via Random Forests, an ensemble of decision trees trained
on bootstrap samples with random feature subsetting at each split. This wrapper
uses the randomForest package API.
reg_rf(attribute, nodesize = 1, ntree = 10, mtry = NULL)
attribute |
name of the target attribute to model |
nodesize |
minimum size of terminal nodes |
ntree |
number of trees |
mtry |
number of variables tried at each split |
Random Forests reduce variance by averaging many trees and tend to be robust to overfitting on tabular data.
Key hyperparameters are the number of trees (ntree), the number of variables tried at
each split (mtry), and the minimum node size (nodesize).
returns an object of class reg_rfobj
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
data(Boston)
model <- reg_rf("medv", ntree = 10)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Support Vector Regression (SVR) using e1071::svm.

    reg_svm(
      attribute,
      epsilon = 0.1,
      cost = 10,
      kernel = c("radial", "linear", "polynomial", "sigmoid")
    )

| Argument | Description |
|---|---|
| attribute | name of the target attribute for model building |
| epsilon | width of the epsilon-insensitive tube around the regression function; errors smaller than epsilon are not penalized |
| cost | regularization parameter that controls the trade-off between a flat regression function and the penalty on errors larger than epsilon |
| kernel | the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |

SVR optimizes a margin with an epsilon‑insensitive loss around the regression function.
The cost controls regularization strength; epsilon sets the width of the insensitive tube; and
kernel defines the feature map (linear, radial, polynomial, sigmoid).
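The epsilon-insensitive loss mentioned above can be written in a few lines of base R. This is a minimal sketch of the loss only, not of the e1071::svm optimizer; the function name eps_insensitive_loss and the sample values are hypothetical.

```r
# Residuals inside the tube (|y - y_hat| <= epsilon) cost nothing;
# outside the tube, the loss grows linearly with the excess.
eps_insensitive_loss <- function(y, y_hat, epsilon = 0.1) {
  pmax(abs(y - y_hat) - epsilon, 0)
}

y     <- c(1.00, 2.00, 3.00)
y_hat <- c(1.05, 2.50, 2.00)

loss <- eps_insensitive_loss(y, y_hat, epsilon = 0.1)
loss
```

Increasing epsilon widens the tube, so more residuals fall inside it and contribute no loss.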
returns an SVM regression object
Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V. (1997). Support Vector Regression Machines. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.

    data(Boston)
    model <- reg_svm("medv", epsilon=0.2, cost=40)
    # preparing dataset for random sampling
    sr <- sample_random()
    sr <- train_test(sr, Boston)
    train <- sr$train
    test <- sr$test
    model <- fit(model, train)
    test_prediction <- predict(model, test)
    test_predictand <- test[,"medv"]
    test_eval <- evaluate(model, test_predictand, test_prediction)
    test_eval$metrics

Tune hyperparameters of a base regressor via k‑fold cross‑validation minimizing an error metric (MSE).

    reg_tune(base_model, folds = 10, ranges = NULL)

| Argument | Description |
|---|---|
| base_model | base model for tuning |
| folds | number of folds for cross-validation |
| ranges | a list of hyperparameter ranges to explore |

returns a reg_tune object.
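The general idea of tuning by k-fold cross-validation can be sketched in base R as follows. This is an illustration under stated assumptions, not the reg_tune implementation: the base learner is a polynomial lm fit on mtcars, the single hyperparameter is the polynomial degree, and all variable names are hypothetical.

```r
set.seed(1)
data(mtcars)

folds <- 3
candidates <- expand.grid(degree = 1:3)  # hyperparameter grid to explore

# Assign every row to one of the folds at random
fold_id <- sample(rep(seq_len(folds), length.out = nrow(mtcars)))

# Average the test MSE over the folds for each candidate
cv_mse <- sapply(candidates$degree, function(d) {
  errs <- sapply(seq_len(folds), function(k) {
    train <- mtcars[fold_id != k, ]
    test  <- mtcars[fold_id == k, ]
    fit <- lm(mpg ~ poly(wt, d), data = train)
    mean((test$mpg - predict(fit, test))^2)
  })
  mean(errs)
})

# Keep the candidate with the lowest cross-validated MSE
best <- candidates[which.min(cv_mse), , drop = FALSE]
best
```

After selection, the chosen configuration would typically be refit on the full training set before predicting on held-out data.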
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.

    # preparing dataset for random sampling
    data(Boston)
    sr <- sample_random()
    sr <- train_test(sr, Boston)
    train <- sr$train
    test <- sr$test
    # hyper parameter setup
    tune <- reg_tune(reg_mlp("medv"), folds=3,
                     ranges = list(size=c(3), decay=c(0.1,0.5)))
    # hyper parameter optimization
    model <- fit(tune, train)
    test_prediction <- predict(model, test)
    test_predictand <- test[,"medv"]
    test_eval <- evaluate(model, test_predictand, test_prediction)
    test_eval$metrics

Ancestor class for regression models. Stores the target attribute and provides common evaluation metrics.

    regression(attribute)

| Argument | Description |
|---|---|
| attribute | name of the target attribute for model building |

returns a regression object

    # See ?reg_dtree for a regression example using a decision tree

Balance class distribution using up-sampling or down-sampling.

    sample_balance(attribute, method = c("down", "up"))

| Argument | Description |
|---|---|
| attribute | target class attribute name |
| method | balancing method: "down" or "up" |

returns an object of class sample_balance
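The two balancing strategies can be sketched in base R. This is an assumption-laden illustration rather than the package's implementation: down-sampling draws every class down to the size of the smallest class, and up-sampling resamples every class with replacement up to the size of the largest one. The imbalanced subset of iris and the variable names are hypothetical.

```r
set.seed(1)
data(iris)

# Build an imbalanced dataset: 50 / 20 / 10 rows per class
iris_imb <- iris[c(1:50, 51:70, 101:110), ]
rows_by_class <- split(seq_len(nrow(iris_imb)), iris_imb$Species)

# Down-sampling: keep n_min rows of every class, without replacement
n_min <- min(lengths(rows_by_class))
down_idx <- unlist(lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]))
iris_down <- iris_imb[down_idx, ]

# Up-sampling: draw n_max rows of every class, with replacement
n_max <- max(lengths(rows_by_class))
up_idx <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), n_max, replace = TRUE)]
}))
iris_up <- iris_imb[up_idx, ]

table(iris_down$Species)
table(iris_up$Species)
```

Down-sampling discards data from the majority classes, while up-sampling duplicates rows of the minority classes; which is preferable depends on the dataset size.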

    data(iris)
    iris_imb <- iris[iris$Species != "setosa", ]
    sb <- sample_balance("Species", method = "down")
    iris_bal <- transform(sb, iris_imb)
    table(iris_bal$Species)

Sample entire groups defined by a categorical attribute. In sampling theory, this design is known as cluster sampling (also called one-stage cluster sampling or sampling by groups). The groups are assumed to be pre-defined in the data; this function does not infer groups with clustering algorithms such as k-means.

    sample_groups(attribute, n_groups)

| Argument | Description |
|---|---|
| attribute | group-defining attribute name |
| n_groups | number of groups to sample |

returns an object of class sample_groups
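One-stage cluster sampling, as described above, amounts to drawing group labels at random and keeping every row of the drawn groups. A minimal base R sketch follows; the variable names are hypothetical and the code does not reproduce the internals of sample_groups.

```r
set.seed(1)
data(iris)

n_groups <- 2
groups <- unique(as.character(iris$Species))  # pre-defined groups in the data

# Draw whole groups, then keep all rows that belong to them
chosen <- sample(groups, n_groups)
iris_groups <- iris[iris$Species %in% chosen, ]

table(as.character(iris_groups$Species))
```

Unlike row-level sampling, no group is ever partially included: a group is either kept in full or dropped in full.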

    data(iris)
    sc <- sample_groups("Species", n_groups = 2)
    iris_sc <- transform(sc, iris)
    table(iris_sc$Species)

Train/test split and k‑fold partitioning by simple random sampling.

    sample_random()

returns an object of class sample_random

    # using random sampling
    sample <- sample_random()
    tt <- train_test(sample, iris)
    # distribution of train
    table(tt$train$Species)
    # preparing dataset into four folds
    folds <- k_fold(sample, iris, 4)
    # distribution of folds
    tbl <- NULL
    for (f in folds) {
      tbl <- rbind(tbl, table(f$Species))
    }
    head(tbl)

Sample rows or elements with or without replacement.

    sample_simple(size, replace = FALSE, prob = NULL)

| Argument | Description |
|---|---|
| size | number of samples to draw |
| replace | logical; sample with replacement if TRUE |
| prob | optional vector of sampling probabilities |

returns an object of class sample_simple

    data(iris)
    srswor <- sample_simple(size = 10, replace = FALSE)
    srswr <- sample_simple(size = 10, replace = TRUE)
    sample_wor <- transform(srswor, iris$Sepal.Length)
    sample_wr <- transform(srswr, iris$Sepal.Length)
    sample_wor
    sample_wr

Train/test split and k‑fold partitioning that preserve the target class proportions (strata).

    sample_stratified(attribute)

| Argument | Description |
|---|---|
| attribute | name of the target attribute whose class proportions are preserved |

returns an object of class sample_stratified
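A stratified train/test split can be sketched in base R by sampling the same fraction of rows inside every class. This is an illustrative sketch, not the sample_stratified implementation; the 0.8 fraction mirrors the perc default shown for train_test, and the variable names are hypothetical.

```r
set.seed(1)
data(iris)

perc <- 0.8
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample the same fraction of rows within each class (stratum)
train_idx <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), round(perc * length(i)))]
}))

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

table(train$Species)
table(test$Species)
```

Because the fraction is applied per class, the class proportions of the training and test sets match those of the full dataset.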

    # using stratified sampling
    sample <- sample_stratified("Species")
    tt <- train_test(sample, iris)
    # distribution of train
    table(tt$train$Species)
    # preparing dataset into four folds
    folds <- k_fold(sample, iris, 4)
    # distribution of folds
    tbl <- NULL
    for (f in folds) {
      tbl <- rbind(tbl, table(f$Species))
    }
    head(tbl)

Generic to select the best hyperparameters from cross‑validation results; subclasses can override.

    select_hyper(obj, hyperparameters)

| Argument | Description |
|---|---|
| obj | the object or model used for hyperparameter selection. |
| hyperparameters | dataset with the hyperparameters and the quality measure obtained from each execution |

returns the index of the selected hyperparameter set
Selects the optimal hyperparameter set by maximizing the average classification metric. It relies on the dplyr package.

    ## S3 method for class 'cla_tune'
    select_hyper(obj, hyperparameters)

| Argument | Description |
|---|---|
| obj | an object representing the model or tuning process |
| hyperparameters | a dataframe with columns |

returns the key of the selected (optimal) hyperparameter set
Assign a named list of parameters to matching fields in the object (best‑effort).

    set_params(obj, params)

| Argument | Description |
|---|---|
| obj | object of class dal_base |
| params | named list of parameters to set on obj |

returns an object with parameters set
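One way to read "matching fields (best-effort)" is that only names already present in the object are updated and any others are ignored. The base R sketch below illustrates that reading on a plain list; the helper set_fields is hypothetical, and whether set_params skips unmatched names in exactly this way is an assumption.

```r
# Hypothetical helper: copy values from params into obj for matching names only
set_fields <- function(obj, params) {
  matching <- intersect(names(params), names(obj))
  for (name in matching) {
    obj[[name]] <- params[[name]]
  }
  obj
}

obj <- list(size = 3, decay = 0.1)
obj <- set_fields(obj, list(decay = 0.5, unknown = 1))
str(obj)
```

A helper of this kind lets a tuning loop push one row of a hyperparameter grid into a model object without knowing in advance which fields the model exposes.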

    obj <- set_params(dal_base(), list(x = 0))

Default method for set_params (returns object unchanged).

    ## Default S3 method:
    set_params(obj, params)

| Argument | Description |
|---|---|
| obj | object |
| params | parameters |

returns the object unchanged
Family of smoothing methods that reduce noise by replacing values with the mean of a bin. Supported strategies include equal‑interval bins, equal‑frequency (quantile) bins, k-means quantization, and class-aware clustering.

    smoothing(n)

| Argument | Description |
|---|---|
| n | number of bins |

The smoothing level is controlled by n (number of bins/levels). The base helper tune()
chooses n by locating the elbow (maximum curvature) of the MSE curve across candidates.
Concrete subclasses may override that criterion when supervision is required. After fit(),
values are mapped to bin means via transform().
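The mapping of values to bin means, and the MSE curve over candidate values of n, can be sketched in base R. The example uses plain equal-width bins between the minimum and maximum and does not reproduce the package's elbow criterion; the helper smooth_equal_width is hypothetical.

```r
data(iris)
x <- iris$Sepal.Length

# Hypothetical helper: equal-width bins, each value replaced by its bin mean
smooth_equal_width <- function(x, n) {
  breaks <- seq(min(x), max(x), length.out = n + 1)
  bin <- cut(x, breaks, labels = FALSE, include.lowest = TRUE)
  bin_means <- tapply(x, bin, mean)         # one mean per non-empty bin
  as.numeric(bin_means[as.character(bin)])  # map each value to its bin mean
}

x_smooth <- smooth_equal_width(x, n = 2)
table(x_smooth)

# Reconstruction error (MSE) for several candidate numbers of bins
mse_by_n <- sapply(1:8, function(n) mean((x - smooth_equal_width(x, n))^2))
round(mse_by_n, 4)
```

With n = 1 every value collapses to the global mean, so the first entry of the curve is the largest error any binning can produce; the elbow is then sought among the larger values of n.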
returns an object of class smoothing

    data(iris)
    obj <- smoothing_inter(n = 2)
    obj <- fit(obj, iris$Sepal.Length)
    sl.bi <- transform(obj, iris$Sepal.Length)
    table(sl.bi)
    obj$interval
    bins <- cut(iris$Sepal.Length, unique(obj$interval.adj), FALSE,
                include.lowest = TRUE)
    entro <- evaluate(obj, bins, iris$Species)
    entro$entropy

Discretize a numeric attribute into n bins by clustering the attribute together
with a one-hot representation of the class label, then projecting the clusters back to
ordered cut points on the numeric axis.

    smoothing_cluster(class_label, n)

| Argument | Description |
|---|---|
| class_label | name of the class attribute |
| n | number of bins |

returns an object of class smoothing_cluster
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)

    data(iris)
    cluster_data <- iris[, c("Sepal.Length", "Species")]
    obj <- smoothing_cluster("Species", n = 2)
    obj <- fit(obj, cluster_data)
    sl.bi <- transform(obj, iris$Sepal.Length)
    table(sl.bi)
    obj$interval
    bins <- cut(iris$Sepal.Length, unique(obj$interval.adj), FALSE,
                include.lowest = TRUE)
    entro <- evaluate(obj, bins, iris$Species)
    entro$entropy

Discretize a numeric vector into n bins with approximately equal frequency (quantile cuts),
and replace each value by the mean of its bin.

    smoothing_freq(n)

| Argument | Description |
|---|---|
| n | number of bins |

returns an object of class smoothing_freq
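Equal-frequency binning can be sketched in base R by cutting at quantiles instead of at evenly spaced values. This is an illustration of the technique only; the variable names are hypothetical and the package's handling of ties and boundaries may differ.

```r
data(iris)
x <- iris$Sepal.Length
n <- 2

# Cut points at the 0, 1/n, ..., 1 quantiles; unique() guards against tied quantiles
probs <- seq(0, 1, length.out = n + 1)
breaks <- unique(quantile(x, probs = probs))

bin <- cut(x, breaks, labels = FALSE, include.lowest = TRUE)
bin_means <- tapply(x, bin, mean)
x_smooth <- as.numeric(bin_means[as.character(bin)])

table(bin)
```

Tied values at a cut point all fall into the same bin, which is why the bin counts are only approximately equal.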
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)

    data(iris)
    obj <- smoothing_freq(n = 2)
    obj <- fit(obj, iris$Sepal.Length)
    sl.bi <- transform(obj, iris$Sepal.Length)
    table(sl.bi)
    obj$interval
    bins <- cut(iris$Sepal.Length, unique(obj$interval.adj), FALSE,
                include.lowest = TRUE)
    entro <- evaluate(obj, bins, iris$Species)
    entro$entropy

Discretize a numeric vector into n equal‑width intervals (robust bounds via boxplot whiskers)
and replace each value by the bin mean.

    smoothing_inter(n)

| Argument | Description |
|---|---|
| n | number of bins |

returns an object of class smoothing_inter
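The role of the boxplot whiskers can be illustrated in base R: the interval bounds are taken from the whisker ends rather than from the raw minimum and maximum, so a single extreme value does not stretch the bins. The sketch assumes whisker ends as returned by boxplot.stats; the artificial outlier and the variable names are hypothetical, and how the package treats values outside the whiskers is not shown here.

```r
data(iris)

# 30 is an artificial outlier added for illustration
x <- c(iris$Sepal.Length, 30)

# Elements 1 and 5 of $stats are the lower and upper whisker ends
whiskers <- boxplot.stats(x)$stats[c(1, 5)]

n <- 2
breaks <- seq(whiskers[1], whiskers[2], length.out = n + 1)

whiskers
breaks
```

With the raw range, the outlier would push the midpoint far above every ordinary observation and leave almost all of them in one bin; bounding by the whiskers keeps the intervals on the bulk of the data.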
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)

    data(iris)
    obj <- smoothing_inter(n = 2)
    obj <- fit(obj, iris$Sepal.Length)
    sl.bi <- transform(obj, iris$Sepal.Length)
    table(sl.bi)
    obj$interval
    bins <- cut(iris$Sepal.Length, unique(obj$interval.adj), FALSE,
                include.lowest = TRUE)
    entro <- evaluate(obj, bins, iris$Species)
    entro$entropy

Quantize a numeric vector into n levels using k‑means on the values and
replace each value by its cluster mean (vector quantization).

    smoothing_quantization(n)

| Argument | Description |
|---|---|
| n | number of bins |

returns an object of class smoothing_quantization
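Vector quantization of a numeric vector can be sketched with stats::kmeans from base R. This is a minimal illustration rather than the package's implementation; the nstart value and the variable names are assumptions.

```r
set.seed(1)
data(iris)
x <- iris$Sepal.Length
n <- 2

# Cluster the values themselves (a one-column problem)
km <- kmeans(x, centers = n, nstart = 10)

# Replace every value by the mean of the cluster it was assigned to
level_means <- tapply(x, km$cluster, mean)
x_q <- as.numeric(level_means[as.character(km$cluster)])

table(round(x_q, 3))
```

Unlike equal-width or equal-frequency bins, the levels here adapt to where the values concentrate, since k-means places the cluster means to reduce within-cluster squared error.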
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations.

    data(iris)
    obj <- smoothing_quantization(n = 2)
    obj <- fit(obj, iris$Sepal.Length)
    sl.bi <- transform(obj, iris$Sepal.Length)
    table(sl.bi)
    obj$interval
    bins <- cut(iris$Sepal.Length, unique(obj$interval.adj), FALSE,
                include.lowest = TRUE)
    entro <- evaluate(obj, bins, iris$Species)
    entro$entropy

Partition a dataset into training and test sets using a sampling strategy.

    train_test(obj, data, perc = 0.8, ...)

| Argument | Description |
|---|---|
| obj | an object of a class that supports the |
| data | dataset to be partitioned |
| perc | a numeric value between 0 and 1 specifying the proportion of data to be used for training |
| ... | additional optional arguments passed to specific methods. |

returns a list with two elements:
train: A data frame containing the training set
test: A data frame containing the test set

    # using random sampling
    sample <- sample_random()
    tt <- train_test(sample, iris)
    # distribution of train
    table(tt$train$Species)

Splits a dataset into training and test sets based on k-fold cross-validation. The function takes a list of data partitions (folds) and a specified fold index k. It returns the data corresponding to the k-th fold as the test set, and combines all other folds to form the training set.

    train_test_from_folds(folds, k)

| Argument | Description |
|---|---|
| folds | data partitioned into folds |
| k | index of the fold used as the test set; all remaining folds form the training set |

returns a list with two elements:
train: A data frame containing the combined data from all folds except the k-th fold, used as the training set.
test: A data frame corresponding to the k-th fold, used as the test set.
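The fold-to-split logic described above can be sketched in base R. The folds are built here with split() on a random fold assignment, which is an assumption for illustration; k_fold may construct them differently.

```r
set.seed(1)
data(iris)

# Hypothetical fold construction: 5 random folds of equal size
fold_id <- sample(rep(1:5, length.out = nrow(iris)))
folds <- split(iris, fold_id)

k <- 1
test  <- folds[[k]]                            # the k-th fold is the test set
train <- do.call(rbind, unname(folds[-k]))     # all other folds, stacked

c(train = nrow(train), test = nrow(test))
```

Repeating this for k = 1, ..., 5 yields the five train/test pairs of a 5-fold cross-validation, with every row used exactly once for testing.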

    # Create k-fold partitions of a dataset (e.g., iris)
    folds <- k_fold(sample_random(), iris, k = 5)
    # Use the first fold as the test set and combine the remaining folds
    # for the training set
    train_test_split <- train_test_from_folds(folds, k = 1)
    # Display the training set
    head(train_test_split$train)
    # Display the test set
    head(train_test_split$test)

Generic to apply a transformation to data.

    transform(obj, ...)

| Argument | Description |
|---|---|
| obj | a |
| ... | optional arguments. |

returns the transformed data.

    # See ?minmax for an example of transformation

Standardize numeric columns to zero mean and unit variance, optionally rescaled to a target mean (nmean) and sd (nsd).

    zscore(nmean = 0, nsd = 1)

| Argument | Description |
|---|---|
| nmean | new mean for normalized data |
| nsd | new standard deviation for normalized data |

For each numeric column j, computes ((x - mean_j)/sd_j) * nsd + nmean. Constant columns become nmean.
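The formula above, together with its inverse, can be written directly in base R. This is a sketch for a single numeric vector; the helpers zscore_col and zscore_inv are hypothetical names, not part of the package.

```r
# Standardize, then rescale to the target mean and standard deviation
zscore_col <- function(x, nmean = 0, nsd = 1) {
  s <- sd(x)
  if (is.na(s) || s == 0) {
    return(rep(nmean, length(x)))  # constant column: every value becomes nmean
  }
  (x - mean(x)) / s * nsd + nmean
}

# Undo the transformation given the original column mean and sd
zscore_inv <- function(z, col_mean, col_sd, nmean = 0, nsd = 1) {
  (z - nmean) / nsd * col_sd + col_mean
}

data(iris)
x <- iris$Sepal.Length
z <- zscore_col(x, nmean = 10, nsd = 2)
x_back <- zscore_inv(z, mean(x), sd(x), nmean = 10, nsd = 2)

c(mean = mean(z), sd = sd(z))
```

The inverse needs the mean and standard deviation of the original column, which is why the transformation object is fitted before transform() and inverse_transform() are applied.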
returns the z-score transformation object
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Standardization)

    data(iris)
    head(iris)
    trans <- zscore()
    trans <- fit(trans, iris)
    tiris <- transform(trans, iris)
    head(tiris)
    itiris <- inverse_transform(trans, tiris)
    head(itiris)
