% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/modelizer.R
\name{modelizer}
\alias{modelizer}
\title{Generate a classification model}
\usage{
modelizer(dfm, cores_outer, cores_grid, cores_inner, cores_feats, seed,
outer_k, inner_k, model, class_type, opt_measure, country, grid)
}
\arguments{
\item{dfm}{A quanteda dfm used to train and evaluate the model; its docvars should contain the vector with class labels}
\item{cores_outer}{Number of cores to use for outer CV (cannot be more than the number of outer folds)}
\item{cores_grid}{Number of cores to use for grid search (cannot be more than the number of grid rows, i.e. possible parameter combinations; multiplies with cores_outer)}
\item{cores_inner}{Number of cores to use for inner CV loop (cannot be more than the number of inner CV folds; multiplies with cores_outer and cores_grid)}
\item{cores_feats}{Number of cores to use for feature selection (multiplies with cores_outer, cores_grid and cores_inner)}
\item{seed}{Integer to use as seed for random number generation, ensures replicability}
\item{outer_k}{Number of outer cross-validation folds (for performance estimation)}
\item{inner_k}{Number of inner cross-validation folds (for hyperparameter optimization and feature selection)}
\item{model}{Classification algorithm to use (currently only "nb" for Naïve Bayes using textmodel_nb)}
\item{class_type}{Type of classification to model ("junk", "aggregate", or "codes")}
\item{opt_measure}{Label of measure in confusion matrix to use as performance indicator}
\item{country}{Two-letter country abbreviation of the country the model is estimated for (used for filename)}
\item{grid}{Data frame providing all possible combinations of hyperparameters and feature selection parameters for a given model (grid search)}
}
\value{
An .RData file in the current working directory (getwd()) containing the final model, performance estimates and the parameters used for grid search and cross-validation
}
\description{
Generate a nested cross-validated classification model based on a dfm with class labels as docvars.
Currently only supports Naïve Bayes using quanteda's textmodel_nb.

Hyperparameter optimization is enabled through the grid parameter.
A grid should be generated with the crossing() command from vectors named after the parameter labels described for each model.
For Naïve Bayes, the following parameters can be used:
\itemize{
  \item percentiles (cutoff point for tf-idf feature selection)
  \item measures (which measure to use for determining feature importance, see textstat_keyness for options)
}
}
\examples{
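# A minimal sketch of building the grid argument for Naive Bayes.
# Assumptions (not confirmed by this documentation): crossing() is taken to be
# tidyr::crossing(), and the percentile values below are purely illustrative;
# check which scale modelizer() expects before using them.
# "chi2" and "lr" are two of the measures accepted by quanteda's textstat_keyness.
grid <- tidyr::crossing(
  percentiles = c(80, 90, 99),
  measures = c("chi2", "lr")
)
# grid now holds one row per parameter combination (3 x 2 = 6 rows),
# which is also the upper bound for cores_grid.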
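# Assumes dfm, class_type, opt_measure, country and grid already exist in the workspace.
# "nb" is currently the only supported model.
modelizer(dfm,
          cores_outer = 1, cores_grid = 1, cores_inner = 1, cores_feats = 1,
          seed = 42, outer_k = 3, inner_k = 5,
          model = "nb", class_type = class_type, opt_measure = opt_measure,
          country = country, grid = grid)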
}