Running a Model
The core model is an object called ccount.core.CorrelatedModel
. Each of the types of models you can fit are explained in the models documentation.
If you are new to using this code base, please read the following sections about model specification,
fitting the model, and creating predictions.
We have recently added a ModelRun
class that combines all of the steps in the aforementioned sections to limit the amount
of code that you need to write. It also allows you to use the data bootstrap method to produce confidence intervals for your predictions. See here for documentation.
Specifying the Model
To specify a model, make sure that all variables that you need are contained within one data frame, including covariates, random effects, offsets, and weights. The outcome variables should be wide, but all other variables should be long, e.g. have one column for deaths and another column for cases, rather than one column with deaths and cases stacked on top of each other. All of these variables need to be filled in for every entry (i.e. there can be no missing values -- if you want to predict for somewhere with missing outcome information, you can do that later with the function that makes predictions).
The class that you can use to initialize a model is ccount.run.ModelRun
.
This class makes it easy to fit the model and to predict for a new data frame. It takes the following arguments:
model_type
:(str)
The name of the model type to be fit. The list of names are in models.df
:(pd.DataFrame)
A data frame that contains all of your variables.outcome_variables
:(List[str])
A list of the names of the outcome variables.fixed_effects
:(List[List[List[str] or None]])
Nested lists of the names of fixed effects to put on each of the parameters and outcomes. See here for details. If you do not want to put any covariates on a parameter-outcome combination, passNone
instead ofList[str]
. Note: Intercepts are automatically added for each parameter-outcome. Do not add another intercept.random_effect
:(str)
Name of the variable that specifies the grouping of the random effect.spline
:(List[List[List[dict] or None]])
Nested lists of dictionaries that gives the spline specificationoffset
:List[str]
(optional) List of variable names to use as an offset for each parameter. Must be of the length of the number of parameters in each model, and in the correct order. See here for the number and order of parameters for each model type. To include an offset on only one parameter, pass a list of the variable name andNone
, in the correct order corresponding to your model type.weight
:(str)
(optional) Name of the variable that specifies the weight to place on each observation.**kwargs
: Additional argumentsnormalize_X
:(bool)
Whether or not to scale the covariates by their mean and standard deviation. By default,normalize_X = True
. The resulting parameters are transformed after fitting so that they can be interpreted in the original space as the covariates.add_intercepts
:(bool)
Whether or not to add intercepts for all parameter-outcomes. By default,add_intercepts = True
.
Spline Specification
Fitting a spline requires additional information. For each spline that you want to add, you pass a dictionary rather than a string (like you do for fixed effects). The dictionary must have the following keys:
name
:(str)
variable name in your dataset for the splineknots_type
:(str)
one of"frequency"
(put the knots at equal quantiles of the data) or"domain"
(put the knots at equal spacings over the domain of the variable)knots_num
:(int)
number of knots for the splinedegree
:(int)
degree of differentiation for the spline (e.g. degree of 3 is a cubic spline, most common for this application)r_linear
:(bool)
enforce linear function (rather than a potentially degree > 1 spline) at the tails of your data on the rightl_linear
:(bool)
enforce linear function (rather than a potentially degree > 1 spline) at the tails of your data on the left
Example Specification
Consider a pandas data frame, df
, that has the following columns: deaths
, cases
, median_income
, population
, country
, age
. In my example below, I will fit a Zero-Inflated Poisson model (described here) with the following:
Outcomes
- predict both deaths due to and cases of a disease, where deaths and cases are correlated with one another (2-dimensional outcome)
Fixed Effects
- median income as a predictor for the mean of deaths
- age as a predictor for the mean of cases
- no predictors for the probability of a structural zero
Splines - a spline over calendar year with 3 degrees of freedom (cubic), and 3 knots (two at the min and max of the variable, one at the median), no linearity in tails - only applied to the mean, but applied to both cases and deaths
Let's specify the spline here, and in the example we will apply the spline specification to the mean function for both outcomes.
spline_specs = {
'name': 'year',
'knots_type': 'frequency',
'knots_num': 3,
'degree': 3,
'r_linear': False,
'l_linear': False
}
Random Effect
- random effect grouping by country, so that all data points within the same country will have the same 2-dimensional realization that correlates their deaths with their cases
Offset
- population as the "offset" for the mean, or the denominator that deaths and cases came from
from ccount.run import convert_df_to_model
model = convert_df_to_model(
model_type='zero_inflated_poisson',
df=df,
outcome_variables=['deaths', 'cases'],
fixed_effects=[[None, None], [['median_income'], ['age']]],
random_effect='country',
offset=[None, 'population'],
spline=[[None, None],
[[spline_specs], [spline_specs]]
]
)
Fitting the Model
Once you have the model object, returned by the function model = convert_df_to_model(...)
, we can estimate the parameters. The class method ccount.core.CorrelatedModel.optimize_params
does the optimization work.
From our example above, to fit the model, we simply run:
model.optimize_params(max_iters=100, max_beta_iters=10, max_U_iters=10, rel_tol=1e-4)
The parameters that are being optimized are the fixed effects and random effects. The optimization happens iteratively,
and you may not need to completely optimize the fixed and random effects in each overall iteration. To control this process, you
may pass in any integer for max_iters
, max_beta_iters
, and max_U_iters
. You can also
include a relative tolerance for the total error in the fixed and random effects with the argument rel_tol
, and the
program will terminate if it reaches rel_tol
before completing max_iters
iterations. We recommend the defaults above for all
of these arguments.
The parameter estimates, including the fixed effects, the random effects, and the correlation between the outcomes given by (each described in methods) are all available in the summarize
class method for ccount.core.CorrelatedModel
. In our example above, to get a printed summary of the estimates (both transformed and un-transformed based on the link functions described in the model choices), run the following:
model.summarize(file=...)
If you want to save the model summary to a file so that you can access it later, pass file=...
and it will re-route the output to whatever file name (must have a .txt
extension) that you pass.
If you do not pass a file, i.e. file=None
, then it will return the summary in your session.
Creating Predictions
In order to get the fitted values of deaths and cases, you can use the function ccount.run.get_predictions_from_df
. It takes the same arguments as convert_df_to_model
, except that in place of model_type
, it needs a ccount.core.CorrelatedModel
object, and it does not need the outcome variables. For our example above, this looks like
predictions
is a 2-dimensional array that is the length of the data frame some_df
along one axis and the outcome predictions along the other (i.e. predictions for deaths and cases).
The fixed_effects
, random_effect
, offset
(if applicable), and weight
(if applicable) arguments must be identical to those used in the initial model specification.
Note that the data frame passed to get_predictions_from_df
need not be the data frame that was used in model fitting. It can be a new data frame with missing outcome variables because they are not used in making predictions since the model has already been fit. However, this new data frame cannot have any missing values for random effects, fixed effects, or offsets. If it includes random effect grouping levels that were not observed in the fitting of the model, the random effect will be 0 for that level.
Easy Model Launching
To run a model with less code, you can use the following class that takes all of the same arguments
as convert_df_to_model
(see here), optimize_params
(see here)
with some exceptions:
from ccount.run import ModelRun
model = ModelRun(training_df, prediction_df, bootstraps=None, **kwargs)
training_df
:(pd.DataFrame)
your dataset that you want to train the model on (previously calleddf
inconvert_df_to_model
)prediction_df
:(pd.DataFrame)
your dataset that you want to make predictions for (previously calleddf
inget_predictions_from_df
)- NEW:
bootstraps
:(optional int)
number of data bootstraps to perform for uncertainty **kwargs
: all arguments that were previously passed toconvert_df_to_model
andoptimize_params
When you initialize the ModelRun
, it will set up your main model and then each of the models that will
be used for the data bootstraps (see here for a description of the data bootstrap method).
To fit the model, use the function ModelRun.run()
. If you are using the boostrap functionality,
you can fit the model for each bootstrapped dataset in parallel with multiprocessing pools
by passing a number of pools to run
, like
model.run(pools=5)
To make predictions, use the function ModelRun.predict(alpha=0.05)
, where
alpha
corresponds to a 1 - alpha
confidence level (e.g. 95% confidence interval). There is no need to pass
additional arguments to the run()
and predict()
functions because all information about the optimization and
prediction data frame was passed in to the ModelRun
**kwargs**
.
Bootstrap Data
If you pass individual-record data, you can use the built-in bootstrap function for ModelRun
.
However, if you have grouped data, you should use the helper function ccount.processing.resample_data
to resample your data based on the observed probability of having the outcome. The data
needs to be structured with some id columns, a size column representing the count,
and an outcome column of 1's and 0's corresponding to that count. The
resample_data
function will return a list of data frames as long
as the number of resamples you requested. An example is below.
import pandas as pd
import numpy as np
from ccount.processing import resample_data
df = pd.DataFrame({
'group': np.repeat([1, 2, 3, 4, 5], repeats=2),
'outcome': np.tile([0, 1], reps=5),
'size': np.random.randint(low=10, high=20, size=10)
})
data_frames = resample_data(df=df, outcome_col='outcome', id_cols=['group'], size_col='size', num_samples=10)
You can then pass this new data_frames
list of data frames to the ModelRun
function
as the boostrap_dfs
argument in the init.