This function pre-processes the data for the application of a ReSurv model.

IndividualDataPP(
  data,
  id = NULL,
  continuous_features = NULL,
  categorical_features = NULL,
  accident_period,
  calendar_period,
  input_time_granularity = "months",
  output_time_granularity = "quarters",
  years = NULL,
  calendar_period_extrapolation = FALSE,
  continuous_features_spline = NULL,
  degrees_cf = 3,
  degrees_of_freedom_cf = 4,
  degrees_cp = 3,
  degrees_of_freedom_cp = 4
)

Arguments

data

data.frame, the individual data used for reserving. The number of development periods can be larger than the number of accident periods.

id

character, name of the column in data that contains the policy identifier. If NULL (default), each row is treated as one observation. Each observation is assumed to have a single reporting time; if id is not NULL, the reporting time of the first row for each id is used.

continuous_features

character, names of the continuous feature columns to be scaled.

categorical_features

character, names of the categorical feature columns to be one-hot encoded.
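As a rough illustration of what scaling and one-hot encoding mean for these two arguments, here is a minimal base R sketch on a toy data frame. The column names `age` and `claim_type` are hypothetical, and min-max scaling is only an assumption: the text above does not state which scaling method is used, nor whether a reference level is dropped in the encoding.

```r
# Toy data; the column names are hypothetical.
toy <- data.frame(
  age = c(25, 40, 55),
  claim_type = factor(c("A", "B", "A"))
)

# One-hot encoding: one 0/1 indicator column per level of the factor.
# Removing the intercept (- 1) keeps a column for every level.
one_hot <- model.matrix(~ claim_type - 1, data = toy)

# Min-max scaling to [0, 1]. This choice is an assumption made for
# illustration; the documentation only says the features are "scaled".
age_scaled <- (toy$age - min(toy$age)) / (max(toy$age) - min(toy$age))

print(one_hot)
print(age_scaled)
```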

accident_period

character, name of the column in data corresponding to the accident period.

calendar_period

character, name of the column in data corresponding to the calendar period.

input_time_granularity

character, time unit of the input data. Supported granularities:

  • "days": the input data are daily.

  • "months": the input data are monthly.

  • "quarters": the input data are quarterly.

  • "years": the input data are yearly.

Defaults to "months".

output_time_granularity

character, time unit of the output data. The granularity supported is the same as for the input data:

  • "days": the output data will be on a daily scale.

  • "months": the output data will be on a monthly scale.

  • "quarters": the output data will be on a quarterly scale.

  • "years": the output data will be on a yearly scale.

The output granularity must be coarser than the input granularity. It must also be consistent with the input granularity, meaning that the time conversion must be possible. E.g., it is possible to group quarters to years. It is not possible to group quarters to semesters. Defaults to "quarters".
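One way to read the two conditions above is sketched below in base R. The helper `can_convert` and the lookup vector `periods_per_year` are hypothetical names introduced for illustration and are not part of the package; the check assumes that a conversion is "possible" when a whole number of input periods fits into one output period.

```r
# Periods per year for each supported granularity. Days use a 360-day year,
# matching the 1/360 year fraction given in the Details section.
periods_per_year <- c(days = 360, months = 12, quarters = 4, years = 1)

# Hypothetical helper: TRUE when `output` is coarser than `input` and a
# whole number of input periods fits into one output period.
can_convert <- function(input, output) {
  n_in <- periods_per_year[[input]]
  n_out <- periods_per_year[[output]]
  n_out < n_in && n_in %% n_out == 0
}

print(can_convert("months", "quarters"))  # months can be grouped into quarters
print(can_convert("quarters", "months"))  # quarters cannot be split into months
```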

years

numeric, number of development years in the study. If NULL (default), it is computed automatically from the data.

calendar_period_extrapolation

logical, whether a spline for calendar extrapolation should be considered in the Cox model fit. Default is `FALSE`.

continuous_features_spline

logical, whether a spline for smoothing continuous features should be added.

degrees_cf

numeric, degree of the spline for smoothing continuous features.

degrees_of_freedom_cf

numeric, degrees of freedom of the splines for smoothing continuous features.

degrees_cp

numeric, degree of the spline for smoothing the calendar period effect.

degrees_of_freedom_cp

numeric, degrees of freedom of the splines for smoothing the calendar period effect.

Value

IndividualDataPP object. A list containing:

  • full.data: data.frame. The input data after pre-processing.

  • starting.data: data.frame. The input data as provided by the user.

  • training.data: data.frame. The input data pre-processed for training.

  • conversion_factor: numeric. The conversion factor for going from input granularity to output granularity. E.g., the conversion factor for going from months to quarters is 1/3.

  • string_formula_i: character. The survival formula to model the data in input granularity.

  • string_formula_o: character. The survival formula to model the data in output granularity.

  • continuous_features: character. The continuous feature names as provided by the user.

  • categorical_features: character. The categorical feature names as provided by the user.

  • calendar_period_extrapolation: logical. The value specifying if a calendar period component is extrapolated.

  • years: numeric. Total number of development years in the data. If the argument was left as NULL (the default), it is computed automatically from the data.

  • accident_period: character. Accident period column name.

  • calendar_period: character. Calendar period column name.

  • input_time_granularity: character. Input time granularity.

  • output_time_granularity: character. Output time granularity.

After pre-processing, the time components in training.data and full.data follow a standard encoding. In the ReSurv notation:

  • AP_i: Input granularity accident period.

  • AP_o: Output granularity accident period.

  • DP_i: Input granularity development period in forward time.

  • DP_rev_i: Input granularity development period in reverse time.

  • DP_rev_o: Output granularity development period in reverse time.

  • TR_i: Input granularity truncation time.

  • TR_o: Output granularity truncation time.

  • I: Event indicator. Under this framework it is equal to one for each entry.

Details

The input accident_period is coded as AP_i. The input development periods are derived as DP_i = calendar_period - accident_period + 1.

The development periods in reverse time are DP_rev_i = DP_max - DP_i, where DP_max is the maximum number of development periods, so that DP_i \(=1,\ldots,\)DP_max. DP_max is derived internally by the package from the parameter years.

The truncation time is TR_i = AP_i - 1.
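The input-granularity encoding described so far can be reproduced by hand on a toy example. The sketch below uses base R only and applies the formulas exactly as written in this section. The column names `AP` and `RP` and the value of `dp_max` are assumptions chosen for illustration; the package derives DP_max internally.

```r
# Toy input: accident period and calendar period, both in input granularity.
# Column names are hypothetical.
toy <- data.frame(
  AP = c(1, 1, 2, 3),
  RP = c(1, 3, 2, 4)
)

# Assumed maximum number of development periods for this illustration.
dp_max <- 4

toy$AP_i <- toy$AP                  # accident period, input granularity
toy$DP_i <- toy$RP - toy$AP + 1     # development period in forward time
toy$DP_rev_i <- dp_max - toy$DP_i   # development period in reverse time
toy$TR_i <- toy$AP_i - 1            # truncation time

print(toy)
```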

AP_i, DP_i, DP_rev_i and TR_i are converted to AP_o, DP_o, DP_rev_o and TR_o (from the input_time_granularity to the output_time_granularity) using a multiplicative conversion factor. E.g., AP_o = AP_i * \(CF\).

The conversion factor is computed as

\(CF=\frac{{\nu}^i}{{\nu}^o}\),

where \({\nu}^i\) and \({\nu}^o\) are the fractions of a year corresponding to input_time_granularity and output_time_granularity. \({\nu}^i\) and \({\nu}^o\) take values 1/360, 1/12, 1/4, 1/2, 1 for "days", "months", "quarters", "semesters", "years" respectively. We then have RP_o = AP_o + DP_o.
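A small numerical check may help here. The sketch below takes the conversion factor as the ratio of the input year fraction to the output year fraction, which reproduces the value 1/3 quoted for months to quarters in the Value section. The names `year_fraction` and `conversion_factor` are hypothetical helpers, not package functions, and the example applies only the plain multiplication described above: how the package maps the result to whole output periods is not specified in this text.

```r
# Year fractions for each granularity, as listed in the text.
year_fraction <- c(days = 1 / 360, months = 1 / 12, quarters = 1 / 4,
                   semesters = 1 / 2, years = 1)

# Hypothetical helper: ratio of input to output year fraction.
conversion_factor <- function(input, output) {
  year_fraction[[input]] / year_fraction[[output]]
}

cf <- conversion_factor("months", "quarters")

# Multiplicative conversion of monthly accident periods to the quarterly scale.
ap_i <- c(3, 6, 7)
ap_o <- ap_i * cf

print(cf)
print(ap_o)
```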

References

Hiabu, M., Hofman, E., & Pittarello, G. (2023). A machine learning approach based on survival analysis for IBNR frequencies in non-life reserving. arXiv preprint arXiv:2312.14549.