Automated Approach to PK and PKPD Dataset Assembly


This repository contains several R functions to support the assembly of PK(PD) datasets to be used in NONMEM. The functions will create a PK(PD) dataset with consistent column names and covariate labels. Additional functions are used to support covariate analysis, combine datasets to form a population dataset, and create dataset definition files to support regulatory submissions.

Instructions for Use

Download the Repository onto your Computer

The latest CRAN release can be installed with the following commands:


The current development package can be downloaded from GitHub with the following commands:


Function summary


pk_build() creates a PK(PD) dataset for analysis in NONMEM from source data. The functions is not intended to produce datasets for NCA.
The function automatically maps CDISC terminology to a uniform variable name (apmx name) appropriate for pharmacometric analysis.
The function issues a variety of warnings and errors to inform the user of problematic subjects and records.

General comments:
* The ex and pc domain may accept standard CDISC attribute names or apmx names. apmx names are required for pd events.
* DTIM (the date/time of the record) must be in ISO-8601 format to be processed correctly. All date/times assumed to be UTC. Accepted forms:
+ YYYY:mm:dd HH:MM:SS
+ YYYY:mm:ddTHH:MM
+ YYYY:mm:dd HH:MM
+ YYYY:mm:dd (this format is not accepted for ex, pc, or pd events)
* All covariates are automatically renamed based on type and categorical covariates are automatically mapped to a numeric type in the following manner:
+ All character-type covariates are considered categorical. The given covariate will be mapped to a numeric value and the column name will start with a prefix “N” (subject-level) or “T” (time-varying). The character description is retained and the column name will end with the suffix “C”. For example, an input subject-level covariate “SEX” will mapped to “NSEX” (numeric) and “NSEXC” (character).
+ All numeric-type covariates are considered continuous. The covariate column name will start with a prefix “B” (baseline) or “T” (time-varying).
+ All numeric-type covariates must also have an associated character column for units. For example, an input dataframe with covariate “AGE” must have an accompanying column “AGEU”.
* The study label STUDY must be provided in either the ex domain or sl.cov domain.
* Missing date/times can be handled three different ways with the impute parameter.
+ impute can be left empty, which will not impute times for any event missing DTIM.
+ impute = 1 will set all actual time variables equal to the nominal time for events where DTIM is missing. This method is appropriate for pre-clinical analysis when actual times may not be collected.
+ impute = 2 will estimate actual time variables relative to other events occurring that day for events where DTIM is missing. This method is appropriate for phase I-III analyses when individual events are missing a date/time.

apmx attribute names and definitions:
* USUBJID: unique subject ID (character or numeric)
* DTIM: date-time of the event (character)
* NDAY: nominal day of event, derived from protocol-defined visit (numeric) + Day of first dose should start at nominal (study) day 1, not day 0.
+ Day prior to first dose should be documented as day -1, not day 0.
* TPT: nominal timepoint of event, derived from protocol-defined timepoint (numeric)
* VISIT: visit label (character)
* TPTC: timepoint label (character)
* ROUTE: dose route description (character)
* FRQ: dose frequency description (character)
* AMT: administered dose amount for dose events only (numeric)
* CMT: compartment assignemt for each event (numeric)
* DVID: dependent variable label (character)
+ dose events should have the same label as the parent PK metabolite
* DVIDU: dependent variable unit label (character)
+ dose events should have the dose units listed in this variable
* ODV: original dependent variable (numeric)
* LLOQ: lower limit of quantification (numeric)
* STUDY: study label (character)

apmx derived attribute names and definitions:
* SUBJID: numeric form of USUBJID
* ID: ID counting variable
* ATFD: actual time since first dose
* ATLD: actual time since last (most recent) dose
* NTFD: nominal time since first dose
* NTLC: nominal time since last cycle
* NTLD: nominal time since last (most recent) dose
* EVID: event ID (NONMEM-required)
* MDV: missing dependent variable (NONMEM-required) * DVID: numeric mapping of input DVID
* LDV: log-transformed dependent variable
* BDV: baseline dependent variable (for PD events only)
* DDV: delta from baseline dependent variable (for PD events only)
* PDV: percent change from baseline dependent variable (for PD events only)
* BLQ: flag for BLQ records
+ BLQ = 0 when observation is not BLQ
+ BLQ = 1 when observation is BLQ and prior to first dose
+ BLQ = 2 when observation is BLQ and after first dose
* DOSEA: most recently administered dose amount
* DOMAIN: event domain * DVIDC: character label for DVID
* TIMEU: units for all time variables
* FDOSE: date/time of first dose
* VERSN: apmx package version number * BUILD: date of dataset construction
* COMBD: date of dataset combination (generated from apmx::pk_combine() only)

apmx derived binary flag names and definitions:
* PDOSEF: records that occur prior to first dose
* TIMEF: records with no TIME information
* AMTF: dose events with missing AMT
* DUPF: duplicated events within the same ID-ATFD-EVID-CMT
* NOEXF: subjects with no dose events
* NODVNF: subjects with no observations in compartment n (one flag per observation compartment)
* SDF: subjects who are single-dose (as opposed to multi-dose)
* PLBOF: records where the most recent dose is placebo
* SPARSEF: records associated with sparse sampling (as opposed to serial)
* TREXF: dose records that trail the final observation record
* IMPEX: time of last (most recent) dose event was imputed
* IMPDV: time of observation event was imputed
* IMPFEX: time of first dose event was imputed
* C: comment flag to indicate a record will be ignored in the analysis


pk_combine() will combine two PK(PD) datasets built by pk_build() to form a population dataset.
All categorical covariates are re-calculated to ensure they are consistently labeled throughout the population analysis.
The function confirms the analytes and compartments are in agreement between both studies.


cov_find() will identify the columns in a PK(PD) dataset that belong to a certain covariate class.
Covariates can be “categorical”, “continuous”, “exposure”, “empirical bayes estimate”, or “other”.
Types can be “numeric” or “character”.


cov_apply() will add additional covariates to a PK(PD) dataset already built by pk_build().
It can apply covariates of any type, either subject-level or time-varying.
Covariates can be merged by any ID variable (USUBJID, SUBJID, or ID) or any time variable (ATFD, ATLD, NTFD, NTLC, NTLD, NDAY, TPT).
The same prefix and suffix system is applied to covariates built with cov_apply()
* Exposure metrics can be added with this function and will receive the prefix “C”
* Empirical bayes estimates can be added with this functoin and will receive the prefix “I”


pk_write() will write out a PK(PD) dataset as a .csv file to the filepath of your choice in a NONMEM-ready format.


pk_define() will create a definition file for a PK(PD) dataset built by pk_build(). It can be produced as a data frame in R or exported as a Word document. The function requres a dataset and a list of variable definitions. For more information on this list, refer to variable_list_create().

Other arguments: * file: filepath to a .docx file to specify the file name and location of exported Word document * project: character string to specify project name. This name will appear in the header of the definition file if the template contains the word “Project” in the header.
* data: character string to specify dataset name. This name will appear in the header of the definition file if the template contains the word “Dataset” in the header.
* variable.list: data frame of variable definitions you wish to use. The variable list should have the following columns in this order:
+ Variable: variable name (covariates just need the root term for proper definitions. For example, the variable for covariates “NSEX” and “NSEXC” only need to be listed once as “SEX”)
+ Category: desired variable category
+ Description: desired variable description (covariates are automatically detected as “subject-level” or “time-varying” and labeled as such, you only need to provide the root definition. For example, the description for “SEX” can be listed “sex”. For NSEX, the definition file would read “Subject sex”.)
+ Comment: desired comment
* template: optional filepath to template .docx document you wish to use. The definition table will append to the end of the document. If you leave the template blank, the definition table will read into a blank document, and project and data parameters will be ignored.


pk_summarize() will create summary tables (BLQ, categorical covariates, and continuous covariates) of the dataset created by pk_build() or pk_combine().
Summary tables can be exported as .csv files, .docx files, and/or .pptx files. They can be stratified by any variable in the dataset. You can filter the dataset prior to summary statistics.


version_log() will create a version log of datasets created by pk_build() or pk_combine(). The first time you call the function, it will create an initial entry. You can export the version log as a Word document. Then, after you create a new, updated dataset, you can call the version_log() again. If you provide the filepath to the current version log, the new dataset, and the old dataset, the version log will add the new dataset and provide a brief summary comparison of the two.


variable_list_create() is a helper function for pk_define() that creates a standard variable list. The standard apmx variable names and definitions are already included. You can add your own variables (custom covariates, etc.) so they are included in the definition file.

Future Development

In addition to bug fixes and runtime improvements, future functions will focus on the following areas:
* Assembly of datasets to support QTC-prolongation analysis
* Assembly of datasets to support TTE and other ER analyses
* Additional dataset QC and documentation tools
* Auxiliary functions to support dataset assembly

Authors and Acknowledgements in Alphabetical Order

Mildred Afoumbom
Joyceline Afumbom
Stephen Amori
Ethan DellaMaestra
Michael Dick
Ekiti Ekote
Daniel Litow
Jonah Lyon