Data Preparation

Beginner

Data Prep

Prepare raw distance-sampling data for analysis in Rdistance

Author

Trent McDonald

Published

December 12, 2025

Modified

December 11, 2025

Raw distance-sampling data needs to be prepped into a specialized data frame prior to analysis in Rdistance. Raw transect and detection data should be collected into data frames that Rdistance calls “RdistDf” data frames. RdistDf data frames contain transect and detection information in a single data structure, as well as other information necessary to estimate a distance function. Savvy analysts can prepare these specialized data frames on their own (see ?RdistDf), which might be efficient if data sets are related and analyses are run via scripts. Otherwise, most analysts will use Rdistance’s built-in helper function RdistDf. This tutorial walks users through constructing an RdistDf data frame using RdistDf.

Overview

The function, Rdistance::RdistDf(), makes an RdistDf data frame from separate transect and detection data frames. RdistDf data frames are nested data frames that contain one row per transect. Detections made on each transect, whether line or point, appear in a list-based column that itself contains a data frame. In addition to transect and detection information, RdistDf data frames contain other bits of information necessary to carry out a distance analysis, such as transect type, number of detectors, transect ID’s, etc.

The next two sections discuss the separate transect and detection data frames that feed into RdistDf(). These separate data frames are the ones that analysts must construct on their own. The third section shows how to call RdistDf() and describes the resulting data frame.

Transect Information

Transect information consists of two mandatory pieces of information, and one optional piece of information. The two mandatory pieces of information are the transect ID’s and the amount of effort each transect represents. The optional piece of information consists of transect-level covariates that may or may not be used during analysis. The transect data frame is an ordinary R data frame with one row per transect and columns that hold the mandatory variables (transect ID, and effort) and optional transect-level covariates. A transect data frame containing 5 transects might look like this:

Listing 1: An example transect data frame.

siteDf

   area transect length observer
1 North        1    100      AAA
2 North        2    200      AAB
3 South        1    150      AAA
4 South        2    150      AAB
5 South        3    225      AAB

Unique transects are identified by unique combinations of area and transect. Variable length is each transect’s length in meters. Variable observer contains the observer’s initials (a transect-level covariate).
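For readers who want to reproduce Listing 1, the following is a minimal base R sketch that builds the same data frame by hand. In practice the values would be read from field data files; they are typed in directly here only for illustration.

```r
# Build the example transect data frame of Listing 1 using base R only.
# One row per transect; 'area' and 'transect' together form the transect ID.
siteDf <- data.frame(
    area     = c("North", "North", "South", "South", "South")
  , transect = c(1, 2, 1, 2, 3)
  , length   = c(100, 200, 150, 150, 225)   # meters, no units attached yet
  , observer = c("AAA", "AAB", "AAA", "AAB", "AAB")
)
siteDf
```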

Transect ID’s

Each row in the transect data frame should represent one independent transect. It is important for rows to be independent sampling units (i.e., transects) because rows of this data frame will later be re-sampled during bootstrapping and will form the basis of replication for confidence interval estimates.

At least one column in the transect data frame must uniquely identify rows. Rdistance does not use the data frame’s row names. Rows can be represented by unique combinations of multiple variables (like in Listing 1, columns area and transect).
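Before going further, it can be worth confirming that the chosen ID columns really do identify rows uniquely. The check below is a small base R sketch, not an Rdistance routine; the vector name idCols is an assumption made for this example.

```r
# Example transect data frame (same values as Listing 1)
siteDf <- data.frame(
    area     = c("North", "North", "South", "South", "South")
  , transect = c(1, 2, 1, 2, 3)
  , length   = c(100, 200, 150, 150, 225)
  , observer = c("AAA", "AAB", "AAA", "AAB", "AAB")
)

# Columns that are supposed to identify transects (hypothetical name)
idCols <- c("area", "transect")

# anyDuplicated() returns 0 when no combination of ID values repeats
nDup <- anyDuplicated(siteDf[, idCols])
stopifnot(nDup == 0)

# 'area' alone is not a valid ID because its values repeat
areaRepeats <- anyDuplicated(siteDf$area) > 0
areaRepeats   # TRUE
```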

Effort

The sampling effort represented by each transect should be represented in a single variable. If the transects are line-transects¹, the effort column must be length of the transect and measurement units (such as meters or kilometers or miles) must be assigned. If transects are point-transects², the effort column must contain the number of points on the transect. Point-transect effort values must be integers and cannot have measurement units assigned.

Effort for line-transects is the 1-D length of each transect. Effort for point-transects is the number of points on the transect.

The following code assigns meters to the length column of the example transect data frame.

siteDf$length <- siteDf$length %m% .
siteDf

   area transect  length observer
1 North        1 100 [m]      AAA
2 North        2 200 [m]      AAB
3 South        1 150 [m]      AAA
4 South        2 150 [m]      AAB
5 South        3 225 [m]      AAB
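For comparison, a point-transect effort column holds counts rather than lengths. The sketch below uses a hypothetical data frame, pointSiteDf, with made-up values. The column name numPoints is the default point-transect effort name described later in this tutorial, and storing the counts with R's integer type is an assumption about what "must be integers" requires.

```r
# Hypothetical point-transect data frame (values invented for illustration)
pointSiteDf <- data.frame(
    transect  = c("A", "B", "C")
  , numPoints = c(4L, 6L, 5L)   # number of points per transect, no units
)

# Effort is an integer count and carries no measurement units
stopifnot(is.integer(pointSiteDf$numPoints))
stopifnot(!inherits(pointSiteDf$numPoints, "units"))

totalPoints <- sum(pointSiteDf$numPoints)
totalPoints   # 15
```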

Measurement Units

Rdistance requires physical measurement units on all variables that represent geographic space, such as distances and areas. Requiring measurement units does the following:

It relieves users from unit conversion, and associated conversion errors;
It ensures computations involving physical units are valid and conversions are correct (e.g., [m] + [hectares] is invalid, and [m] + [ft] is correct after first converting [ft] to [m]);
It ensures output is reported correctly.

In Rdistance physical measurement units can be assigned to any variable using one of three methods: the %#% operator, the setUnits() function, or using one of the direct assignment operators like %m%. Examples:

(x <- 5000 %#% "m")

5000 [m]

(x <- 5000 %m% .)

5000 [m]

(x <- setUnits(5000, "m"))

5000 [m]

# Once assigned, conversion is automatic
(y <- x %#% "km")

5 [km]

(y <- x %ft% .)

16404.2 [ft]

(y <- setUnits(x, "nmile"))

2.699784 [nmile]

units(y) <- "furlong"
y

24.8548 [furlong]

See ?unitHelpers for more information on setting units in Rdistance. See ?units::set_units for more information on setting units outside Rdistance. Run units::valid_udunits() for a list of recognized units.
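The validity checks mentioned in the list above can be seen directly with the units package, without any Rdistance helpers. The following sketch assumes the units package is installed; it adds a length in feet to a length in meters (allowed, with automatic conversion) and then tries to add an area to a length (not allowed).

```r
library(units)

# Two lengths in different units
a <- set_units(100, "m",  mode = "standard")
b <- set_units(10,  "ft", mode = "standard")

# Compatible units: 'b' is converted to meters before the addition
total <- a + b
total   # 103.048 [m]

# Incompatible units: a length plus an area is an error
h   <- set_units(1, "hectare", mode = "standard")
bad <- try(a + h, silent = TRUE)
inherits(bad, "try-error")   # TRUE
```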

Shout-out to the units package! The units package is truly amazing and useful in many contexts.

Detection Information

Detection information consists of two mandatory, and two optional, pieces of information. The mandatory pieces of information consist of the transect ID’s and observation distance of each detected target. The two optional pieces of information consist of group size and detection-level covariates.

Like transect information, detection information should be gathered into an ordinary data frame containing one row per detection. Variables in this data frame must contain the mandatory transect ID on which the detection was made, and the observation distance, either off-transect or radial off-point distances. If groups containing \(>1\) individual are the detection targets (e.g., flocks, herds, etc.), group size should appear in one column. Any number of detection-level covariates can appear in the detection data frame. A detection data frame might look like this:

Listing 2: An example detection data frame.

detectDf

   area transect distance grpSize sex
1 North        2     20.0       1   M
2 South        1     30.0       1   F
3 South        1     15.0       2   F
4 South        1     25.5       5   F
5 South        3     30.0       1   M
6 South        3     17.5       3   F

In this example, area and transect are columns of the compound transect ID, distance is the perpendicular distance from the survey target’s initial location to the transect, grpSize is the number of survey targets sighted at that particular distance, and sex is the sex of the survey target (a detection-level covariate).

In the detection data frame, distances must have measurement units.

detectDf$distance <- detectDf$distance %m% .
detectDf

   area transect distance grpSize sex
1 North        2 20.0 [m]       1   M
2 South        1 30.0 [m]       1   F
3 South        1 15.0 [m]       2   F
4 South        1 25.5 [m]       5   F
5 South        3 30.0 [m]       1   M
6 South        3 17.5 [m]       3   F

‘Zero’ Transects

In Rdistance, all detections have an associated transect, but not all transects have associated detections. That is, the transect data frame must contain one row for every sampled transect, even those with no detections. ‘Zero’ transects are transects without detections.

In the example detection data (Listing 2), no targets were detected on transects (North,1) or (South,2). The ID’s for these transects should not appear in the detection data frame, but should appear in the transect data frame (Listing 1).
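A quick way to see which transects are 'zero' transects, and to confirm that every detection belongs to a known transect, is to compare ID values between the two data frames. The sketch below uses base R only; the helper transKey is a hypothetical convenience function written for this example and is not part of Rdistance.

```r
# Example data frames (same values as Listings 1 and 2, without units)
siteDf <- data.frame(
    area     = c("North", "North", "South", "South", "South")
  , transect = c(1, 2, 1, 2, 3)
  , length   = c(100, 200, 150, 150, 225)
  , observer = c("AAA", "AAB", "AAA", "AAB", "AAB")
)
detectDf <- data.frame(
    area     = c("North", "South", "South", "South", "South", "South")
  , transect = c(2, 1, 1, 1, 3, 3)
  , distance = c(20, 30, 15, 25.5, 30, 17.5)
  , grpSize  = c(1, 1, 2, 5, 1, 3)
  , sex      = c("M", "F", "F", "F", "M", "F")
)

# Collapse the compound ID into one string per row (hypothetical helper)
transKey <- function(df) paste(df$area, df$transect, sep = ",")

# Transects with no detections: (North,1) and (South,2)
zeroDf <- siteDf[!(transKey(siteDf) %in% transKey(detectDf)), c("area", "transect")]
zeroDf

# Detections whose transect is missing from siteDf; should be none
orphanDf <- detectDf[!(transKey(detectDf) %in% transKey(siteDf)), ]
nrow(orphanDf)   # 0
```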

Making the RdistDf

The function RdistDf merges the transect and detection information, makes the nested data frame, and stores other information for later use. The default values for parameters of function RdistDf assume the following:

Transects are line-transects (parameter pointSurvey = FALSE).
Transect and detection data frames should be merged using their common columns (parameter by).
The effort column is named length. If not, effort is assumed to be the first variable whose name contains length (case insensitive) (e.g., transLength or line_LENGTH; but not transLen or len).
The observer system is single-observer (parameter observer = "single").
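The name-matching rule for the effort column can be mimicked with a case-insensitive pattern match. The sketch below only reproduces the documented behavior using base R's grepl(); it is not Rdistance's internal code, and the candidate column names are the examples given in this tutorial.

```r
# Candidate effort column names for line-transects
lineNames <- c("transLength", "line_LENGTH", "transLen", "len")
lineMatch <- grepl("length", lineNames, ignore.case = TRUE)
setNames(lineMatch, lineNames)
# transLength line_LENGTH    transLen         len
#        TRUE        TRUE       FALSE       FALSE

# Candidate effort column names for point-transects
pointNames <- c("nPoints", "N_point", "numberOfPoints", "numPts", "pts")
pointMatch <- grepl("point", pointNames, ignore.case = TRUE)
setNames(pointMatch, pointNames)
#        nPoints        N_point numberOfPoints         numPts            pts
#           TRUE           TRUE           TRUE          FALSE          FALSE
```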

Because the default assumptions are acceptable for our example, the call to RdistDf that makes our example RdistDf data frame is,

Listing 3: An example RdistDf nested data frame.

distDf <- RdistDf( 
    transectDf = siteDf
  , detectionDf = detectDf
  )
distDf

# A tibble: 5 × 5
# Rowwise:  area, transect
  area  transect         detections length observer
  <chr>    <dbl> <list<tibble[,3]>>    [m] <chr>   
1 North        2            [1 × 3]    200 AAB     
2 South        1            [3 × 3]    150 AAA     
3 South        3            [2 × 3]    225 AAB     
4 North        1                       100 AAA     
5 South        2                       150 AAB

In our example RdistDf data frame distDf, area and transect are the compound transect ID, detections is a variable containing the nested data frames (technically, nested tibbles) that in turn contain detections (and group size and detection-level covariates) associated with the transect represented on that row, length is the length of the transect, and observer is a transect-level covariate. All variables except detections are repeats of the original transect data frame. The detections column shows 1 detection on transect (North,2), 3 on (South,1), 2 on (South,3), and no detections on (North,1) or (South,2). These counts are the first number in the “[a × b]” string printed for column detections and represent the number of rows in the nested data frame. No nested data frame exists for ‘zero’ transects.

Function RdistDf first nested all detections on the same transect into data frames, then found the common variables between transect and detection data frames (area and transect in our example), and merged the data frames together using a 1-to-many (i.e., “left”) join.
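The nest-then-join idea can be sketched in base R to show what the result represents. This is a conceptual illustration only, under the assumption that a plain list column is an adequate stand-in for nested tibbles; it is not how RdistDf is implemented, and the names transKey, nested, and nestedDf are invented for the example.

```r
# Example data frames (same values as Listings 1 and 2, without units)
siteDf <- data.frame(
    area     = c("North", "North", "South", "South", "South")
  , transect = c(1, 2, 1, 2, 3)
  , length   = c(100, 200, 150, 150, 225)
  , observer = c("AAA", "AAB", "AAA", "AAB", "AAB")
)
detectDf <- data.frame(
    area     = c("North", "South", "South", "South", "South", "South")
  , transect = c(2, 1, 1, 1, 3, 3)
  , distance = c(20, 30, 15, 25.5, 30, 17.5)
  , grpSize  = c(1, 1, 2, 5, 1, 3)
  , sex      = c("M", "F", "F", "F", "M", "F")
)

# Collapse the compound ID into one string per row (hypothetical helper)
transKey <- function(df) paste(df$area, df$transect, sep = ",")

# Step 1: nest detections, giving one small data frame per transect
nested <- split(detectDf[, c("distance", "grpSize", "sex")], transKey(detectDf))

# Step 2: left join onto transects; zero transects receive NULL
nestedDf <- siteDf
nestedDf$detections <- unname(nested[transKey(siteDf)])

# Number of detections per transect, in the row order of siteDf
nPerTransect <- vapply(
    nestedDf$detections
  , function(d) if (is.null(d)) 0L else nrow(d)
  , integer(1)
)
nPerTransect   # 0 1 3 0 2
```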

If Default Parameters Don’t Work

In some cases, the default parameters of RdistDf will not work. Analysts will then need to override the defaults and supply the correct information. The parameters of RdistDf, beyond the transect and detection data frames, are as follows:

pointSurvey: Either TRUE or FALSE. TRUE if observations were made on point-transect surveys, in which case distances are radial from observation point to target and the effort column cannot have measurement units. FALSE if observations were made along line-transects where distances are from target to nearest point on the transect (i.e., perpendicular to transect) and the effort column must contain measurement units.
observer: The type of observer system. The observer system is either “single” for single observer systems, or “1given2”, or “2given1”, or “both” for double observer systems.
.detectionCol: The desired name of the list column that will contain detection data frames. The default name is “detections”.
.effortCol: Name of the effort column in the transect information data frame (i.e., length in Listing 1). The default is “length” for line-transects, and “numPoints” for point-transects. If those names are not found, the first column in the transect data frame whose name contains ‘point’ (for point transects) or ‘length’ (for line transects) is used and a message is printed. Matching is case insensitive, so for example, ‘nPoints’ and ‘N_point’ and ‘numberOfPoints’ will all be matched, but ‘numPts’ or ‘pts’ will not. If two or more column names match the effort column search terms, a warning is issued.
by: A character vector of variable names to use when joining the transect and detection data frames. The left-hand side of the transect-detection data frame join identifies unique transects (unique rows) in the transect data frame, and the join is 1-to-many. If NULL, the transect-detection join will be ‘natural’, which uses all common variables in the transect and detection data frames. To join on specific variables, specify a character vector. For example, by = c("a", "b") joins the transect variable a to detection variable a and transect variable b to detection variable b. If join variable names differ between the transect and detection data frames, by is a named character vector like by = c("a" = "b", "c" = "d") which joins transect variable a to detection variable b and transect variable c to detection variable d.
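The meaning of a named by vector may be easier to see through an analogous base R join. The sketch below uses merge() with by.x and by.y on two small hypothetical data frames whose ID columns have different names (site and siteID). It illustrates the 1-to-many left-join semantics only; it makes no claim about how RdistDf performs the join internally.

```r
# Hypothetical data frames whose ID columns are named differently
transDf <- data.frame(
    site   = c("N1", "N2", "S1")
  , length = c(100, 200, 150)
)
obsDf <- data.frame(
    siteID   = c("N2", "S1", "S1")
  , distance = c(20, 30, 15)
)

# Analogous to by = c("site" = "siteID"): transect 'site' joins to
# detection 'siteID'. all.x = TRUE keeps transects without detections.
joined <- merge(transDf, obsDf, by.x = "site", by.y = "siteID", all.x = TRUE)
joined

nrow(joined)                  # 4: one row for N1, one for N2, two for S1
sum(is.na(joined$distance))   # 1: the zero transect N1
```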

A call to RdistDf, which is equivalent to the one in Listing 3, but with all parameters specified, is,

distDf <- RdistDf( 
    transectDf = siteDf
  , detectionDf = detectDf
  , pointSurvey = FALSE
  , by = c("area", "transect")
  , observer = "single"
  , .detectionCol = "detections"
  , .effortCol = "length"
  )
distDf

# A tibble: 5 × 5
# Rowwise:  area, transect
  area  transect         detections length observer
  <chr>    <dbl> <list<tibble[,3]>>    [m] <chr>   
1 North        2            [1 × 3]    200 AAB     
2 South        1            [3 × 3]    150 AAA     
3 South        3            [2 × 3]    225 AAB     
4 North        1                       100 AAA     
5 South        2                       150 AAB

RdistDf Data Frames

RdistDf data frames are technically rowwise tibbles (a form of grouped tibble) with a list column containing additional tibbles (tibbles are generalizations of base data frames, but behave much like regular data frames). Survey type, observer system, and the names of the detection and effort columns are recorded as attributes. The print method for RdistDf data frames comes from the tibble package, as seen in Listing 3.

Summary Method

The summary method for RdistDf data frames prints transect type, number of transects, and total length.

summary(distDf)

Transect type: line
Effort:
       Transects: 5       
    Total length: 825 [m] 
Specify 'formula', 'w.lo', and 'w.hi' to obtain distances, groups, and individuals.

Additional information relevant to distance analyses is printed if a formula (and potentially w.lo and w.hi) argument is specified. The formula argument specifies the distance column on the left and group sizes on the right (see ?dfuncEstim for more details on how to specify a distance function formula).

summary(distDf
        , formula = distance ~ groupsize(grpSize)
        )

Transect type: line
Effort:
       Transects: 5       
    Total length: 825 [m] 
Distances:
   0 [m] to 30 [m]: 6
Sightings:
         Groups: 6 
    Individuals: 13
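The figures in this summary can be checked by hand against Listings 1 and 2. The base R sketch below recomputes them from plain numeric vectors typed in from those listings; the vector names are chosen for this example only.

```r
# Values copied from Listings 1 and 2
transLengths <- c(100, 200, 150, 150, 225)        # meters
distances    <- c(20, 30, 15, 25.5, 30, 17.5)     # meters
grpSize      <- c(1, 1, 2, 5, 1, 3)

totalLength  <- sum(transLengths)   # 825
nInStrip     <- sum(distances >= 0 & distances <= 30)   # 6 distances in 0 to 30 m
nGroups      <- length(grpSize)     # 6
nIndividuals <- sum(grpSize)        # 13

c(totalLength = totalLength, nInStrip = nInStrip,
  nGroups = nGroups, nIndividuals = nIndividuals)
```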

Tester Function

Function is.RdistDf checks whether a data frame is a proper RdistDf data frame. If parameter verbose is TRUE, is.RdistDf will print the reason why a data frame fails to be an RdistDf data frame (verbose = TRUE prints nothing if the input is a valid RdistDf).

is.RdistDf(distDf)

[1] TRUE

is.RdistDf(siteDf, verbose = TRUE) # not a full RdistDf

siteDf must have a 'detectionColumn' attribute naming a list-based column that contains detection information. Assign attributes with statements like attr(siteDf,'detectionColumn') <- <list column> See help('RdistDf').

[1] FALSE

Expansion Methods

At times, analysts will want to inspect the unnested data frame. The unnested data frame is a regular R data frame containing one row per detection, and hence multiple rows per transect.

There are two ways to un-nest depending on whether the analyst wants to see zero transects or not. The unnest function in Rdistance includes the zero transects with NA for detections. The unnest function in the tidyr package does not include zero transects.

unnest(distDf)  # when Rdistance is attached and distDf is a RdistDf

# A tibble: 8 × 7
  area  transect length observer distance grpSize sex  
  <chr>    <dbl>    [m] <chr>         [m]   <dbl> <chr>
1 North        1    100 AAA          NA        NA <NA> 
2 North        2    200 AAB          20         1 M    
3 South        1    150 AAA          30         1 F    
4 South        1    150 AAA          15         2 F    
5 South        1    150 AAA          25.5       5 F    
6 South        2    150 AAB          NA        NA <NA> 
7 South        3    225 AAB          30         1 M    
8 South        3    225 AAB          17.5       3 F

tidyr::unnest(distDf, cols = "detections")  # no zero transects

# A tibble: 6 × 7
# Groups:   area, transect [3]
  area  transect distance grpSize sex   length observer
  <chr>    <dbl>      [m]   <dbl> <chr>    [m] <chr>   
1 North        2     20         1 M        200 AAB     
2 South        1     30         1 F        150 AAA     
3 South        1     15         2 F        150 AAA     
4 South        1     25.5       5 F        150 AAA     
5 South        3     30         1 M        225 AAB     
6 South        3     17.5       3 F        225 AAB

Individual rows can be un-nested as follows:

distDf$detections[2][[1]]  # nested detection data frame on row 2

# A tibble: 3 × 3
  distance grpSize sex  
       [m]   <dbl> <chr>
1     30         1 F    
2     15         2 F    
3     25.5       5 F

# Or, use dplyr::reframe, which preserves the IDs 
distDf |> 
  dplyr::filter(area == "South" & transect == 1) |>
  dplyr::reframe(detections)

# A tibble: 3 × 5
  area  transect distance grpSize sex  
  <chr>    <dbl>      [m]   <dbl> <chr>
1 South        1     30         1 F    
2 South        1     15         2 F    
3 South        1     25.5       5 F

Other Helper Routines

transectType(distDf)

[1] "line"

observationType(distDf)

[1] "single"

Attributes

attributes(distDf)

$class
[1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3 4 5

$names
[1] "area"       "transect"   "detections" "length"     "observer"  

$groups
# A tibble: 5 × 3
  area  transect       .rows
  <chr>    <dbl> <list<int>>
1 North        2         [1]
2 South        1         [1]
3 South        3         [1]
4 North        1         [1]
5 South        2         [1]

$detectionColumn
[1] "detections"

$obsType
[1] "single"

$transType
[1] "line"

$effortColumn
[1] "length"

Counts

The following code constructs a data frame containing the number of detections on each transect.

# cannot use dplyr::if_else here
nDetections <- distDf |> 
  dplyr::reframe(nDetections = ifelse(is.null(detections), 0, nrow(detections)))
nDetections

# A tibble: 5 × 3
  area  transect nDetections
  <chr>    <dbl>       <dbl>
1 North        2           1
2 South        1           3
3 South        3           2
4 North        1           0
5 South        2           0

sum(nDetections$nDetections) # Number of detections

[1] 6

sum(nDetections$nDetections == 0) # Number of zero transects

[1] 2

Footnotes

¹ Line-transects are paths through the study area on which targets can be sighted at any point. Line-transects are often aerial flight lines, roads, paths, etc. and observers continually search for survey targets as they traverse the route.
² Point-transects are paths through the study area consisting of a series of observation points. Each point is a station at which observers search for survey targets, and detections can only occur at these locations. Point-transects may consist of single or multiple points, depending on study design.