QA for merging

Purpose

This page summarizes which screening/demographic data and which home visit data have anomalies or missing values.

Screening/demographic data

screen_df <-
  readr::read_csv(
    file.path(
      here::here(),
      "data",
      "csv",
      "screening",
      "agg",
      "PLAY-screening-datab-latest.csv"
    ),
    col_types = readr::cols(.default = 'c'),
    show_col_types = FALSE
  )

The following rows have incomplete or missing site_id values:

screen_df |>
  dplyr::filter(is.na(site_id)) |>
  dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
  dplyr::arrange(vol_id, site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id vol_id participant_ID session_id play_id group_name
NA 1103 001 44638 NA PLAY_Silver
NA 1481 006 64916 NA PLAY_Silver
NA 1576 003 64939 NA PLAY_Silver
NA 1656 001 70116 NA PLAY_Silver
NA 954 001 39302 NA PLAY_Silver

Volume 1103 is OHIOS. Volume 1481 is CSUFL. Volume 954 is GEORG. The missing values for site_id indicate that there is a bug in the cleaning code.
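One possible way to repair these rows, sketched under the assumption that each vol_id maps to exactly one site_id, is to build a lookup from the rows where site_id is present and use it to fill the gaps. The toy data frame and the names `toy_sites`, `vol_site_lookup`, and `toy_sites_filled` are illustrative only and are not part of the cleaning code.

```r
# Toy stand-in for screen_df; values are copied from the tables on this page.
toy_sites <- tibble::tibble(
  site_id = c("OHIOS", NA, "CSUFL", NA, NA),
  vol_id = c("1103", "1103", "1481", "1481", "1576"),
  participant_ID = c("002", "001", "001", "006", "003")
)

# Lookup of vol_id -> site_id, taken from rows where site_id is present.
vol_site_lookup <- toy_sites |>
  dplyr::filter(!is.na(site_id)) |>
  dplyr::distinct(vol_id, site_id) |>
  dplyr::rename(site_id_lookup = site_id)

# Fill missing site_id from the lookup.
# Volumes that never appear with a site_id (here 1576) stay NA.
toy_sites_filled <- toy_sites |>
  dplyr::left_join(vol_site_lookup, by = "vol_id") |>
  dplyr::mutate(site_id = dplyr::coalesce(site_id, site_id_lookup)) |>
  dplyr::select(-site_id_lookup)
```

This only patches the symptom. If a volume ever maps to more than one site, the join would duplicate rows, so the lookup should be checked for uniqueness before it is used.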

2023-10-20

On closer investigation, the screening data do not show an OHIOS session with participant_ID == ‘001’. There are three with ‘000’ and two with ‘002’.

Similarly, for CSUFL, there is no ‘003’ or ‘006’. We have home visit data for ‘006’.

Similarly, for GEORG, there is a ‘???’, but no ‘001’.

The following rows have incomplete or missing participant_ID values:

screen_df |>
  dplyr::filter(is.na(participant_ID)) |>
  dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
  dplyr::arrange(vol_id, site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id vol_id participant_ID session_id play_id group_name
(no rows)

The following rows have incomplete or missing play_id values:

screen_df |>
  dplyr::filter(is.na(play_id)) |>
  dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
  dplyr::arrange(vol_id, site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id vol_id participant_ID session_id play_id group_name
UCSCR 1066 001 56051 NA PLAY_Gold
UCSCR 1066 002 56073 NA PLAY_Gold
UCSCR 1066 003 56321 NA PLAY_Gold
UCSCR 1066 005 57998 NA PLAY_Gold
UCSCR 1066 009 58358 NA PLAY_Gold
UCSCR 1066 010 58466 NA PLAY_Gold
UCSCR 1066 011 58472 NA PLAY_Gold
UCSCR 1066 012 58477 NA PLAY_Gold
UCSCR 1066 014 58805 NA PLAY_Silver
UCSCR 1066 015 59804 NA PLAY_Gold
UCSCR 1066 016 59805 NA PLAY_Gold
OHIOS 1103 002 56674 NA PLAY_Gold
OHIOS 1103 002 56674 NA PLAY_Gold
OHIOS 1103 005 57182 NA PLAY_Gold
OHIOS 1103 006 58204 NA PLAY_Gold
OHIOS 1103 008 58230 NA PLAY_Gold
OHIOS 1103 009 57371 NA PLAY_Gold
OHIOS 1103 010 57212 NA PLAY_Gold
OHIOS 1103 011 57324 NA PLAY_Gold
OHIOS 1103 012 58231 NA PLAY_Gold
OHIOS 1103 014 58232 NA PLAY_Gold
OHIOS 1103 015 58641 NA PLAY_Gold
OHIOS 1103 016 58315 NA PLAY_Gold
OHIOS 1103 017 58642 NA PLAY_Gold
OHIOS 1103 018 58724 NA PLAY_Gold
OHIOS 1103 019 58725 NA PLAY_Gold
OHIOS 1103 021 58747 NA PLAY_Gold
OHIOS 1103 023 59001 NA PLAY_Gold
OHIOS 1103 024 59029 NA PLAY_Gold
OHIOS 1103 025 59109 NA PLAY_Gold
OHIOS 1103 026 59802 NA PLAY_Gold
OHIOS 1103 027 59806 NA PLAY_Gold
OHIOS 1103 028 59820 NA PLAY_Gold
OHIOS 1103 029 59858 NA PLAY_Gold
OHIOS 1103 030 59966 NA PLAY_Gold
NA 1103 001 44638 NA PLAY_Silver
STANF 1362 001 57209 NA PLAY_Silver
STANF 1362 002 58017 NA PLAY_Gold
PURDU 1363 001 56367 NA PLAY_Silver
PURDU 1363 003 58740 NA PLAY_Gold
PURDU 1363 004 58918 NA PLAY_Gold
PURDU 1363 005 59150 NA PLAY_Silver
PURDU 1363 006 60049 NA PLAY_Silver
PURDU 1363 006 60049 NA PLAY_Silver
CHOPH 1370 001 57863 NA PLAY_Silver
CHOPH 1370 002 60897 NA PLAY_Gold
CSULB 1376 001 56400 NA PLAY_Gold
CSULB 1376 002 56399 NA PLAY_Gold
CSULB 1376 003 57852 NA PLAY_Gold
CSULB 1376 004 58612 NA PLAY_Gold
CSULB 1376 005 59735 NA PLAY_Silver
CSULB 1376 007 57857 NA PLAY_Gold
CSULB 1376 010 60080 NA PLAY_Gold
VBLTU 1391 001 59779 NA PLAY_Silver
VBLTU 1391 003 60236 NA PLAY_Gold
VBLTU 1391 004 60243 NA PLAY_Gold
VBLTU 1391 005 60311 NA PLAY_Gold
UHOUS 1397 001 57374 NA PLAY_Silver
UHOUS 1397 002 57916 NA PLAY_Gold
UHOUS 1397 004 58465 NA PLAY_Gold
UHOUS 1397 005 59144 NA PLAY_Gold
UHOUS 1397 006 60333 NA PLAY_Gold
UHOUS 1397 007 61697 NA PLAY_Gold
UHOUS 1397 008 61748 NA PLAY_Gold
INDNA 1400 001 58458 NA PLAY_Silver
INDNA 1400 002 62176 NA PLAY_Gold
UIOWA 1422 001 57544 NA PLAY_Gold
UIOWA 1422 002 58798 NA PLAY_Gold
UIOWA 1422 003 59206 NA PLAY_Gold
UIOWA 1422 004 59892 NA PLAY_Gold
UIOWA 1422 005 60749 NA PLAY_Silver
CSUFL 1481 001 60393 NA PLAY_Silver
NA 1481 006 64916 NA PLAY_Silver
NA 1576 003 64939 NA PLAY_Silver
NA 1656 001 70116 NA PLAY_Silver
NYUNI 899 001 41534 NA PLAY_Gold
NYUNI 899 001 41534 NA PLAY_Gold
NYUNI 899 002 41800 NA PLAY_Silver
NYUNI 899 002 41800 NA PLAY_Silver
NYUNI 899 003 41455 NA PLAY_Silver
NYUNI 899 003 41455 NA PLAY_Silver
NYUNI 899 003 41455 NA PLAY_Silver
NYUNI 899 004 41535 NA PLAY_Gold
NYUNI 899 005 41608 NA PLAY_Gold
NYUNI 899 005 41608 NA PLAY_Gold
NYUNI 899 006 41808 NA PLAY_Gold
NYUNI 899 006 41808 NA PLAY_Gold
NYUNI 899 007 41894 NA PLAY_Gold
NYUNI 899 007 41894 NA PLAY_Gold
NYUNI 899 013 43207 NA PLAY_Gold
NYUNI 899 014 43530 NA PLAY_Gold
NYUNI 899 017 55842 NA PLAY_Gold
NYUNI 899 018 55863 NA PLAY_Silver
NYUNI 899 020 56064 NA PLAY_Silver
NYUNI 899 021 56065 NA PLAY_Gold
NYUNI 899 022 56103 NA PLAY_Gold
NYUNI 899 023 56104 NA PLAY_Gold
NYUNI 899 024 56311 NA PLAY_Silver
NYUNI 899 026 56417 NA PLAY_Silver
NYUNI 899 028 56526 NA PLAY_Gold
NYUNI 899 029 56571 NA PLAY_Gold
NYUNI 899 031 57373 NA PLAY_Silver
NYUNI 899 032 57410 NA PLAY_Gold
NYUNI 899 035 57894 NA PLAY_Gold
NYUNI 899 037 59428 NA PLAY_Gold
NYUNI 899 227 38196 NA PLAY_Gold
NYUNI 899 228 38197 NA PLAY_Gold
NYUNI 899 229 38215 NA PLAY_Gold
NYUNI 899 230 38236 NA PLAY_Gold
NYUNI 899 231 38485 NA PLAY_Gold
GEORG 954 002 40510 NA PLAY_Silver
GEORG 954 002 40510 NA PLAY_Silver
GEORG 954 003 41417 NA PLAY_Gold
GEORG 954 003 41417 NA PLAY_Gold
GEORG 954 004 41428 NA PLAY_Gold
GEORG 954 005 41873 NA PLAY_Gold
GEORG 954 005 41873 NA PLAY_Gold
GEORG 954 005 41873 NA PLAY_Gold
GEORG 954 008 42127 NA PLAY_Gold
GEORG 954 008 42127 NA PLAY_Gold
GEORG 954 009 42354 NA PLAY_Gold
GEORG 954 009 42354 NA PLAY_Gold
GEORG 954 010 42353 NA PLAY_Gold
GEORG 954 011 42694 NA PLAY_Gold
GEORG 954 012 57864 NA PLAY_Gold
GEORG 954 015 58803 NA PLAY_Gold
GEORG 954 016 59778 NA PLAY_Silver
GEORG 954 021 60059 NA PLAY_Gold
NA 954 001 39302 NA PLAY_Silver
UCRIV 966 001 43624 NA PLAY_Gold
VCOMU 982 002 42292 NA PLAY_Silver
VCOMU 982 003 42940 NA PLAY_Gold
UMIAM 996 003 65077 NA PLAY_Gold

There are duplicate entries for OHIOS 1103 56674 002. The table also shows repeated rows for PURDU (1363), NYUNI (899), and GEORG (954).

We should add duplicate checking to the cleaning code.
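A minimal sketch of such a check follows. It assumes that the combination of site_id, vol_id, participant_ID, and session_id should identify a row uniquely; that key choice is an assumption, and `toy_dups`, `dup_keys`, and `toy_deduped` are illustrative names. The toy rows are copied from the table above.

```r
# Toy stand-in for screen_df with two duplicated keys.
toy_dups <- tibble::tibble(
  site_id = c("OHIOS", "OHIOS", "OHIOS", "PURDU", "PURDU"),
  vol_id = c("1103", "1103", "1103", "1363", "1363"),
  participant_ID = c("002", "002", "005", "006", "006"),
  session_id = c("56674", "56674", "57182", "60049", "60049")
)

# Count rows per key and keep the keys that occur more than once.
dup_keys <- toy_dups |>
  dplyr::count(site_id, vol_id, participant_ID, session_id, name = "n_rows") |>
  dplyr::filter(n_rows > 1)

# Drop exact repeats. Rows that share a key but differ in other columns
# are not removed by distinct() and would need manual review.
toy_deduped <- dplyr::distinct(toy_dups)
```

Reporting `dup_keys` before dropping anything keeps the QA step separate from the fix, which may help when tracing where the duplicates enter the pipeline.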

There is 1 variable with information about exclusion status:

ex_dups <- stringr::str_detect(names(screen_df), "exclusion")
names(screen_df)[ex_dups]
## [1] "exclusion_reason"

When more than one such variable appears, the extra columns probably result from some bug in the *_join operation in the cleaning process. They should be merged.
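If a join does leave behind suffixed copies of the exclusion column, one way to merge them is `dplyr::coalesce()`. The sketch below assumes the copies carry dplyr's default `.x` and `.y` suffixes; the column names `exclusion_reason.x` and `exclusion_reason.y` and the example values are hypothetical and do not come from the screening data.

```r
# Hypothetical result of a join that duplicated the exclusion column.
toy_joined <- tibble::tibble(
  session_id = c("1", "2", "3"),
  exclusion_reason.x = c("language", NA, NA),
  exclusion_reason.y = c(NA, "age", NA)
)

# Take the first non-missing value across the two copies,
# then drop the suffixed columns.
toy_merged <- toy_joined |>
  dplyr::mutate(
    exclusion_reason = dplyr::coalesce(exclusion_reason.x, exclusion_reason.y)
  ) |>
  dplyr::select(-exclusion_reason.x, -exclusion_reason.y)
```

Note that `coalesce()` silently prefers the first column when both copies hold a value, so rows where the two copies disagree should be inspected before merging.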

Home visit data

targets::tar_load(home_visit_df, store=file.path(here::here(), "_targets"))

The home_visit_df data have some field names that are inconsistent with the screening/demographic data files. We reconcile these differences first.

home_df <- home_visit_df |>
    dplyr::rename("play_id" = "participant_id") |>
    dplyr::rename("participant_ID" = "subject_number")

Since the home_visit_df data have not yet been merged with Databrary information, the vol_id and group_name variables are not available.
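As a rough illustration of how vol_id and group_name could later be attached, and of how home visits without a screening match could be listed, the sketch below joins two toy tables. It assumes site_id plus participant_ID is a usable join key, which this page does not confirm; `toy_screen_keys`, `toy_home`, `toy_home_merged`, and `toy_home_unmatched` are illustrative names, and the rows are taken from values mentioned on this page.

```r
# Toy screening rows carrying the Databrary-derived fields.
toy_screen_keys <- tibble::tibble(
  site_id = c("OHIOS", "OHIOS", "CSUFL"),
  participant_ID = c("002", "005", "001"),
  vol_id = c("1103", "1103", "1481"),
  group_name = c("PLAY_Gold", "PLAY_Gold", "PLAY_Silver")
)

# Toy home visit rows; CSUFL 006 has no screening row with a site_id.
toy_home <- tibble::tibble(
  site_id = c("OHIOS", "CSUFL", "CSUFL"),
  participant_ID = c("002", "001", "006")
)

# Attach vol_id and group_name where a screening row matches.
toy_home_merged <- dplyr::left_join(
  toy_home, toy_screen_keys, by = c("site_id", "participant_ID")
)

# List home visits that have no matching screening row.
toy_home_unmatched <- dplyr::anti_join(
  toy_home, toy_screen_keys, by = c("site_id", "participant_ID")
)
```

Because the screening rows with a missing site_id cannot match on this key, an anti-join like this would surface the same participants flagged in the screening section.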

The following rows have incomplete or missing site_id values:

home_df |>
  dplyr::filter(is.na(site_id)) |>
  dplyr::select(site_id, participant_ID, play_id) |>
  dplyr::arrange(site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id participant_ID play_id
(no rows)

The following rows have incomplete or missing participant_ID values:

home_df |>
  dplyr::filter(is.na(participant_ID)) |>
  dplyr::select(site_id, participant_ID, play_id) |>
  dplyr::arrange(site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id participant_ID play_id
(no rows)

The following rows have incomplete or missing play_id values:

home_df |>
  dplyr::filter(is.na(play_id)) |>
  dplyr::select(site_id, participant_ID, play_id) |>
  dplyr::arrange(site_id, participant_ID) |>
  knitr::kable() |>
  kableExtra::kable_classic()
site_id participant_ID play_id
(no rows)