Health data¶
Initially developed for our project with the Robert Wood Johnson Foundation (RWJF), these procedures outline the collection of health-specific data.
Collect NIH¶
Extract all of the NIH World RePORTER data via their static data dump. N_TABS outputs are produced in CSV format (concatenated across all years), where N_TABS corresponds to the number of tabs in the main table found at:
The data is transferred to the Nesta intermediate data bucket.
-
get_data_urls
(tab_index)[source]¶ Get all CSV URLs from the tab_index-th tab of the main table found at TOP_URL.
Parameters: tab_index (int) – Tab number (0-indexed) of table to extract CSV URLs from.
Returns: title (str) – Title of the tab in the table; hrefs (list) – List of URLs pointing to data CSVs.
preprocess_nih¶
Data cleaning / wrangling before ingestion of raw data, specifically:
- Systematically removing generic prefixes using very hard-coded logic.
- Inferring how to correctly deal with mystery question marks, using very hard-coded logic.
- Splitting of strings into arrays, as indicated by JSON in the ORM.
- Converting CAPS to title case for any string field which isn’t VARCHAR(n) with n < 10.
- Dealing consistently with null values.
- Explicit conversion of relevant fields to datetime.
-
get_long_text_cols
[source]¶ Return the column names in the ORM which are of a text type (i.e. TEXT or VARCHAR) and which, if a max length is specified, have max length > 10. The length requirement exists because we don’t want to preprocess ID-like or code fields (e.g. ISO codes).
-
is_nih_null
(value, nulls=('', [], {}, 'N/A', 'Not Required', 'None'))[source]¶ Returns True if the value is listed in the nulls argument, or the value is NaN, null or None.
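As an illustration of the null check described above, here is a minimal sketch that assumes plain Python values (no pandas or NumPy types). The name DEFAULT_NULLS is introduced here for readability and is not part of the documented module; the real implementation may differ.

```python
import math

# Default null markers, copied from the documented signature.
# DEFAULT_NULLS is a name made up for this sketch.
DEFAULT_NULLS = ('', [], {}, 'N/A', 'Not Required', 'None')


def is_nih_null(value, nulls=DEFAULT_NULLS):
    """Return True if `value` is None, NaN, or equal to one of `nulls`."""
    if value is None:
        return True
    # NaN only occurs as a float in this sketch.
    if isinstance(value, float) and math.isnan(value):
        return True
    # Compare by equality rather than hashing, so that unhashable
    # markers such as [] and {} can take part in the check.
    return any(value == null for null in nulls)
```

Comparing by equality keeps the check usable for list and dict values, which could not be looked up in a set.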
-
expand_prefix_list
[source]¶ Expand GENERIC_PREFIXES to include integers, and then a large number of permutations of additional characters, upper case and title case. From tests, this covers 19 out of 20 generic prefixes from either abstract text or the “PHR” field.
-
remove_generic_suffixes
(text)[source]¶ Iteratively remove any of the generic prefixes generated by expand_prefix_list from the front of the text, until none remain.
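The iterative stripping can be sketched as follows. The list EXAMPLE_PREFIXES and the function name remove_generic_prefixes are hypothetical stand-ins chosen for this example: the actual prefixes come from expand_prefix_list, whose contents are not shown here.

```python
# Hypothetical prefixes, standing in for the output of expand_prefix_list().
EXAMPLE_PREFIXES = [
    "DESCRIPTION (provided by applicant):",
    "DESCRIPTION:",
    "Abstract:",
    "1.",
]


def remove_generic_prefixes(text, prefixes=EXAMPLE_PREFIXES):
    """Strip any listed prefix from the front of `text` until none match."""
    stripped = True
    while stripped:
        stripped = False
        text = text.lstrip()
        for prefix in prefixes:
            if prefix and text.startswith(prefix):
                # Drop the prefix and flag that another pass is needed,
                # since a different prefix may now be at the front.
                text = text[len(prefix):]
                stripped = True
    return text.lstrip()
```

The outer loop repeats because removing one prefix can expose another, for example "DESCRIPTION: Abstract: ...".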
-
remove_large_spaces
(text)[source]¶ Iteratively replace any large spaces or tabs with a single space, until none remain.
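A minimal sketch of this behaviour, assuming a "large space" means any run of two or more spaces or tabs (the real function may define it differently):

```python
def remove_large_spaces(text):
    """Collapse runs of spaces and tabs into single spaces."""
    # Treat tabs as spaces so that mixed runs collapse together.
    text = text.replace("\t", " ")
    # Repeat until no double space is left anywhere in the text.
    while "  " in text:
        text = text.replace("  ", " ")
    return text
```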
-
replace_question_with_best_guess
(text)[source]¶ Somewhere in NIH’s backend there is a Unicode processing problem. From inspection, most of the ‘?’ symbols have quite an intuitive origin, and so this function contains the hard-coded logic for inferring which symbol used to be in the place of each ‘?’.
-
remove_trailing_exclamation
(text)[source]¶ A lot of abstracts end with ‘!’ followed by a bunch of spaces; this function removes that trailing sequence.
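One way this could look is sketched below. It assumes that text without a trailing ‘!’ should be returned unchanged, which is a guess rather than documented behaviour.

```python
def remove_trailing_exclamation(text):
    """Drop a trailing '!' (and the whitespace around it), if present."""
    stripped = text.rstrip()
    if stripped.endswith("!"):
        # Remove the exclamation mark(s), then any space left before them.
        return stripped.rstrip("!").rstrip()
    # Assumption: leave text untouched when there is no trailing '!'.
    return text
```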
-
upper_to_title
(text, force_title=False)[source]¶ NIH inconsistently stores some fields in all upper case. Convert these to title case.
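A possible reading of this function is sketched below. It assumes that only fully upper-case text is converted unless force_title is set, and it relies on Python's built-in str.title().

```python
def upper_to_title(text, force_title=False):
    """Convert all-caps text to title case; leave other text alone."""
    # str.isupper() is True only if every cased character is upper case.
    if force_title or text.isupper():
        return text.title()
    return text
```

Note that str.title() capitalises the letter after an apostrophe (so "CHILDREN'S" becomes "Children'S"), which a production version may want to handle separately.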
-
detect_and_split
(value)[source]¶ Split values either by semicolons or commas. If there are more semicolons than commas (+1), then semicolons are used for splitting (this takes into account that NIH name fields are written as ‘last_name, first_name; next_last_name, next_first_name’). Otherwise NIH list fields are delimited by commas.
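The delimiter choice between ‘;’ and ‘,’ can be sketched as follows. The exact threshold is an assumption: here the "(+1)" is read as "the number of semicolons plus one is at least the number of commas", which matches the name-field example but may not match the real implementation for edge cases such as a value containing a single comma.

```python
def detect_and_split(value):
    """Split `value` on ';' or ',' depending on which looks like the delimiter."""
    n_commas = value.count(",")
    n_semicolons = value.count(";")
    # Assumed rule: name fields contain one comma per name and one
    # semicolon between names, so semicolons + 1 >= commas.
    delimiter = ";" if n_semicolons + 1 >= n_commas else ","
    # Trim whitespace and drop empty items left by trailing delimiters.
    return [item.strip() for item in value.split(delimiter) if item.strip()]
```

For example, a name field such as "Smith, Jane; Doe, John" is split into two full names, while "asthma, allergy, immunology" is split into three terms.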
Process NIH¶
Data cleaning and processing procedures for the NIH World RePORTER data. Specifically, a lat/lon is generated for each city/country, and the formatting of date fields is unified.