I went through the IPUMSR package tutorials to understand what functionality the package provides. I put together a simple list so that we can decide which features to include in the initial release and which to leave for subsequent releases.
reading IPUMS data.
IPUMS data generally comes as multiple files, such as a DDI (metadata) file and a data file.
With NHGIS files, a single file contains both the metadata and the data, though
there can be discrepancies between the metadata in the NHGIS file and the corresponding
DDI file. Further, users often download multiple data files and would like to
process them as a batch.
submit an IPUMS extract request using an API key, providing the variable names and dataset ID.
wait for IPUMS processing to complete and indicate when the extract is ready to download.
download IPUMS data from the website.
add the ability to save IPUMS request settings in a YAML or TOML file. This will make it easier to replicate subsequent downloads and debug errors in variable names, etc. (cf. the ipumspy package).
read the microdata extract with DDI (.xml) file -- similar to read_ipums_micro() function.
pretty print a summary of the variables, data types, labels, etc., -- similar to ipums_var_info() function.
get the attributes of each column -- similar to attributes(cps_data$MONTH) function.
The categories correspond to factors/labels.
save/include data quality flags from download.
get labels corresponding to each column -- similar to ipums_val_labels(cps_data$MONTH) function.
Labels are a particular implementation of factors, and are distinct from standard R factors.
read the DDI (.xml) file alone. Often this file contains the value labels and other metadata.
Some of this metadata gets stripped while processing the data.
hierarchical extracts? These mix record types in a single file.
read IPUMS NHGIS extracts (combined DDI and data file).
read/summarize the NHGIS codebook -- similar to read_nhgis_codebook() function. This can differ from the label info in the conventional extract.
handling multiple files, such as multiple NHGIS files in a zipped archive.
The user should be able to filter files for variables, or select individual files from the combined files.
read CSV and fixed-width formats in NHGIS.
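The submit / wait / download items near the top of this list boil down to a polling loop around a saved set of request settings. Below is a minimal sketch of that loop, written in Python only because it is quick to try out; a Julia version could mirror it. Everything here is an assumption for illustration: the `ExtractRequest` fields, the `"completed"` and `"failed"` status strings, and the `check_status` callable (which would wrap whatever API call reports the extract's state) are hypothetical and not taken from any IPUMS client.

```python
import time
from dataclasses import dataclass, field, asdict
from typing import Callable, List


@dataclass
class ExtractRequest:
    """Hypothetical request settings; field names are illustrative only."""
    collection: str
    dataset_id: str
    variables: List[str] = field(default_factory=list)


def wait_until_ready(
    check_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check_status() until it returns "completed".

    Returns the number of polls made. Raises RuntimeError if the status
    is "failed" and TimeoutError if max_polls is exhausted. The status
    strings are assumptions, not a documented vocabulary.
    """
    for attempt in range(1, max_polls + 1):
        status = check_status()
        if status == "completed":
            return attempt
        if status == "failed":
            raise RuntimeError("extract failed on the server side")
        sleep(poll_seconds)  # back off before asking again
    raise TimeoutError(f"extract not ready after {max_polls} polls")
```

Because `ExtractRequest` is a plain dataclass, `asdict(request)` gives a dictionary that could be written to (or rebuilt from) a YAML or TOML settings file. Passing `sleep` as a parameter keeps the loop testable without real waiting.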
read spatial data
read NHGIS shapefile -- similar to read_ipums_sf() function.
choose among shapefiles when the extract contains more than one.
join tabular data to IPUMS spatial data on the GISJOIN variable.
distinguish between harmonized and non-harmonized data. Harmonized data
has been adjusted for changes in geometry over time.
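The GISJOIN item above is essentially a keyed left join of table rows onto spatial features. Here is a minimal sketch of that behavior using plain dictionaries, in Python for convenience. The function name `join_on_gisjoin`, the record layout, and the sample key values are all made up for illustration; a real implementation would work on whatever table and geometry types the package settles on.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def join_on_gisjoin(
    features: List[Row], table: List[Row], key: str = "GISJOIN"
) -> Tuple[List[Row], List[str]]:
    """Left-join tabular rows onto spatial features by a shared key.

    Returns (joined_features, unmatched_keys). Features with no matching
    table row are kept unchanged and their key is reported, so the caller
    can decide whether a mismatch is an error.
    """
    lookup = {row[key]: row for row in table}
    joined: List[Row] = []
    unmatched: List[str] = []
    for feat in features:
        row = lookup.get(feat[key])
        if row is None:
            unmatched.append(feat[key])
            joined.append(dict(feat))
        else:
            # Feature fields win if both sides define the same column.
            joined.append({**row, **feat})
    return joined, unmatched
```

Reporting unmatched keys rather than silently dropping them seems useful here, since a mismatch between tabular and spatial files is a likely source of user confusion.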
IPUMS value labels (categorical variables encoded as numbers) - based on the haven R package.
Note that IPUMS data columns often have variable labels, which are human-readable variable names
as opposed to esoteric column names, for example household_income versus HV001_a. Data extracts may also
contain variable descriptions, which are text descriptions of the contents of a variable.
Finally, extracts may contain value labels, which are categorical encodings like R factors,
e.g. 1 = Excellent, 2 = Very good. The IPUMSR package uses the labelled class (created with labelled()) from the haven package. The data type for a column would be, say, <int+lbl> to indicate that the data has
two equivalent forms: the integer codes and their labels. The Julia design does not need to be identical, but there should be
a way to identify when columns have labelled/categorical data, and to identify the values of those
labels.
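To make the three kinds of metadata concrete, here is a rough sketch of a column type that carries a variable label, a variable description, and value labels alongside the integer codes. It is written in Python for brevity and is not how haven or IPUMSR implement this; the class name `LabelledColumn` and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LabelledColumn:
    """Integer codes plus optional metadata (hypothetical layout)."""
    values: List[int]
    value_labels: Dict[int, str] = field(default_factory=dict)
    var_label: Optional[str] = None   # human-readable variable name
    var_desc: Optional[str] = None    # longer text description

    def is_labelled(self) -> bool:
        """True when at least one value label is defined."""
        return bool(self.value_labels)

    def labels_for_values(self) -> List[Optional[str]]:
        """Label for each stored value, or None where no label exists."""
        return [self.value_labels.get(v) for v in self.values]
```

For example, `LabelledColumn([1, 2, 5], {1: "Excellent", 2: "Very good"})` reports itself as labelled, and the code 5 simply has no label. That partial coverage is the main difference from a factor-style type.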
list the labels that correspond to a column
properties of labeled variables.
they don't require all values to be labelled
they don't require values to be consecutive integers starting at 1
We have to test and make sure computations are done correctly with labelled values versus plain factors.
in R, label info is often stripped by functions. Labels are generally used to initially prepare the data for subsequent processing.
need to be able to define the variables in a dataset as labeled. Also need to be able to remove the labels.
define missing values based upon labels, like lbl_na_if()
lbl_relabel() function to relabel a variable based on criteria.
lbl_clean() to remove unused labels from data view.
lbl_add() to add a new label.
lbl_define() to make a labelled vector out of an unlabelled one.
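The label-manipulation items above can be prototyped as small pure functions over a list of values and a code-to-label mapping. The sketch below is only loosely modelled on the behavior described for lbl_na_if(), lbl_clean(), and lbl_add(); the function names, signatures, and the use of `None` for missing values are assumptions for illustration and do not reproduce the R functions' interfaces.

```python
from typing import Callable, Dict, List, Optional, Tuple

Values = List[Optional[int]]
Labels = Dict[int, str]


def na_if(
    values: Values, labels: Labels, predicate: Callable[[int, str], bool]
) -> Tuple[Values, Labels]:
    """Set values to None when their (value, label) pair matches predicate.

    Labels for the matched values are dropped from the returned mapping.
    """
    drop = {v for v, lab in labels.items() if predicate(v, lab)}
    new_values = [None if v in drop else v for v in values]
    new_labels = {v: lab for v, lab in labels.items() if v not in drop}
    return new_values, new_labels


def clean(values: Values, labels: Labels) -> Labels:
    """Keep only the labels whose value actually occurs in the data."""
    present = set(values)
    return {v: lab for v, lab in labels.items() if v in present}


def add_labels(labels: Labels, new: Labels) -> Labels:
    """Return a copy of labels with new entries added (new entries win)."""
    merged = dict(labels)
    merged.update(new)
    return merged
```

Each function returns new objects rather than mutating its inputs, which should make it easier to test that computations on labelled data give the same results as on the raw codes.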