Abstract
This blog post shows how to extract population estimates reported by the Florida Department of Health and prepare them for analysis, specifically for exploring trends in demographic growth between 2006 and 2020.
Click HERE to view the report in its native environment, the suicide-prevention-2019 repository.
Data Origin
The initial extract of the data was obtained from www.flhealthcharts.com, a reporting tool for population counts estimated by the Florida Department of Health. The figure below shows the modifications to the default query that produces the data product used in this demonstration:
The tool gives the option to save the product of the query as an Excel book (.xls). However, importing this format into R has been problematic, so I manually converted the file (“save as”) into a more modern Excel format, .xlsx. This file is the raw source for the current report and can be downloaded for closer inspection here.
Data Import
The structure of the Excel file requires some tidying to enable a nimble analytic flow.
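As a rough illustration of the import step, the sketch below reads the converted .xlsx file without trusting its header row, so the layout can be inspected before any tidying. It assumes the readxl package is installed. The function name import_raw and the file path are hypothetical, and whether leading rows should be skipped depends on the actual file.

```r
# Minimal import sketch; assumes the readxl package is installed.
# `import_raw` and the path used below are hypothetical names.
import_raw <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # col_names = FALSE keeps the header row as data, so unnamed and
  # merged header cells can be inspected before columns are named.
  readxl::read_excel(path, col_names = FALSE)
}

# Usage (hypothetical path):
# ds_raw <- import_raw("data/florida-population.xlsx")
# head(ds_raw, 10)
```

Reading the header as data is a deliberate choice here: it postpones naming the columns until the merged cells have been examined.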
Data Tweaking
We can identify several problems to address:
- Not all columns have names
- race, ethnicity, sex, and age_group are stored in merged spreadsheet cells
- Sums for categories are recorded in rows as observations (e.g. Total)
- Some values of age_group are misinterpreted as dates (e.g. 1-4 becomes 4-Jan)
- age_group does not bin the population evenly (e.g. 20-24 vs 25-34)
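The first four problems listed above can be sketched in base R on a toy data frame. This is not the report's own code. The labels and counts are made up, the helper names fill_down and repair_age_group are hypothetical, and the sketch assumes that the mangled age groups arrive as text of the form day-month (such as 4-Jan). Only the 1-4 case is confirmed by the source, and the uneven bins are not addressed here.

```r
# Toy stand-in for the imported sheet; all labels and counts are made up.
raw <- data.frame(
  c("Group A", NA, NA, NA, "Group B", NA, NA),
  c("Hispanic", NA, NA, NA, "Hispanic", NA, NA),
  c("Female", NA, NA, NA, "Female", NA, NA),
  c("<1", "4-Jan", "9-May", "Total", "<1", "4-Jan", "Total"),
  c(10, 40, 50, 100, 5, 20, 25),
  stringsAsFactors = FALSE
)
names(raw) <- c("...1", "...2", "...3", "...4", "count")  # placeholder names

# Carry the last non-missing value forward (undoes merged cells).
fill_down <- function(x) {
  filled <- !is.na(x)
  c(NA, x[filled])[cumsum(filled) + 1]
}

# Turn labels such as "4-Jan" back into "1-4" (month number, then day).
repair_age_group <- function(x) {
  pattern <- "^([0-9]{1,2})-([A-Za-z]{3})$"
  mangled <- grepl(pattern, x)
  day <- sub(pattern, "\\1", x[mangled])
  mon <- match(sub(pattern, "\\2", x[mangled]), month.abb)
  x[mangled] <- paste0(mon, "-", day)
  x
}

tidy <- raw
# 1. Give every column a name.
names(tidy)[1:4] <- c("race", "ethnicity", "sex", "age_group")
# 2. Fill the gaps left by merged cells.
for (col in c("race", "ethnicity", "sex", "age_group")) {
  tidy[[col]] <- fill_down(tidy[[col]])
}
# 3. Drop rows that hold category sums rather than observations.
tidy <- tidy[tidy$age_group != "Total", ]
# 4. Repair age groups that were misread as dates.
tidy$age_group <- repair_age_group(tidy$age_group)
rownames(tidy) <- NULL

print(tidy)
```

Dropping the Total rows only after filling down matters: the sums would otherwise lose the race, ethnicity, and sex labels that identify which category they belong to.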
The cleaned version of this dataset is available for download here.
Graphing
We will consider several questions in order to demonstrate plotting from this cleaned dataset.
- How does the total population of Florida change between 2006 and 2020?
- How does each ethnic group change during this time?
- Which ethnic group is most dissimilar from the other three in its dynamics?
- What would it look like if we used the original age_group?
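The summaries behind the first two questions can be sketched with base R before any plotting. The data frame below is a made-up stand-in for the cleaned dataset, with hypothetical column names year, ethnicity, and count and only two placeholder groups. The report's own graphs may be built differently.

```r
# Made-up stand-in for the cleaned dataset (long format, one row per
# year and group); the real data cover 2006 to 2020 and more groups.
ds <- data.frame(
  year      = rep(2006:2008, each = 2),
  ethnicity = rep(c("Group A", "Group B"), times = 3),
  count     = c(100, 300, 110, 305, 125, 310),
  stringsAsFactors = FALSE
)

# Total population per year.
total_by_year <- aggregate(count ~ year, data = ds, FUN = sum)

# Population per year within each group.
by_ethnicity <- aggregate(count ~ year + ethnicity, data = ds, FUN = sum)

# Percent change relative to each group's first year, which puts groups
# of very different sizes on a comparable scale.
by_ethnicity$pct_change <- ave(
  by_ethnicity$count, by_ethnicity$ethnicity,
  FUN = function(x) 100 * (x / x[1] - 1)
)

print(total_by_year)
print(by_ethnicity)

# Draw the total trend only in an interactive session.
if (interactive()) {
  plot(total_by_year$year, total_by_year$count, type = "b",
       xlab = "Year", ylab = "Population")
}
```

Expressing growth as percent change from the first year is one way to spot the group whose dynamics differ most from the others, since raw counts mostly reflect group size.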