Bayesian Data Analysis Global South (GSU) 2023
The project work involves choosing a data set and carrying out a complete analysis that covers all the parts of the Bayesian workflow studied during the course. In this course instance there are no project presentations, but you will get feedback from your peers. You can do the project work in a group if you like.
In this course instance the project work can be done alone or in a group of up to 4 people; you are not required to find a group.
If you don’t have a group, you can ask other students in the group chat channel #project. Say what kind of data you are interested in (e.g. medicine, health, biology, engineering, politics, business), whether you prefer R or Python, and whether you already have a more concrete idea for the topic.
In this course instance the evaluation of the project work consists only of
In the project report you practice presenting the problem and the data analysis results, so a bare listing of code and figures is not a good report. A data analysis project can be reported at different levels of detail. This report should be more than a summary of results with the workflow steps left out: describe the steps and the decisions made during the workflow. To keep the report readable, some of the diagnostic outputs and code can be placed in an appendix. If you are uncertain whether your level of detail is appropriate, you can ask the TAs in the TA sessions.
The report should not be over 20 pages and should include
Because some data sets have been overused for this purpose, the following ones are forbidden in this project work (more may be added to this list, so check it regularly):
It’s best to use a data set for which there is no ready-made analysis on the internet. If you choose a data set that has already been used in an online case study, provide links to the previous studies and report how your analysis differs from them (for example, someone has made a non-Bayesian analysis and you do the full Bayesian analysis).
Depending on the model and the structure of the data, a good data set would have more than 100 observations but fewer than 1 million. If you know an interesting big data set, you can use a smaller subset of the data to keep the computation times feasible. Ideally the data has some grouping structure, so that it is sensible to use multilevel/hierarchical models.
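As an illustration of taking a smaller subset while keeping the structure needed for a hierarchical model, here is a minimal Python sketch. It assumes the data is a list of dictionaries with one grouping column; the function name subsample_by_group and its parameters are hypothetical and not part of the course material. It samples whole groups first and then caps the number of observations per group, so that every retained group still has several observations.

```python
import random
from collections import defaultdict


def subsample_by_group(rows, group_key, n_groups, max_per_group, seed=0):
    """Return a random subset of rows that keeps the grouping structure.

    rows          : list of dicts, one dict per observation (assumed layout)
    group_key     : name of the column that identifies the group
    n_groups      : number of groups to keep
    max_per_group : upper limit on observations kept from each group
    seed          : seed for reproducibility
    """
    rng = random.Random(seed)

    # Collect the observations belonging to each group.
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)

    # Pick the groups first, so that no group is left with a single stray row.
    group_ids = sorted(by_group)
    chosen = rng.sample(group_ids, min(n_groups, len(group_ids)))

    # Within each chosen group, keep at most max_per_group observations.
    subset = []
    for g in chosen:
        members = by_group[g]
        subset.extend(rng.sample(members, min(max_per_group, len(members))))
    return subset
```

Whether to subsample groups, observations within groups, or both depends on the question being asked; report the choice in the data description so that readers know what population the results refer to.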
If you’re looking for inspiration or you’re not sure where to begin, browse this list of datasets arranged by topic, the datasets mentioned in the lecture slides (see slide 6), or some of these publicly accessible databases:
b ~ normal(mu, sigma),
mu ~ normal(0, 1),
sigma ~ exponential(1).
uniform(a, b) should not be used unless the boundaries are really logical boundaries and values beyond the boundaries are completely impossible.
rstanarm can be used, but you need to report the priors used (including the priors rstanarm assigns by default).
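To see what the hierarchical prior written above (b ~ normal(mu, sigma), mu ~ normal(0, 1), sigma ~ exponential(1)) implies before any data is seen, one option is to simulate from it. The following is a minimal sketch using only the Python standard library. The function names, the number of groups, and the number of draws are assumptions made for illustration. It reads normal(mu, sigma) as mean and standard deviation and exponential(1) as rate 1, which is the Stan convention.

```python
import random
import statistics


def draw_hierarchical_prior(n_groups, rng):
    """Draw one joint sample (mu, sigma, b) from the hierarchical prior."""
    mu = rng.gauss(0.0, 1.0)        # mu ~ normal(0, 1)
    sigma = rng.expovariate(1.0)    # sigma ~ exponential(1), rate 1
    # Group-level parameters share the same population mean and scale.
    b = [rng.gauss(mu, sigma) for _ in range(n_groups)]
    return mu, sigma, b


def simulate(n_draws=2000, n_groups=5, seed=0):
    """Return a list of n_draws joint prior samples."""
    rng = random.Random(seed)
    return [draw_hierarchical_prior(n_groups, rng) for _ in range(n_draws)]


if __name__ == "__main__":
    draws = simulate()
    all_b = [value for _, _, b in draws for value in b]
    # With n=20 the first and last cut points are the 5% and 95% quantiles.
    cuts = statistics.quantiles(all_b, n=20)
    print("5% and 95% prior quantiles of b:", cuts[0], cuts[-1])
```

If the simulated range of b looks implausible for the scale of your data, that is a reason to revisit the prior and to justify the final choice in the report.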
The following case study examples demonstrate how text, equations, figures, code, and inference results can be included in one report. These examples don’t necessarily contain all the workflow steps required in your report, but different steps are illustrated in different case studies, and you can get good ideas for your report just by browsing through them.