Project work involves choosing a data set and performing a whole analysis according to all the parts of Bayesian workflow studied along the course.

- The project work is meant to be done in period II.
- In the beginning of the period II
- Form a group. We prefer groups of 3, but the project can be done in groups of 1-2.
- Select a topic. You may ask in the course chat channel #project for opinion whether it’s a good topic and a good dataset. You can change the topic later.
- Start planning.

- The main work for the project and the presentation will be done in the second half of the period II after all the workflow parts have been discussed in the course.
- The in person presentations will be made on the evaluation week after period II.
- All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.

- Form a group and pick a topic. Register the group before end of 8th Nov, 2022.
- Groups of 3 can reserve a presentation slot starting 9th Nov, 2022.
- Groups of 1-2 can reserve a presentation slot starting 10th Nov, 2022.
- Groups that register late can reserve a presentation slot starting 11th Nov, 2022.
- Work on the project. TA session queue is also for project questions.
- Project report deadline 4.12.2022. Submit in peergrade (separate “class”, the link will be added to MyCourses).
- Project report peer grading 4. - 8.12.2022 (so that you’ll get feedback for the report before the presentations).
- Project presentations 12.-16.12.2022.

Project work is done in groups of 1-3 persons. Preferred group size is 3, because you learn more when you talk about the project with someone else.

If you don’t have a group, you can ask other students in the group chat channel **#project**. Tell what kind of data you are interested in (e.g. medicine, health, biological, engineering, political, business), whether you prefer R or Python, and whether you have already more concrete idea for the topic.

Groups of 3 students can choose their presentation time slot before 1-2 student groups. 3 person group is expected to do a bit more work than 1-2 person groups.

You can do the project alone, but the amount of work is expected to the same for 2 person groups.

The groups will get help for the project work in TA sessions. When there are no weekly assignments, the TA sessions are still organized for helping in the project work.

The project work’s evaluation consists of

- peergraded project report (40%) (within peergrade submission 80% and feedback 20%)
- presentation and oral exam graded by the course staff (60%)
- clarity of slides + use of figures
- clarity of oral presentation + flow of the presentation
- all required parts included (not necessarily all in main slides, but it needs to be clear that all required steps were performed)
- accuracy of use of terms (oral exam)
- responses to questions (oral exam)

In the project report you practice presenting the problem and data analysis results, which means that minimal listing of code and figures is not a good report. There are different levels for how data analysis project could be reported. This report should be more than a summary of results without workflow steps. While describing the steps and decisions made during the workflow, to keep the report readable some of the **diagnostic outputs and code** can be put in the **appendix**. If you are uncertain you can ask TAs in TA sessions whether you are on a good level of amount of details.

The report **should not be over 20 pages** and should include

- Introduction describing
- the motivation
- the problem
- and the main modeling idea.
- Showing some illustrative figure is recommended.

- Description of the data and the analysis problem. Provide information where the data was obtained, and if it has been previously used in some online case study and how your analysis differs from the existing analyses.
- Description of at least two models, for example:
- non hierarchical and hierarchical,
- linear and non linear,
- variable selection with many models.

- Informative or weakly informative priors, and justification of their choices.
- Stan, rstanarm or brms code.
- How the Stan model was run, that is, what options were used. This is also more clear as combination of textual explanation and the actual code line.
- Convergence diagnostics (\(\widehat{R}\), ESS, divergences) and what was done if the convergence was not good with the first try.
__This should be reported for all models.__ - Posterior predictive checks and what was done to improve the model.
__This should be reported for all models.__ **Optional/Bonus**: Predictive performance assessment if applicable (e.g. classification accuracy) and evaluation of practical usefulness of the accuracy. This should be reported for all models as well.- Sensitivity analysis with respect to prior choices (i.e. checking whether the result changes a lot if prior is changed).
__This should be reported for all models.__ - Model comparison (e.g. with LOO-CV).
- Discussion of issues and potential improvements.
- Conclusion what was learned from the data analysis.
- Self-reflection of what the group learned while making the project.

You can check also the peergrade rubric for the project report (updated 2022-11-30 18:00).

For guidance on how many digits to report, see Digits notebook.

In addition to the submitted report, each project must be presented by the authoring group, according to the following guidelines:

- The presentation should be high level but sufficiently detailed information should be readily available to help answering questions from the audience.
- The duration of the presentation should be 10 minutes (groups of 1-2 students) or 15 minutes (groups of 3 students).
- At the end of the presentation there will be an extra 5-10 minutes of questions by anyone in the audience or two members of the course staff who are present. The questions from lecturer/TAs can be considered as an oral exam questions, and if answers to these questions reveal weak knowledge of the methods and workflow steps which should be part of the project, that can reduce the grade.
- Grading will be done by the two members of the course staff using standardized grading instructions.

Specific recommendations for the presentations include:

- The first slide should include project’s title and group members’ names.
- The chosen statistical model(s), including observation model and priors, must be explained and justified,
- Make sure the font is big enough to be easily readable by the audience. This includes figure captions, legends and axis information,
- The last slide should be a summary or take-home-messages and include contact information or link to a further information. (The grade will be reduced by one if the last slide has only something like “Thank you” or “Questions?”),
- In general, the best presentations are often given by teams that have frequently attended TA sessions and gotten feedback, so we strongly recommend attending these sessions.

More details on the presentation sessions in 2022

- Presentations will be given on campus
- If you reserved a presentation slot but need to cancel, do it ASAP via the reservation link on MyCourses. If you cancel after the reservation link has closed, send a message to Elena Shaw (TA) on Zulip.
- As we have many presentation in each slot join the meeting in time. Late arrivals will lower the grade. Very late arrivals will fail the presentation and can present in period III.
- It is easiest if just one from the group shares the slides, but it is expected that all group members present some part of the presentation orally.
- Presentation time is 10 min for 1-2 person groups and 15min for 3 person groups
- Time limit is strict. It’s good idea to practice the talk so that you get the timing right. Staff will announce 2min and 1min left and time ended. Going overtime reduces the grade.
- After the presentation there will be 5min for questions, answers, and feedback.
- Each student has to come up with at least one question during the session. Students can ask more questions.
- Staff will ask further questions (kind of oral exam)
- Grading of the project presentation takes int account
- Students show understanding of essential concepts covered in the course
- TAs will ask clarifying questions to check these

- Inclusion of required workflow steps
- Clarity of the problem, model and prior description
- Focus of the presentation
- focus on the most relevant points in the limited time

- Overall clarity and coherence
- how well the slides and presentation flows
- effective presentation of the results

- Slide design and visual aspects
- e.g. use of figures, font size, number of digits
- effective use of figures

- Lack of typos, grammatical and spelling errors
- Oral
- clarity of oral presentation, rhythm and timing, rate of speech
- clarity of voice projection and appropriate volume
- presenting style created interest and was engaging
- presence (e.g. speaking to the audience / camera)
- possible distracting mannerisms

- Timing
- point reduction if overtime
- fair amount of time for all group members

- Bonus: going beoynd the expectations
- you may get a bonus point for going beoynd the explicit requirements

- Students show understanding of essential concepts covered in the course
- Students will also self-evaluate their project. Instructions for this will provided.

As some data sets have been overused for these particular goals, note that the following ones are forbidden in this work (more can be added to this list so make sure to check it regularly):

- extremely common data sets like titanic, mtcars, iris
- Baseball batting (used by Bob Carpenter’s StanCon case study).
- Data sets used in the course demos

It’s best to use a dataset for which there is no ready made analysis in internet, but if you choose a dataset used already in some online case study, provide the link to previous studies and report how your analysis differs from those (for example if someone has made non-Bayesian analysis and you do the full Bayesian analysis).

Depending on the model and the structure of the data, a good data set would have more than 100 observations but less than 1 million. If you know an interesting big data set, you can use a smaller subset of the data to keep the computation times feasible. It would be good that the data has some structure, so that it is sensible to use multilevel/hierarchical models.

If you’re looking for inspiration or you’re not sure where to begin, take a browse over this list of datasets arranged by topic, the datasets mentioned in the lecture slides (see slide 6), or else look at some of these publically accessible databases:

- Every parameter needs to have an explicit proper prior. Improper flat priors are not allowed.
- A hierarchical model is a model where the prior of certain parameter contain other parameters that are also estimated in the model. For instance,
`b ~ normal(mu, sigma)`

,`mu ~ normal(0, 1)`

,`sigma ~ exponential(1)`

. - Do not impose hard constrains on a parameter unless they are natural to them.
`uniform(a, b)`

should not be used unless the boundaries are really logical boundaries and values beyond the boundaries are completely impossible. - At least some models should include covariates. Modelling the outcome without predictors is likely too simple for the project.
`brms`

and`rstanarm`

can be used, but you need to report the priors used (including reporting the priors`brms`

and`rstanamr`

assign by default).

The following case study examples demonstrate how text, equations, figures, and code, and inference results can be included in one report. These examples don’t necessarily have all the workflow steps required in your report, but different steps are illustrated in different case studies and you can get good ideas for your report just by browsing through them.

- BDA R and Python demos are quite minimal in description of the data and discussion of the results, but show many diagnostics and basic plots.
- Some Stan case studies focus on some specific methods, but there are many case studies that are excellent examples for this course. They don’t include all the steps required in this course, but are good examples of writing. Some of them are longer or use more advanced models than required in this course.
- Bayesian workflow for disease transmission modeling in Stan
- Model-based Inference for Causal Effects in Completely Randomized Experiments
- Tagging Basketball Events with HMM in Stan
- Model building and expansion for golf putting
- A Dyadic Item Response Theory Model
- Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan
- Hierarchical model for motivational shifts in aging monkeys

- Some StanCon case studies (scroll down) can also provide good project ideas.