MSc. data science students write project assignment at Rooftop

The project team investigates how well weakly supervised machine learning selects and labels the information in court sentences that matters for predicting the verdict for a given indictment.

The data science assignment is to identify relevant indictment information in a large sample of court sentences. More specifically, we use Snorkel to select relevant entities (e.g. the accused, the drugs, the quantity of drugs) and entity relations (e.g. wordings like ‘the accused passed on the drugs against payment’), which together form the information needed to predict the outcome of a verdict.
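To make the idea of weak supervision concrete, here is a minimal sketch of how noisy rules can label sentences without hand annotation. It is not Snorkel's API and not the project's actual rule set: the keyword patterns, function names, and example sentences are all invented for illustration. Snorkel itself combines the votes with a learned label model; a plain majority vote is swapped in here to keep the example self-contained.

```python
import re
from collections import Counter

# Label values; ABSTAIN means a labeling function has no opinion.
ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1

# Hypothetical keyword patterns, invented for this illustration only.
DRUG_PATTERN = re.compile(r"\b(cocaine|cannabis|amphetamine|heroin)\b", re.I)
QUANTITY_PATTERN = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:g|grams?|kg)\b", re.I)
SALE_PATTERN = re.compile(r"\b(against payment|sold|passed on)\b", re.I)
PROCEDURAL_PATTERN = re.compile(r"\b(costs of the case|court adjourned)\b", re.I)


def lf_mentions_drug(sentence):
    return RELEVANT if DRUG_PATTERN.search(sentence) else ABSTAIN


def lf_mentions_quantity(sentence):
    return RELEVANT if QUANTITY_PATTERN.search(sentence) else ABSTAIN


def lf_mentions_sale(sentence):
    return RELEVANT if SALE_PATTERN.search(sentence) else ABSTAIN


def lf_procedural(sentence):
    return NOT_RELEVANT if PROCEDURAL_PATTERN.search(sentence) else ABSTAIN


LABELING_FUNCTIONS = [
    lf_mentions_drug,
    lf_mentions_quantity,
    lf_mentions_sale,
    lf_procedural,
]


def weak_label(sentence):
    """Combine labeling-function votes by simple majority; ties abstain."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


if __name__ == "__main__":
    for s in [
        "The accused passed on 5 grams of cocaine against payment.",
        "The accused shall pay the costs of the case.",
        "The hearing took place in Copenhagen.",
    ]:
        print(weak_label(s), s)
```

Each rule is allowed to be wrong or silent on its own; the value comes from combining many of them over a large sample of sentences.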

We have already OCR’ed the court sentences published by the national prosecution authority in Denmark, and we have structured and labeled the most important data. However, the severity of the indictments is not stated explicitly anywhere.

From the national prosecution authority’s published court sentences, we look at 1,300 Danish city court sentences covering violations of 259 different paragraphs.

The first step is to categorize the severity of the verdicts. The second is to identify wordings in the indictments that are correlated with the severity and length of the verdict for the different types of crime.

The assignment also covers identifying, training and testing the best prediction model, and then applying Explainable AI to learn which terms in the indictments are the most informative for predicting outcomes. We engage a legal expert to verify the approach and to ensure that the results are intuitive and correct.
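As a rough illustration of what "most informative terms" can mean, the sketch below scores each term by the smoothed log-odds of appearing in indictments with one outcome versus another. This is a simple stand-in and not the Explainable AI method the project uses; the function names, the two-class setup, and the toy indictment texts are assumptions made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def term_log_odds(docs, labels, alpha=1.0):
    """Smoothed log-odds of a term occurring in class-1 vs class-0 documents.

    Positive scores point towards class 1, negative scores towards class 0.
    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    pos_docs, neg_docs = Counter(), Counter()
    n_pos = n_neg = 0
    for text, label in zip(docs, labels):
        terms = set(tokenize(text))  # count each term once per document
        if label == 1:
            n_pos += 1
            pos_docs.update(terms)
        else:
            n_neg += 1
            neg_docs.update(terms)

    scores = {}
    for term in set(pos_docs) | set(neg_docs):
        p_pos = (pos_docs[term] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_docs[term] + alpha) / (n_neg + 2 * alpha)
        scores[term] = (math.log(p_pos / (1 - p_pos))
                        - math.log(p_neg / (1 - p_neg)))
    return scores


if __name__ == "__main__":
    # Made-up toy texts; 1 = more severe outcome, 0 = less severe outcome.
    docs = [
        "sold cocaine against payment",
        "passed on cannabis against payment",
        "possession of cannabis for own use",
        "possession of cocaine",
    ]
    labels = [1, 1, 0, 0]
    ranked = sorted(term_log_odds(docs, labels).items(), key=lambda kv: kv[1])
    for term, score in ranked:
        print(f"{score:+.2f}  {term}")
```

A ranking like this is something a legal expert can read and challenge directly, which is the point of pairing the model with an explanation step.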

We offer Frederik Andersen a steep learning curve in every dimension, and we are excited to contribute real-life data science experience.