MultiHumES (Multilingual Humanitarian Response Dataset for Extractive Summarization) is a dataset of around 50K humanitarian documents in three languages: English, French, and Spanish. Approximately 35K of these documents are annotated with informative snippets and can be used to train and evaluate extractive summarization models. A paper describing the dataset and presenting three extractive summarization baselines, accepted by the European Chapter of the Association for Computational Linguistics (EACL) in 2021, can be found later in this section.
The collection originated from select organizations that used the DEEP platform. Although 82% of the uploaded documents come from publicly available sources, and more than 96% are labeled as non-confidential, we carried out an additional consultation process with the organizations involved to ensure that the released collection preserves the privacy and dignity of the affected population. The dataset contains documents analyzed from 2016 to 2019, related to projects in 159 countries. Among the documents with a specified source, around 46% come from media sources, 29% from international organizations, and the rest from various other organizations such as United Nations agencies, governments, academic and research institutions, NGOs, donors, and the Red Cross/Red Crescent Movement.