West Nile virus forecasting challenge: Predicting the annual number of West Nile virus neuroinvasive disease cases by U.S. county
West Nile virus (WNV) is the leading cause of arboviral disease in the contiguous United States. An estimated 70–80% of WNV infections are asymptomatic; 20–30% of infected persons develop an acute systemic febrile illness and <1% of infected persons develop neuroinvasive disease (e.g., meningitis, encephalitis, or myelitis). Among patients with neuroinvasive disease, the case-fatality ratio is approximately 10%. Due to its severity and distinctive clinical features, diagnosis and reporting of neuroinvasive disease are considered more consistent and complete than those of non-neuroinvasive disease.
The first cases of WNV disease in the United States were identified in New York City in 1999; the virus subsequently spread westward, reaching the Pacific coast in 2003. Since then, WNV has caused seasonal summer outbreaks that vary in size and location, with most areas having sporadic disease or intermittent outbreaks. No vaccine or specific treatment for WNV infection is currently available. Reducing mosquito exposure through vector control and personal protective behaviors is the primary form of prevention. Predicting where and when WNV transmission will occur could help direct public health control efforts.
This is an open forecasting challenge to predict the total number of neuroinvasive WNV disease cases for each county in the contiguous United States that will be reported to ArboNET during the 2020 calendar year. The forecasting target is described on the Target page. The counties and historical data for all counties are available on the Data page. Participation guidelines are described on the Participation page and evaluation criteria are described on the Evaluation page.
- Project announcement and historical data release: Winter 2020.
- Initial forecast due: April 30, 2020.
- Additional forecasts due (optional): May 31, June 30, and July 31, 2020.
How to participate
To participate in the challenge, one team member must register on this website and, after logging in, register specifically for the 2020 WNV Forecasting Challenge (instructions for registration). The forecast submissions are connected to this account. Only one set of forecasts is allowed per account and only one account per set of forecasts (i.e. the same account must be used for all submissions).
Full participation requires:
- Electronic submission of forecasts for all included counties by the initial deadline (details below and on the Evaluation page).
- Submission of a model description document by email (see the Model Methodology page).
Forecasts should be submitted as CSV files matching the format in this template. Each CSV file should contain forecasts for all counties (n = 3,108). For internal record keeping, teams may find it useful to include the forecast due date or submission date in the file name. Because of county name and FIPS code changes, some counties may not appear exactly the same in this template as they do in the historical data. Only forecasts for counties that exist in 2020 are required. See the Data page for more information.
The forecast file includes a set of lines for each forecast representing binned probabilities for the range of outcomes. Each bin is defined by an inclusive minimum and a non-inclusive maximum. For example, the bin defined by bin_start_incl = 1 case and bin_end_notincl = 6 cases is assigned the probability that the number of cases is greater than or equal to 1 and less than 6 (i.e., 1, 2, 3, 4, or 5 cases are reported; 1 <= x < 6). The following set of bins is used for each forecast: 0 <= x < 1, 1 <= x < 6, 6 <= x < 11, …, 46 <= x < 51, 51 <= x < 101, 101 <= x < 151, 151 <= x < 201, 201 <= x < 1000. Each of these bins should have a probability between 0 and 1.0 (inclusive), and the probabilities assigned to the set of bins for one county should sum to 1.0. The forecast file also includes a line for each forecast representing the point prediction, i.e., the most likely outcome for the specific target. A value for the point prediction is required for submission; however, the point prediction will not be evaluated for this challenge.
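As a minimal sketch of the bin scheme described above, the following Python builds the list of bins and looks up the bin that contains a given case count. The function names `make_bins` and `find_bin` are hypothetical, and the width-5 step used for the bins between 1 and 51 is inferred from the bounds listed in the text rather than taken from the template.

```python
def make_bins():
    """Return the forecast bins as (inclusive start, non-inclusive end) pairs."""
    bins = [(0, 1)]
    # 1 <= x < 6, 6 <= x < 11, ..., 46 <= x < 51 (assumed width-5 bins)
    bins += [(start, start + 5) for start in range(1, 51, 5)]
    # Wider bins for large case counts
    bins += [(51, 101), (101, 151), (151, 201), (201, 1000)]
    return bins


def find_bin(cases, bins):
    """Return the (start, end) bin with start <= cases < end."""
    for start, end in bins:
        if start <= cases < end:
            return (start, end)
    raise ValueError(f"no bin contains {cases} cases")


BINS = make_bins()
```

Because the upper bound is non-inclusive, a count of 5 falls in the bin starting at 1, while a count of 6 falls in the bin starting at 6.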
Each row in the submission file represents a single bin and includes the following columns:
- location: “State” and “County” as written in the data files, joined with a hyphen: “State-County”. For example, “California-San Diego” or “Texas-Harris”. Do not include the word “County”, and include spaces between words within the county or state name. The easiest way to accomplish this is to match the template available above to the input data.
- target: “Total WNV neuroinvasive disease cases”
- type: “Bin” or “Point”. “Bin” specifies that the prediction is for a bin covering a range of possible outcomes. “Point” specifies the total predicted cases but will not be evaluated.
- bin_start_incl: The inclusive lower bound for the bin, e.g. 0, 1, 6, 11, …, 151, 201.
- bin_end_notincl: The non-inclusive upper bound for the bin, e.g. 1, 6, 11, 16, …, 201, 1000.
- value: For “Bin” rows, the probability that the number of neuroinvasive disease cases falls in the bin defined by bin_start_incl and bin_end_notincl. This probability should be greater than or equal to 0 and less than or equal to 1.0 for all bins per county. For “Point” rows, the value can be zero or any positive integer; it must be present but will not be evaluated for this challenge.
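The row layout described above could be assembled and checked locally before upload along the following lines. This is a sketch under stated assumptions, not the organizers' validator: the helper names `county_rows` and `to_csv`, the summation tolerance, the use of "NA" in the bin columns of the “Point” row, and the exact header spelling are all assumptions, so the headers and any filler values should be copied from the official template.

```python
import csv
import io

TARGET = "Total WNV neuroinvasive disease cases"
# Header spelling is an assumption; copy the exact headers from the template.
COLUMNS = ["location", "target", "type",
           "bin_start_incl", "bin_end_notincl", "value"]
# (inclusive start, non-inclusive end); width-5 bins between 1 and 51.
BINS = ([(0, 1)] + [(s, s + 5) for s in range(1, 51, 5)]
        + [(51, 101), (101, 151), (151, 201), (201, 1000)])


def county_rows(state, county, probs, point, tol=1e-6):
    """Return one Point row plus one Bin row per bin for a single county."""
    if len(probs) != len(BINS):
        raise ValueError(f"expected {len(BINS)} probabilities, got {len(probs)}")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each bin probability must lie in [0, 1]")
    if abs(sum(probs) - 1.0) > tol:  # tolerance is an assumed value
        raise ValueError("bin probabilities for a county must sum to 1.0")
    if point < 0 or point != int(point):
        raise ValueError("point prediction must be zero or a positive integer")

    location = f"{state}-{county}"  # e.g. "Texas-Harris", no "County" suffix
    rows = [{"location": location, "target": TARGET, "type": "Point",
             "bin_start_incl": "NA", "bin_end_notincl": "NA",  # assumed filler
             "value": int(point)}]
    for (start, end), p in zip(BINS, probs):
        rows.append({"location": location, "target": TARGET, "type": "Bin",
                     "bin_start_incl": start, "bin_end_notincl": end,
                     "value": p})
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example: a county where zero cases is considered most likely.
example_probs = [0.9, 0.1] + [0.0] * 13
example_rows = county_rows("Texas", "Harris", example_probs, point=0)
csv_text = to_csv(example_rows)
```

A full submission would concatenate the rows for every county in the template into a single file; checking each county's probabilities before writing catches the most common format errors early.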
Registered participants will have access to a Submit page, where individual CSV files can be uploaded at any time before the relevant deadline. The submission format is validated automatically when a forecast is uploaded. All forecasts are for the full 2020 calendar year. Initial forecasts are due on April 30, 2020; updated forecasts may be submitted by May 31, June 30, and July 31, 2020. Updated forecasts are optional and may use newly acquired data or updated methods. A forecast may be submitted and updated at any time prior to its due date. Successful submission can be checked by clicking on the “Open JSON” link (JSON is the format used by the server).
Final reported data for 2020 will be provided to all participants in spring 2021. An analysis will be conducted using the average logarithmic score to assess and compare forecasts across all counties at each time point. A joint manuscript will be prepared to disseminate findings on this comparison and the general performance of submitted forecasts. Participants may publish their own forecasts and results at any time.
To be eligible, teams must:
- Submit forecasts for all 3,108 counties.
- Submit forecasts electronically prior to the deadline (April 30, 2020).
- Submit a model description (see Model Methodology page).
Preliminary results will be distributed to all teams.
If $p$ is the set of probabilities for a given forecast, and $p_i$ is the probability assigned to the observed outcome $i$, the logarithmic score is:
$$S(p, i) = \ln(p_i)$$
For each forecast of each target, $p_i$ will be set to the probability assigned to the single bin containing the observed outcome. Undefined natural logs (which occur when the probability assigned to the observed outcome was 0) will be assigned a value of -10.
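The scoring rule above can be illustrated with a short Python sketch. The function names and the dictionary representation of a forecast are hypothetical; only the rule itself (natural log of the probability in the observed bin, with -10 substituted when that probability is 0) comes from the description above.

```python
import math


def log_score(forecast, observed_cases, undefined_value=-10.0):
    """Score one forecast given as {(bin_start_incl, bin_end_notincl): probability}."""
    for (start, end), prob in forecast.items():
        if start <= observed_cases < end:
            # ln(0) is undefined, so the fixed value is used instead.
            return math.log(prob) if prob > 0 else undefined_value
    raise ValueError(f"no bin contains {observed_cases} cases")


def average_log_score(scores):
    """Average the per-county scores for one forecast time point."""
    return sum(scores) / len(scores)


# Toy forecast with only three bins, for illustration.
toy_forecast = {(0, 1): 0.8, (1, 6): 0.2, (6, 11): 0.0}
score_hit = log_score(toy_forecast, observed_cases=0)    # ln(0.8)
score_zero = log_score(toy_forecast, observed_cases=7)   # probability 0, so -10
mean_score = average_log_score([score_hit, score_zero])
```

Because a zero probability on the observed bin costs 10 score units, assigning a small nonzero probability to unlikely bins generally scores better on average than assigning them exactly 0.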
- Gneiting T and AE Raftery. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 102(477):359-378. Available at: https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf.
- Rosenfeld R, J Grefenstette, and D Burke. (2012) A Proposal for Standardized Evaluation of Epidemiological Models. Available at: http://delphi.midas.cs.cmu.edu/files/StandardizedEvaluationRevised12-11-09.pdf.
Model description submission
The initial forecast submission should be accompanied by a completed Model description form submitted by email to the organizers (email@example.com). If updates are made to the model for subsequent forecasts, an updated model description should be provided to the organizers.