How COVID19Japan.com Works

It’s been two years and counting, and we’re still in this pandemic.

I helped build the covid19japan.com site with Shane Reustle and a few others back when the pandemic started and we wanted to do something to help people stay informed.

I got involved initially because I wanted to know what the case numbers were in Japan, and also to help those who couldn't read Japanese keep abreast of what was going on. It also seemed like an interesting opportunity to work on a data viz project that was genuinely helpful.

During that time, many data tracking sites popped up, like the one from the Tokyo Metropolitan Government that became an open source project.

The site itself is based primarily on two sources of data: MHLW (Ministry of Health, Labour and Welfare) and NHK (the national broadcaster). The site merges those two sources into a Google Spreadsheet which we update every day, by hand and with some automated tools.

At first, we manually scanned through the news articles that NHK publishes and inputted the numbers we were seeing, along with the URL as a reference. Each prefecture's numbers are reported in a single NHK article which lists the number of people testing positive that day, deaths that day, and any other data. NHK also publishes a summary article that gets updated during the day with a list of all the prefectures' numbers and day-over-day differences, including the total patients, recoveries and deaths for the whole of Japan.

From the MHLW site, we look at their total counts press release and data about patients. There is also one number we don't get from NHK: the number of cases detected at ports of entry.

This is quite labor intensive and requires a lot of clicking around and typing, so some tools were built to make inputting the data easier. In the last few months, I built up a bit of automation that uses OCR (tesseract) to extract numbers from MHLW reports (which are still published as a screenshot of what I guess is an Excel file) and from PDF files with data. Matching against NHK news articles is also automated, extracting the key figures they publish. All of these run on Google Cloud Functions. The automation is not perfect, so we go in every day to verify the numbers and make corrections.
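
As a rough illustration, a minimal sketch of the OCR step using tesseract.js in Node might look like this (the actual pipeline runs on Google Cloud Functions; the file name and regex here are made up, not the project's actual code):

```js
// Sketch: OCR an MHLW report screenshot and pull out the digit runs.
const Tesseract = require('tesseract.js');

async function extractCountsFromScreenshot(imagePath) {
  const { data: { text } } = await Tesseract.recognize(imagePath, 'jpn');
  // Grab runs of digits (with thousands separators) from the recognized text.
  return (text.match(/[0-9,]+/g) || [])
    .map((n) => parseInt(n.replace(/,/g, ''), 10))
    .filter((n) => !Number.isNaN(n));
}

extractCountsFromScreenshot('mhlw-report.png').then(console.log);
```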

From the Google Spreadsheet, the data is pulled every 15 minutes via the Google Sheets API by a GitHub Action and summarized into a JSON file using a Node JS script. The JSON file is checked in and published through GitHub Pages. Charts are also generated in this script using d3.js and saved as simple SVG files.
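
The fetch-and-summarize step could look roughly like this sketch using the googleapis Node client; the spreadsheet ID, tab name, column layout and output path are all placeholders, not the project's actual ones:

```js
// Sketch: pull rows from the Google Sheet and write a summarized JSON file.
// Spreadsheet ID, range, row schema and output path are all hypothetical.
const fs = require('fs');
const { google } = require('googleapis');

async function exportSheetToJson() {
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/spreadsheets.readonly'],
  });
  const sheets = google.sheets({ version: 'v4', auth });
  const res = await sheets.spreadsheets.values.get({
    spreadsheetId: 'SPREADSHEET_ID',
    range: 'PatientData!A2:E',
  });
  const summary = (res.data.values || []).map(
    ([date, prefecture, confirmed, deaths, recovered]) => ({
      date,
      prefecture,
      confirmed: Number(confirmed),
      deaths: Number(deaths),
      recovered: Number(recovered),
    })
  );
  fs.writeFileSync('docs/summary.json', JSON.stringify(summary, null, 2));
}

exportSheetToJson();
```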

The site uses this JSON file and the SVG charts to render its pages. It's mostly pure JS, pulled together using webpack, with a partial rewrite in ReactJS. The map uses MapBox, though for cost reasons it is served as a static map that becomes interactive when you click on it. The interactive charts (the ones that are not pre-rendered SVG) are drawn using c3.js (which is built on d3.js).
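
A c3.js chart gets generated along these lines (the element ID and data values are illustrative, assuming c3 and d3 are already loaded on the page):

```js
// Sketch: a daily-cases bar chart with c3.js. All values are made up.
const chart = c3.generate({
  bindto: '#daily-cases-chart', // hypothetical element ID
  data: {
    x: 'date',
    columns: [
      ['date', '2022-01-01', '2022-01-02', '2022-01-03'],
      ['confirmed', 120, 180, 240],
    ],
    type: 'bar',
  },
  axis: {
    x: { type: 'timeseries', tick: { format: '%m/%d' } },
  },
});
```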

Over the two years, we’ve rewritten parts of the site several times to deal with issues, mainly scaling issues.

Google Sheets API Failures

At the start, we would hit the Google Sheets JSON feed API to render the data directly. This meant that as soon as we edited the spreadsheet, the data would be live. Several times, bad data input caused the site to go down. Then occasionally, the API endpoints would hit quota issues, which not only stopped the site from loading any data, but also meant we couldn't even edit the spreadsheet.

This was not great, and got worse as the data and traffic increased. Shane had already moved to serving the data from a JSON file rather than fetching straight from the API. We made one more change: the data is now served from GitHub Pages, refreshed by a GitHub Action that runs every 15 minutes. This solved a lot of our failures.
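
The client side then just fetches a static file, something like this (the URL and renderSite() are placeholders, not the site's actual code):

```js
// Sketch: fetch the pre-generated JSON from GitHub Pages instead of
// hitting the Sheets API on every page load.
fetch('https://example.github.io/covid19japan-data/summary.json')
  .then((res) => res.json())
  .then((summary) => renderSite(summary)); // renderSite is hypothetical
```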

Lesson: Don’t hit a Google Sheets API directly from a public website.

Reconciling Data from Governments

At the start of the pandemic, prefectural governments, city governments and MHLW would publish very detailed reports on every patient, including where they had been, and so on. NHK would use the same data in their news reports. In terms of freshness, the prefectural and city governments published the freshest data, followed by NHK, which used that data, and then a day later MHLW would publish their aggregated data.

Our approach was to get the most up-to-date data, so we had a huge table of prefectural government sites that we'd check manually to input the data. Each site had a different format, and just going through all of them was a huge task. The formats also kept changing, so it was near impossible to automate, and most of the data was in PDFs, which made it even harder.

The hardest part of all was that the data from prefectural and city governments would be duplicated or overlap, so we had to dedupe the data to avoid over-counting.
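
Conceptually, the dedupe looks something like this sketch; the identity key fields are a hypothetical guess, since in practice the matching was done by hand:

```js
// Sketch: drop records that appear in both a city report and its
// prefecture's report. The key fields here are hypothetical.
function dedupe(records) {
  const seen = new Set();
  return records.filter((record) => {
    const key = [
      record.prefecture,
      record.city,
      record.announcedDate,
      record.patientId || '', // reports often numbered their patients
    ].join('|');
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```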

It was generally a mess, but about a year into the pandemic, NHK had pretty much built a system that resolved all those conflicts, and we came to depend on the NHK data as the “source of truth”.

MHLW's data was always at least a day late: they would report yesterday's numbers in the afternoon, which meant that if you watched the news, the numbers MHLW reported differed from what the prefectural governments were reporting. So we only used MHLW for the recovery numbers, since the prefectural governments weren't reporting those consistently.

A few months into the pandemic, a group of Code for Japan developers built a site for the Tokyo Metropolitan Government called stopcovid19. It was a GitHub repo of a Nuxt.js app that governments could use to format their data and get a data viz solution for free. Governments generally did not have the technical know-how to use it, so many different volunteers made sites for their home prefectures. A year on though, many of those sites have stopped working, and some have been taken over by the prefectural governments, which continue to use that format.

Structure of the Spreadsheet

In the early days, we had one row for every patient. We debated this a lot because it was cumbersome to input, but we thought that data might be useful (e.g. knowing the city where each case was detected). A few months later, we realized the data wasn't really that useful: our data inputters were starting to omit it, and prefectures even stopped publishing it.

At one point, we ran out of rows in a tab. A tab couldn't have more than 1M rows, so we had to split the sheet into multiple tabs, mainly for the large prefectures like Tokyo.

After about a year, we couldn't realistically list out all the patients: the spreadsheet was super slow and the API calls were starting to fail. So we changed the format to record only counts rather than each individual patient.

Now, we just have one row per day per prefecture. We lost a lot of granularity, but as fewer people were involved with the site and updating the data, this was the best compromise.
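
For illustration, a row in the current format might look like this (the field names and values are made up):

```js
// Sketch: one row per day per prefecture, counts only.
const row = {
  date: '2022-01-15',
  prefecture: 'Tokyo',
  confirmed: 4051, // new cases announced that day (illustrative number)
  deaths: 4,
  recovered: 312,
};
```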

Automating the Data

At the start, when the stopcovid19 project was in full swing, we explored automating the collection of the data by using their data format. However, we soon discovered that their data was delayed just like MHLW's, and on top of that, the format was not consistent because each prefecture forked it and modified it to suit their needs.

Data Discrepancies

Today, the main problems with the data are the discrepancy between NHK's counts and MHLW's, and the application of data corrections from the prefectures.

First, NHK and MHLW counts do not match. Earlier in the pandemic, this was because MHLW would double-count patients who got infected twice while NHK would count them once, but NHK's numbers still deviate from MHLW's. This normally isn't a problem, but during the troughs of the pandemic we need the MHLW per-prefecture recovery number to calculate the active patients in a prefecture, which for some prefectures should be 0. If NHK undercounted patients, then MHLW recoveries would exceed NHK's count and the number of active patients would come out as -1.
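
A worked example of how this goes wrong, with made-up numbers:

```js
// Sketch: active patients in a prefecture during a trough.
// If NHK's cumulative count undercounts by one relative to MHLW:
const confirmed = 100; // cumulative cases per NHK (made up)
const recovered = 101; // cumulative recoveries per MHLW (made up)
const deaths = 0;
const active = confirmed - recovered - deaths; // => -1, which is impossible
```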

Secondly, prefectures often correct historical data. Sometimes they'd say: yesterday we reported 400 cases, but it was actually 399 once deduplicated. That meant we'd have to manually go back, find out which day they meant, and remove one, which could create a discrepancy when we went back to verify the data (diffing against the news report). So instead, we just aggregate the corrections as a single total of cases “removed” or “added”, and apply a smear across our numbers to ensure the final number we have matches the NHK or MHLW numbers.
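
The smear might be sketched like this; the window size and rounding are arbitrary choices here, not necessarily what the site actually does:

```js
// Sketch: spread an aggregate correction over the last few days so the
// running total matches NHK/MHLW. Window size is an arbitrary choice.
function smearCorrection(dailyCounts, adjustment, windowDays = 7) {
  const smeared = dailyCounts.slice();
  const start = Math.max(0, smeared.length - windowDays);
  const days = smeared.length - start;
  const perDay = Math.trunc(adjustment / days);
  let remainder = adjustment - perDay * days;
  for (let i = start; i < smeared.length; i++) {
    let delta = perDay;
    if (remainder !== 0) {
      // Distribute the leftover +/-1s one day at a time.
      delta += Math.sign(remainder);
      remainder -= Math.sign(remainder);
    }
    smeared[i] += delta;
  }
  return smeared;
}
```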

@liquidx