Behind the Scenes: How We Scraped 250 Thousand Candidate Data for the 2024 Election

Zakiego

@zakiego

Background

This project started from a tweet by Mas Gilang (@mgilangjanuar) on December 15, 2023, but the data collection was only completed on January 23, 2024. Many challenges came up along the way, which is why I want to share the lessons learned from this project.

To be clear, this was just a side project. It had no specific purpose other than the enjoyment of data scraping, and not a single cent was earned from this data collection.

Method

The method used was a simple fetch in JavaScript:

const response = await fetch("http://example.com/movies.json");
const movies = await response.json();
console.log(movies);

The fetch was done against endpoints that could be accessed publicly on the KPU (General Elections Commission) website. The responses, which came back as JSON or HTML, were then parsed to obtain clean data. That's it.

More specifically, there were three steps (a rough code sketch follows the list):

  1. getListDapil: First, gather the list of electoral districts (dapil)
  2. getListCalon: Then, fetch each dapil to get the list of candidates in that area
  3. getProfilCalon: Finally, use the names to fetch with a POST method to an endpoint to retrieve the candidate profiles
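
In code, the flow looked roughly like the sketch below. The endpoint paths, parameters, and response shapes here are assumptions for illustration; only the three-step structure reflects what the scraper actually did.

// Rough sketch of the three-step flow; endpoints and field names are assumptions.
type Dapil = { kode: string; nama: string };
type Calon = { nomor: number; nama: string };

const BASE = 'https://infopemilu.kpu.go.id';

// Step 1: fetch the list of electoral districts (dapil)
async function getListDapil(): Promise<Dapil[]> {
  const res = await fetch(`${BASE}/hypothetical-dapil-endpoint`);
  return res.json();
}

// Step 2: fetch the candidates for one dapil
async function getListCalon(dapil: Dapil): Promise<Calon[]> {
  const res = await fetch(`${BASE}/hypothetical-calon-endpoint?dapil=${dapil.kode}`);
  return res.json();
}

// Step 3: POST the candidate's name to retrieve the full profile
async function getProfilCalon(calon: Calon): Promise<unknown> {
  const res = await fetch(`${BASE}/hypothetical-profile-endpoint`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ nama: calon.nama }),
  });
  return res.json();
}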

Problems

During the scraping process, there were many issues encountered, both from the scraper side and from the server side.

Large Data

// First attempt: collect every response in an in-memory array,
// then dump the whole array to a single JSON file with Bun.
const list = [];
const resp = await fetch('https://example.com/api').then((res) => res.json());
list.push(resp);
await Bun.write('data.json', JSON.stringify(list));

The data collected was stored in an array. The array was then saved in a JSON file.

Additionally, I used Bun (rather than Node.js with npm, pnpm, or Yarn) because it was easier to use.

If the data were small, this wouldn't be a problem. In total, however, there were more than 250,000 candidates, which made the JSON file large and therefore slow to read and update.
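
To make the cost concrete: with a single JSON file, every update means reading, parsing, appending to, and rewriting the whole file. A sketch of that pattern, with newRecord standing in for one scraped candidate:

// Stand-in for one newly scraped candidate record.
const newRecord = { nama: 'John Doe' };

// Each update re-reads, re-parses, and re-writes the entire file,
// which gets slower and slower as the file grows.
const file = Bun.file('data.json');
const existing: unknown[] = (await file.exists()) ? await file.json() : [];
existing.push(newRecord);
await Bun.write('data.json', JSON.stringify(existing));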

So, I switched to SQLite, using Drizzle ORM. Compared to JSON, SQLite, being a real database, made the process of selecting, inserting, and updating data much easier.
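
A minimal sketch of that setup with Bun's built-in SQLite driver, assuming illustrative table and column names (list_anggota with nomor, nama, and the is_fetched flag described in the next section):

import { Database } from 'bun:sqlite';
import { drizzle } from 'drizzle-orm/bun-sqlite';
import { sqliteTable, integer, text } from 'drizzle-orm/sqlite-core';

// Illustrative schema; the real project may use different names and extra columns.
export const listAnggota = sqliteTable('list_anggota', {
  nomor: integer('nomor').primaryKey(),
  nama: text('nama').notNull(),
  isFetched: integer('is_fetched', { mode: 'boolean' }).notNull().default(false),
});

const sqlite = new Database('data.sqlite');
export const db = drizzle(sqlite);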

Marking

Unfortunately, because there was so much data, not all fetch requests for profiles succeeded. This was due to the server's limited ability to respond to requests.

There were 250,000 candidates in this election, meaning I needed to perform 250,000 getProfilCalon fetches to get the profiles of all candidates—an enormous number.

In such a situation, it was impossible to do everything in one go. Why? Because in the middle of the process, some fetch requests would fail, either because the server was overloaded or because my network went down.

Therefore, I had to create a workaround to tolerate failures. The solution was that when the script was rerun, it would only fetch profiles for candidates that hadn't been fetched yet.

How did I mark which candidates had already had their profiles fetched? This is where SQLite made things easier.

When running the getListCalon function, I stored the list of candidate names in a table that also had an is_fetched column. Whenever getProfilCalon succeeded for a candidate, that row was marked as true.

| nomor | nama          | is_fetched |
|-------|---------------|------------|
| 1     | John Doe      | true       |
| 2     | Jane Smith    | true       |
| 3     | Bob Johnson   | true       |
| 4     | Alice Brown   | true       |
| 5     | Charlie Davis | false      |
| 6     | Eve Wilson    | false      |

As a result, I could query like this:

// With Drizzle (db and listAnggota come from the schema setup above; eq is imported from 'drizzle-orm'), SELECT * FROM list_anggota WHERE is_fetched = false becomes:
const listNotFetched = await db.select().from(listAnggota).where(eq(listAnggota.isFetched, false));

From listNotFetched, we then run getProfilCalon to get the profile details of the candidates. Since only is_fetched = false is queried, there's no need to fetch everything from the start; we only fetch the profiles that haven't been retrieved yet.
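
Put together, a single re-run pass looked roughly like the sketch below, using the illustrative schema from earlier and ignoring the batching described in the next sections; saveProfil is a hypothetical helper for persisting the profile data.

import { eq } from 'drizzle-orm';

// Query only the candidates that are still missing a profile.
const listNotFetched = await db
  .select()
  .from(listAnggota)
  .where(eq(listAnggota.isFetched, false));

for (const calon of listNotFetched) {
  try {
    const profil = await getProfilCalon(calon);
    await saveProfil(profil); // hypothetical helper that stores the profile
    // Mark the row so the next run skips this candidate.
    await db
      .update(listAnggota)
      .set({ isFetched: true })
      .where(eq(listAnggota.nomor, calon.nomor));
  } catch {
    // On failure, is_fetched stays false and the candidate is retried on the next run.
  }
}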

Separating Commands

Still related to failure tolerance, it was impractical to rerun the entire script every time. For context, there were 4 categories of data collected:

  • DPD
  • DPR RI
  • DPRD Province
  • DPRD City/Regency

To tolerate failures and avoid running everything from the start, I split the commands to run the script.

const category = process.argv[2];
const command = process.argv[3];

switch (category) {
  case 'dpr-ri':
    switch (command) {
      case 'get-list-dapil':
        dpr.getListDapil();
        break;
      case 'get-list-calon':
        dpr.getListCalon();
        break;
      default:
        console.log('dpr-ri command not found');
    }
    break;
  ...
  default:
    console.log('Unknown command');
}

With the code above in index.ts, the command would vary according to need. For example, to get the list of dapil for DPRD Province, the command would be bun run index.ts dprd-provinsi get-list-dapil.

Similarly, to get the list of DPR RI candidates, the command would be bun run index.ts dpr-ri get-list-calon.

Batch Fetching

Remember, we had to perform 250,000 fetches. There are several options for doing this:

  1. Default: For each fetch, we wait (await) for one fetch to complete before doing the next one. In other words, we can only perform one fetch at a time, making this very slow. Assuming each fetch takes 2 seconds, multiplying by 250,000 would require 500,000 seconds, or 5.79 days. That’s a very long time. And this is just the fetching process, not including data processing.
  2. Parallel: Fetches are performed simultaneously. Imagine firing off 250,000 fetches to a single server at once—what would happen? Right, the server would crash.
  3. Batching: A middle ground between one-at-a-time fetch and all-at-once fetch. Here, we divide the fetches into batches, say 100 fetches per batch. After one batch finishes, we move to the next one. This method is more server-friendly.

The third method was used. Luckily, Mas @gadingnstn had created a library called Concurrent Manager, which handles batch concurrent promises, so I didn’t have to write a batching script from scratch.

import ConcurrentManager from 'concurrent-manager';

// Process at most 10 queued promises at a time.
const concurrent = new ConcurrentManager({
  concurrent: 10,
  withMillis: true,
});

// Queue one profile fetch per candidate...
for (const calon of list) {
  concurrent.queue(() => {
    return getProfilCalon(calon);
  });
}

// ...then run the queue, 10 requests at a time.
concurrent.run();

Anti-Bot

After solving the problems on our side, there was another issue with the server. When we performed 50 fetches at once, the server would automatically block the IP for a certain period.

To resolve this, I did two things:

1. Delaying Between Fetches

Initially, 1000 fetches were fired off at once. However, the response was a 403 error because the server blocked the IP. Reducing the number to 50 fetches and adding a few seconds of delay between each fetch solved the problem.

export const sleep = (ms: number) =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Wait a random amount of time between min and max milliseconds,
// so the requests don't arrive at a perfectly regular rate.
export const sleepRandom = async (min: number, max: number) => {
  const delay = Math.floor(Math.random() * (max - min)) + min;
  await sleep(delay);
};

// A fetch wrapper that waits 1.5-3 seconds before every request.
export const customFetch = async (url: string, options?: RequestInit) => {
  await sleepRandom(1_500, 3_000);
  const res = await fetch(url, { ...options });
  return res;
};

2. Using Cloudflare Worker

Even though I solved the blocking issue by reducing concurrent fetches to 50 with delays, this made the process slower. After trying different approaches, I realized the rate-limiting was based on IP. So, I used Cloudflare Workers to distribute fetches across different IPs.

The simplified setup became:

  • Before: User -> KPU Server
  • After: User -> Cloudflare Worker -> KPU Server

Cloudflare Workers made the request using their own IPs, so my IP was not blocked.

Here’s the basic code for the worker:

export default {
  async fetch(request, env, ctx) {
    console.log("Incoming Request:", request);
    // Forward every incoming request to the KPU profile endpoint,
    // reusing the original request as the fetch options.
    const targetURL = "https://infopemilu.kpu.go.id/Pemilu/Dct_dprprov/profile";
    const response = await fetch(targetURL, request);
    console.log("Response from Target:", response);
    return response;
  },
};

To speed up data scraping, I created 3 Cloudflare Workers. The final setup looked like this:

  • User -> KPU Server
  • User -> Cloudflare Worker 1 -> KPU Server
  • User -> Cloudflare Worker 2 -> KPU Server
  • User -> Cloudflare Worker 3 -> KPU Server
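
To spread requests across these routes, the scraper only needs to rotate through the endpoints. A minimal round-robin sketch, assuming hypothetical worker URLs and the customFetch helper from earlier (the direct KPU endpoint could be added as a fourth route):

// Hypothetical worker URLs; each worker forwards the request from its own IP.
const endpoints = [
  'https://kpu-proxy-1.example.workers.dev',
  'https://kpu-proxy-2.example.workers.dev',
  'https://kpu-proxy-3.example.workers.dev',
];

let counter = 0;

// Each call goes to the next endpoint in round-robin order.
const fetchViaWorker = (options?: RequestInit) =>
  customFetch(endpoints[counter++ % endpoints.length], options);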

Conclusion

This data scraping project became a valuable learning experience. Even though it started as a simple experiment, it revealed challenges and solutions for dealing with large-scale data scraping. By addressing issues like large data, batch fetching, anti-bot measures, and database management, I gained useful insights into web scraping techniques and problem-solving strategies.