The list of algorithms was compiled by finding references to the algorithms on government websites. Our methods are meant to be pragmatic, not exhaustive, so there may be gaps. We look forward to you submitting your tips, and even volunteering to help vet tips and add them to the database, so that it can be fleshed out both domestically and internationally over time.
First, we created a list of search terms that are or should be related to algorithms. The challenge was to predict how algorithms are being referred to across all government agencies.
We then used those terms in Google searches, restricting results to .gov domains. Initially, the results were noisy because research papers hosted on government websites described their methodologies using many of the same search terms. While informative, such papers did not clearly lead to socially relevant algorithms on the cusp of adoption by government agencies, so we decided to exclude them. Because the National Institutes of Health (NIH) website hosted a majority of those research papers, we removed it from our search. As a result, the database may under-represent health-related algorithms.
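The query construction described above can be sketched as follows. This is only an illustration: the search terms are hypothetical placeholders (the actual term list is not reproduced here), and the use of the `-site:` operator to drop NIH is an assumption about how the exclusion could be expressed, not a record of how it was done.

```python
def build_query(term, excluded_sites=("nih.gov",)):
    """Return a search query limited to .gov results, minus excluded sites."""
    # Quote the term so multi-word phrases are matched as a phrase.
    parts = [f'"{term}"', "site:.gov"]
    # Each excluded site becomes a negative site: operator.
    parts += [f"-site:{site}" for site in excluded_sites]
    return " ".join(parts)


# Hypothetical search terms, for illustration only.
SEARCH_TERMS = ["algorithm", "risk assessment tool"]

queries = [build_query(term) for term in SEARCH_TERMS]
for query in queries:
    print(query)
```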
The next step was scraping all the pages returned by the Google searches. There were 5,337 links in total, with 300 showing up more than once across different search queries. We extracted the top-level URL from each link and compared it to a database of .gov domains compiled by the General Services Administration (GSA), which let us assign each link to its respective agency.
Of the 5,337 links, 2,908 were from federal agencies, which are the initial focus of this research. The remaining 2,429 belong to state and local governments.
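The de-duplication and domain matching described above might look roughly like the sketch below. The `GSA_DOMAINS` dictionary is a tiny hypothetical stand-in for the GSA list of .gov domains, and reducing a host name to its last two labels is a simplifying assumption about what "top-level URL" means here.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-in for the GSA list: domain -> (organization, level).
GSA_DOMAINS = {
    "epa.gov": ("Environmental Protection Agency", "federal"),
    "ca.gov": ("State of California", "state"),
}


def registered_domain(url):
    """Reduce a URL to its last two host labels, e.g. 'www.epa.gov' -> 'epa.gov'."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def categorize(urls):
    """Collapse repeated links and attach an agency and level to each one."""
    rows = []
    # Counter records how many search queries returned the same link.
    for url, hits in Counter(urls).items():
        domain = registered_domain(url)
        agency, level = GSA_DOMAINS.get(domain, (None, "unknown"))
        rows.append({"url": url, "hits": hits, "domain": domain,
                     "agency": agency, "level": level})
    return rows


links = ["https://www.epa.gov/a", "https://www.epa.gov/a", "https://data.ca.gov/b"]
federal = [row for row in categorize(links) if row["level"] == "federal"]
```

Links whose domain is missing from the lookup table are kept with the level "unknown" rather than dropped, so they can be reviewed by hand.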
Next we needed to determine how many of the 2,908 links actually pointed to an algorithm. An algorithm is a set of rules to which data can be input and from which a result – a decision, a recommendation, a score – is obtained. For our purposes, government algorithms are either actively being used in government operations or are being endorsed by the government to assist third-party actions. Algorithms can also be either computational (computer software or a spreadsheet) or not computational (a weighted score card or flowchart). The links returned from the search were not always indicative of actual algorithms. Sometimes the web results pointed to pages in which the terms that we defined as being oriented towards algorithms were used to describe processes that were either not automated or not used by government. The distinctions in how we tagged each link are further elaborated below under the section “metadata descriptions”.
Having boiled the list down to actual algorithms used or endorsed by the government, we next needed to determine whether these algorithms had the potential to be interesting to journalists. We decided to create a newsworthiness “rating” to attempt to capture whether the public would be interested to know more about whether the algorithm affected them.
For that, we turned to Harcup and O’Neill (“What is news? News values revisited (again)”, 2016), who propose “an updated set of contemporary news values that, in various combinations, seem to be identifiable within published news stories.” Harcup and O’Neill write that “potential news stories must generally satisfy one and preferably more” of 15 criteria. Our idea was to work backwards from these criteria to determine how they apply to algorithmic accountability.
Out of those 15 criteria, we found five that we felt could meaningfully be applied to algorithms:
From that, we formulated questions to evaluate the potential newsworthiness of each algorithm:
- Can this algorithm have a negative impact if used inappropriately?
- Can this algorithm raise controversy if adopted?
- Is the application of this algorithm surprising?
- Does this algorithm privilege or harm a specific subset of people?
- Does the algorithm have the potential of affecting a large population or section of the economy?
If the answer to any of these questions was “yes”, the algorithm could be included on the list. However, this was a discretionary process, and some questions carry more weight than others. A “yes” to the magnitude question may not be as important as a “yes” to the negative impact question, for instance.
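As a rough illustration, the five questions can be recorded as yes/no flags, with any single “yes” making an algorithm a candidate. The class and field names below are hypothetical, and the sketch deliberately assigns no numeric weights, since the relative importance of each question was left to editorial discretion.

```python
from dataclasses import astuple, dataclass


@dataclass
class NewsworthinessAnswers:
    """Yes/no answers to the five questions (field names are hypothetical)."""
    negative_impact: bool = False
    controversy: bool = False
    surprise: bool = False
    affects_subset: bool = False
    magnitude: bool = False

    def is_candidate(self):
        # A single "yes" is enough to consider the algorithm for inclusion.
        return any(astuple(self))

    def reasons(self):
        # Names of the questions answered "yes", kept for the human reviewer.
        return [name for name, value in vars(self).items() if value]


answers = NewsworthinessAnswers(negative_impact=True, magnitude=True)
print(answers.is_candidate(), answers.reasons())
```

The `reasons` list is there so that a reviewer can still apply judgment about which “yes” answers matter most.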
We further enriched each link compiled for our database with the following fields, in an attempt to characterize the various shades of algorithms we were encountering.
- Name: the name of the algorithm (trademarked name or working name) or, in the absence of a clear name, a short description.
- Description: General description of what the algorithm does, often quoting its own documentation.
- Why it is important: Possible impacts of the algorithm as they relate to the newsworthiness factors above.
- Topic: General fields or domains where the algorithm is used.
- Jurisdiction: name of the country, state, or local government associated with deploying the algorithm.
- Government level: whether that jurisdiction is federal, state, or local.
- Agency: name of the government agency that uses or recommends the algorithm.
- Proprietary: Whether or not the algorithm was developed by a contractor and is owned by a company.
- Creator/author/vendor: Name of company or agency that created that algorithm.
- Date: Month and year in which the algorithm was launched or updated.
- Adoption Stage: An algorithm can be directly used by the government (“Active use”) or it may simply be developed or shared by the government to be used by other governmental or non-governmental enterprises (“Endorsement for use”). In some cases, it is being studied or evaluated for use but not adopted yet (“Potential use”).
- Computational: Whether or not the algorithm is implemented in software as opposed to a non-software calculation such as a flowchart.
- Link: URL that links to algorithm documentation showing initial evidence of existence.
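The fields listed above could be modeled as a single record type, as in the sketch below. The Python names, the choice of optional fields, and the "YYYY-MM" date format are assumptions made for illustration; only the field meanings and the three adoption stages come from the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GovernmentLevel(Enum):
    FEDERAL = "federal"
    STATE = "state"
    LOCAL = "local"


class AdoptionStage(Enum):
    ACTIVE_USE = "Active use"
    ENDORSEMENT_FOR_USE = "Endorsement for use"
    POTENTIAL_USE = "Potential use"


@dataclass
class AlgorithmRecord:
    name: str                      # trademarked or working name, or short description
    description: str               # what the algorithm does
    why_important: str             # ties back to the newsworthiness questions
    topic: str                     # field or domain of use
    jurisdiction: str              # country, state, or local government
    government_level: GovernmentLevel
    agency: str                    # agency that uses or recommends it
    proprietary: Optional[bool]    # None when ownership is unknown (assumption)
    creator: Optional[str]         # company or agency that created it
    date: Optional[str]            # assumed "YYYY-MM" for month and year
    adoption_stage: AdoptionStage
    computational: bool            # software vs. score card or flowchart
    link: str                      # URL of the documentation


# Parsing a stage label as it might appear in a spreadsheet export.
stage = AdoptionStage("Endorsement for use")
```

Using enumerations for government level and adoption stage means a misspelled label raises a `ValueError` instead of silently creating a new category.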