Scanning at Scale
“We’ve chosen scale, and the conceptual apparatus to manage it, at the expense of finer-grained knowledge that could make a more just and equitable arrangement possible.” (Posner)
In a former Christian Science church in San Francisco, servers holding petabytes of data store thousands of cassettes, millions of books, and billions of web pages. This place where technophiles come to worship is known as the Internet Archive. Founded in 1996 by the dot-com era billionaire Brewster Kahle, the Internet Archive initially aimed to archive the entire world wide web. Twenty-seven years later, in 2023, the Internet Archive is the world’s largest free digital library, boasting over 800 billion webpages, 37 million books and texts, 15 million audio recordings, 9.7 million videos, 4.6 million images, and 983 thousand software programs. As such, it is a crucial piece of scholarly infrastructure enabling the study of culture at scale.
Aside from those 800 billion webpages, whose capture can be automated via web crawlers, human workers digitize all other content in the Internet Archive. This project is concerned with the digitization process through which the second most common content type, books, was ingested into the archive. The Internet Archive announced the launch of its book scanning program in a December 2004 blog post. According to the post, the program began as a partnership between the Internet Archive and ten academic libraries around the world in an effort to digitize free, open-access versions of books made available to the public via the web (Kaplan, “Open-Access Text Archives”). In the months that followed, the Internet Archive set up scanning centers in partner-institution libraries. While the Internet Archive celebrates some scanning workers and scanning centers on its blog, the process through which the books are scanned remains opaque.
Scanning Labor in the Internet Archive is a DH project that seeks to remedy this gap. Through a data-enabled spatial, labor, and oral history, Scanning Labor sheds light on and reappraises the work of book scanning workers in the Internet Archive. Scanning Labor analyzed 2.5 million metadata records of books and texts that IA scanning workers digitized between 2004 and 2022. Each scanning record contains the date on which the book was scanned and the name of the scanning center where the scan was captured. While these 2.5 million records are a small subset of all the books and texts in the Internet Archive, analysis of them revealed that around 2011 the Internet Archive shifted most of its book scanning operations from in-house scanning centers at academic libraries in the Global North to shipping vast quantities of books overseas for digitization at business process outsourcing firms in the Global South. The shift away from academic library scanning operations is also associated with a sharp uptick in the total number of books scanned per month and the number of books scanned per worker per day. However, IA rarely mentions these outsourced scanning centers in public press releases, and when pressed about them, Internet Archive officials ceased communication with our team and engagement with the Scanning Labor project.
This current project investigates the outsourcing pattern that the Scanning Labor team identified by pairing analysis of scanning metadata records with bills of lading that record the import of goods from overseas suppliers to the Internet Archive’s US locations. The project pieces these two data sources together to understand the infrastructure and workers who have made possible the Internet Archive’s production of culture at scale, visualizing probable connections between suppliers, scanning centers, scanning workers, and the books they digitized. Ultimately, this project is an experiment in computationally aided critical digital humanities. Through speculative data visualizations that never claim certainty, I attempt to use computational techniques to re-embed IA data in the contexts of its creation, to reappraise the work of the people who made it, and, in doing so, to break down the black-boxing of data, workers, and users upon which, I will suggest, culture at scale hinges.
A Note on Scale
Since the start of IA’s book scanning project in 2004, scanning workers have added 37 million books to the website. And yet every book contained within it, whether scanned in 2008 or 2023, is rendered in the same way: a flattened image of a book page on a black background. The platform for navigating the books is uniform, their metadata always appears below, and even the dimensions of each book appear identical. Indeed, for these books to be made discoverable and readable on the platform, they must be made to scale.
Scale, according to Anna Tsing, describes a system that can expand without rethinking the elements arranged within it. This particular type of expansion “is possible only if project elements do not form transformative relationships that might change the project as elements are added” (Tsing 507). As such, a scalable system by definition precludes diversity among the objects and agents included within it and bars new relationships among its elements. For example, a structured query language (SQL) database is a scalable system in the sense that each object in the database has predetermined relationships with other elements within it, but any new relationships that an added record may bring with it cannot be rendered into the database without changing its underlying design. Naturally occurring systems, of course, do not scale, for there are no real relationships that do not transform constituent parts, and expansion without change is not actually possible. Rather, these so-called scalable projects are “at best…articulations between scalable and nonscalable elements, in which nonscalable effects can be hidden from project investors” (Tsing 515).
Scalable projects hide transformative relationships through their modular design. Miriam Posner suggests that “modular systems manage complexity by ‘black-boxing’ any information or relationships that do not need to be known at a node in a system for the whole system to function” (Posner). Posner offers the example of the shipping container, which manages the complexity of global supply chains by making it such that “one doesn’t need to know what’s in the box, just where it needs to go” (Posner). In the case of a SQL database, the design renders invisible any connections between records that do not share queryable keys, that is, fields held in common across every record in the database.
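To make the point concrete, consider a minimal sketch using Python’s built-in sqlite3 module (a hypothetical schema, not IA’s actual database design): queries can only traverse relationships the schema anticipates, and expressing a new kind of relationship requires altering the design itself.

```python
# A minimal sketch (hypothetical schema, not IA's actual database) of how a
# relational design "black-boxes" relationships it was not built to express.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE books (
        identifier TEXT PRIMARY KEY,   -- queryable key shared by every record
        scanningcenter TEXT,           -- predetermined, searchable field
        scandate TEXT
    )
""")
conn.execute("INSERT INTO books VALUES ('bookA', 'shenzhen', '2012-07-01')")
conn.execute("INSERT INTO books VALUES ('bookB', 'cebu', '2020-03-15')")

# Queries can only follow relationships the schema already anticipates:
for row in conn.execute("SELECT identifier FROM books WHERE scanningcenter = 'shenzhen'"):
    print(row)

# A new kind of relationship -- say, which worker handled a book on which
# scribe machine -- cannot be represented without altering the design itself:
conn.execute("ALTER TABLE books ADD COLUMN scanner_operator TEXT")
```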
Leveraging a modular database design, the Internet Archive scans books at scale by obfuscating the transformative relationships between the digitized book scans and the physical codices, scanning centers, scribe machines, and scanning workers that created them. The Internet Archive’s scanning infrastructure consists of four interlocking parts: books, scanning centers, workers, and scribe machines. By replicating this arrangement of material resources, tools, and labor, the Internet Archive has been able to scale up its scanning program to digitize 37 million books. In the case of IA, the database infrastructure hides these transformative relationships from users by representing books identically regardless of scanning center and by making metadata fields like scanning center and scanning operator searchable only via custom query. However, through mining book metadata, this project has made visible the particularities of each scanning center and has begun to unearth the transformative relationships between book, scanning center, machine, and human operator.
DH and (Mass) Digitization
Text-based mass digitization projects like that of the Internet Archive differ from their earlier counterparts in their industrial scale, totalizing aims, and speed. Prior to the development of optical character recognition technologies, humans transcribed the text on book pages by hand to create machine-readable renderings of printed content (Coyle 642). Mass digitization projects, on the other hand, involve photographing each book page and running OCR programs on these images to produce searchable plain text versions. Early mass digitization projects, such as JSTOR in 1995 and the Million Book Project in 2001, were launched at academic libraries with funding from the Mellon Foundation and the National Science Foundation respectively (Guthrie; Reddy and St. Clair). Silicon Valley got involved soon after. The Google Books project, code-named Project Ocean, began covertly in 2002 as a partnership between Google and the University of Michigan when Google developers offered to digitize U-M library’s entire collection free of charge under one condition: the library had to sign a nondisclosure agreement. The company did not announce the project publicly until 2004. Seven million books later, Google’s scanning processes, the locations of its scanning centers, and the names of the people who worked there remain unknown. After the Google Books project went public, the Internet Archive unveiled a book scanning program of its own (Kaplan, “Open-Access Text Archives”). Next, in 2005, Microsoft began scanning books for a project called MSN Book Search (later Live Search Books), with its platform going live in December 2006 (Quint). As part of its efforts, Microsoft partnered with IA, supplying the archive with money and equipment for book scanning activities. Through 2007, the Google, Internet Archive, and Microsoft scanning programs expanded rapidly before the landscape shifted dramatically in 2008.
Prior to 2008, the major use case for mass digitization, as envisioned by Google and Microsoft, was to provide better answers to search engine queries. However, in 2008, with the shuttering of Microsoft’s project and the creation of HathiTrust, digital libraries became the predominant model. In May 2008, Microsoft abruptly shut down its scanning projects, and the Internet Archive inherited the program’s public domain scans, its scanning equipment, and many of its partnerships (Schiffman; Kaplan, “Books Scanning”). This left the Internet Archive and Google as the only two major organizations involved in book mass digitization projects. Then, in October of that same year, the HathiTrust digital library launched, containing all the digitized books from the Google Books partner libraries (“Launch of HathiTrust”). Unlike the Google Books and Microsoft Live Search platforms, the Internet Archive and HathiTrust make it possible to read digitized books in a browser and to download plain text versions of them. This increased public and academic access to digitized texts in quantities earlier digitization projects could only dream of.
Prior to the inauguration of mass digitization projects in the early 2000s, much digital humanities scholarship focused on how to remediate analog documents to enable their computational study. The earliest digital humanities projects (or at least those carried out on an electronic computer) took place in the immediate postwar era and consisted of computationally aided philological analysis. For example, Father Roberto Busa’s Index Thomisticus, conducted in partnership with IBM, aimed to identify textual patterns in Aquinas’s work. To do so, Busa worked in tandem with IBM engineers to develop a method for rendering his index in machine-readable punch cards for large-scale analysis. With the creation of affordable desktop computers by the 1980s, the subsequent establishment of TEI, and the public launch of the world wide web in the 1990s, digital archiving projects boomed (Hockey). In 1991, the University of Virginia began the Valley of the Shadow project in history, and then, in 1993, came the William Blake Archive in literary studies. For both projects, the project teams undertook the collection, digitization, transcription, and encoding of the materials themselves. The Valley team, for example, “began the seemingly endless task of collecting, transcribing, and converting original source material into computer readable files with a few hours of work-study graduate students,” after which graduate student Anne Rubin led the conversion of the transcribed plain text into standardized general markup language (“The Story Behind”). On both projects’ sites, digitization labor is described not just as arduous but also as worthwhile scholarly labor. Across all three of these early DH projects, the digitization of analog documents was considered scholarship.
Thanks in part to the mass digitization of book data, a growing number of studies began to leverage computational methods to study literary history at scales that were previously unimaginable. Computer-aided text analysis is a practice dating back to at least 1945, while the reading of texts at macroscopic levels developed independently of computational humanities (Underwood). However, as Ted Underwood suggests, these two fields of macroscopic literary analysis and humanities computing only began to merge around the 2000s with Franco Moretti’s publication of “Conjectures on World Literature” (Underwood). Computationally driven literary history at scale picked up by the 2010s with the publication of Graphs, Maps, Trees (2005), Reading Machines: Toward an Algorithmic Criticism (2011), and Macroanalysis (2013). All three of these monographs, but especially Moretti’s, are marked by a “view of literary data as factual and transparent,” which Katherine Bode attributes to “Moretti’s lack of interest in the scholarly infrastructure that enables his analyses” (Bode 21). Here, Bode is referring to a lack of attention to the processes of collection that determine which books are present in academic libraries in the first place. But also notably absent in all of these texts is any acknowledgement of digitization labor of any kind. Indeed, while DH projects of the 1990s used digitization as a primary method (a kind of scholarly labor to be praised), in these distant reading projects digitization is not scholarly labor or labor of any kind. It is not mentioned at all.
The rise of distant reading, fueled by corporate-backed, outsourced mass digitization, coincided with the invisibilization of digitization labor and the subsequent treatment of digitized texts as surrogates in hegemonic digital scholarship. This shift had less to do with a neglect of scholarly infrastructure, as previous scholars have theorized, and more to do with the modular design of the scalable systems that mass digitization projects demanded. For TEI-driven projects, digitization labor was oftentimes performed in academic libraries by library staff who were deeply involved in the projects themselves. In the case of the Edmund Spenser Digital Edition project (which began in 1998), scholars worked closely with archivists across the world to identify all extant copies of first editions of Spenser’s works. By the time scanning work began, the principal investigators had already built relationships with people at the institutions and in the libraries that housed the documents to be digitized. In contrast, for his 2019 Distant Horizons, Ted Underwood analyzes texts from the HathiTrust corpus. The codices digitized in the HathiTrust project came from a number of libraries across the US. Almost all were digitized via the Google Books program, which agreed to digitize academic libraries’ collections in outsourced, undisclosed locations under nondisclosure agreements. Underwood, unlike the Spenser team, worked with material already digitized in processes that HathiTrust’s metadata fields and Google’s NDAs make unknowable. It is not just that distant readers like Underwood neglect scholarly infrastructure; it is that the production of culture at scale obscures infrastructure entirely.
Data archaeology is a method that aims to excavate the infrastructure that digitizations and databases obscure. In her 2014 article on the history of EEBO, Bonnie Mak introduces archaeology as a methodology for elucidating “the discursive practices by which digitizations are produced, circulated, and received” (Mak 1515). While Mak acknowledges, and in moments details, that the digitizations in EEBO’s corpus are the product of historical infrastructure and human labor, she is primarily concerned with how these arrangements imbue the digitized text with meanings that ultimately reshape our understanding of the past. Since her 2014 article, other scholars have used archaeology to study the infrastructure, discourses, and politics that produce digital objects (Cordell, “Q i-Jtb”; Fyfe; Lee). But like Mak, these scholars are ultimately more concerned with data and objects than with the people most materially impacted by their creation.
More recently, the digital humanist Safiya Noble has insisted on thinking critically about how DH infrastructure is complicit in systems of global racial capitalism. In her contribution to the 2019 Debates in the Digital Humanities, “Toward a Critical Black Digital Humanities,” Noble observes that digital humanists have become increasingly interested in digitizing the cultural productions of communities that have been excluded from the field without using digital tools to dismantle the systems that perpetuate these communities’ systematic exclusion. And yet, Noble notes,
We can no longer deny that digital tools and projects are implicated in the rise in global inequality, because digital systems are reliant on global racialized labor exploitation. We can no longer pretend that digital infrastructures are not linked to crises like global warming and impending ecological disasters. We cannot deny that the silences of the field in addressing systemic state violence against Black lives are palpable. Critical digital humanities must closely align with the register in which critical interventions can occur. (Noble)
The critical DH that Noble calls for “foreground[s] a recognition of the superstructures that overdetermine the computerization and informationalization of DH projects, so that those projects can intervene in instances of racial, economic, and political oppression.” Put differently, critical DH goes beyond recognizing its complicity; it seeks to use our methods to dismantle and reimagine the exclusionary systems that make our work possible in the first place. Since Noble’s 2019 intervention, some digital humanists have turned their attention to the extractive infrastructure that enables DH research and tools (Klein), but we have yet to explore how to use digital tools to undermine and even topple extractive DH infrastructure. Scanning at Scale leverages computational data analysis and visualization to read IA book scanning records against the Internet Archive’s bills of lading in order to locate the transformative relationships between the nonscalable elements of IA’s scanning infrastructure, relationships that make scanning at scale possible but that are necessarily concealed.
Methods
To unearth transformative relationships between books, workers, machines, and outsourced scanning centers, this project reads a scan-date dataset and a bills of lading dataset against each other and uses speculative bibliography to hypothesize about relationships between them. This project uses a different Internet Archive book dataset than the Scanning Labor project did, which pulled its data directly from the Open Library’s open data dumps. The Open Library dataset used for the Scanning Labor project contained 2.5 million books across all 64 scanning centers; the Internet Archive’s database, however, contains 37 million books and texts. I needed more accurate counts of the number of books being digitized per month at the four outsourced centers. To obtain them, I queried the Internet Archive’s server directly and downloaded all the scanning dates associated with records at the four centers of interest. For the Hong Kong, Shenzhen, and China centers, the number of records returned corresponds with the number of books scanned at those centers, meaning that all the content digitized at these three centers is accounted for in my dataset. However, the Cebu center returned quite a few records without any scan date, meaning the Cebu dataset is still incomplete. In total, the dataset consists of scanning dates associated with 4.3 million book records digitized across the four scanning centers. I transformed this data from lists of scan dates to counts of books scanned per center per month. See this GitHub repository for detailed information on how I collected and processed the data; a simplified sketch of the collection and aggregation steps follows the table and chart below.
Scanning Center Scans Per Month Dataset
| Scanning center | Records |
| --- | --- |
| cebu | 3,389,045 |
| hongkong | 480,786 |
| china | 81,532 |
| shenzhen | 402,417 |
| Total | 4,353,780 |
Scans per Center Per Month
red = Datum Data Co.; blue = Hong Kong; green = Innodata
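As referenced above, the following is a minimal sketch, not the project’s actual scripts, of how scan dates for one center might be pulled and reduced to monthly counts. It assumes IA’s public advancedsearch.php endpoint and the scandate and scanningcenter metadata fields; the linked GitHub repository contains the code actually used.

```python
# Sketch: pull scan dates for one scanning center from IA's public search
# endpoint, then collapse them into counts of books scanned per month.
import requests
import pandas as pd

def fetch_scandates(center: str, rows: int = 10000) -> list[str]:
    """Page through advancedsearch results for one scanning center."""
    dates, page = [], 1
    while True:
        resp = requests.get(
            "https://archive.org/advancedsearch.php",
            params={
                "q": f"scanningcenter:{center}",
                "fl[]": "scandate",
                "rows": rows,
                "page": page,
                "output": "json",
            },
            timeout=120,
        )
        docs = resp.json()["response"]["docs"]
        if not docs:
            break
        dates.extend(d["scandate"] for d in docs if "scandate" in d)
        page += 1
    return dates

def scans_per_month(dates: list[str], center: str) -> pd.Series:
    """Collapse raw scan dates (e.g. '20120714...') into monthly counts."""
    s = pd.to_datetime(pd.Series(dates).str[:8], format="%Y%m%d", errors="coerce")
    return s.dt.to_period("M").value_counts().sort_index().rename(center)

# Example: counts = scans_per_month(fetch_scandates("shenzhen"), "shenzhen")
```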
Scan dates alone do not indicate much about the places where the books were scanned or how they got there in the first place. For more granular information, I turn to bills of lading (BoLs). A BoL records the entry of foreign goods into the US through US customs. Each BoL includes detailed information on the supplier of the shipped goods (company name, address, latitude, longitude), the receiver of the goods, the types of goods, the number of containers, and the weight of the total shipment. This information makes BoLs ideal documents for those interested in scalable systems from both a scholarly and a corporate perspective. Supply chain engineers who work for multinational corporations use BoLs to inform decisions about where to source suppliers of goods to enhance the efficiency and profitability of a company’s global supply chain. While BoLs can be obtained directly from any US port via FOIA request, this process is expensive. Because of this, proprietary data vendors are the major source of BoL data. Data vendors are able to charge exorbitant prices for these records because multinational corporations can easily afford to pay them. Also, because of their primarily corporate use case, data vendors rarely index historic BoLs, and the available records are usually less detailed than their more contemporary counterparts.
The BoLs for this project came from Import Yeti, which provides users with some free import data and offers premium subscriptions for those interested in bulk downloading records. At the time I purchased a premium subscription for a month, it cost $100. Import Yeti’s data comes directly from US ports via FOIA request, but it only dates back to 2015. To supplement Import Yeti’s data, I also acquired import records from S&P Global’s Panjiva database thanks to a collaborator at an institution with a Panjiva subscription. Panjiva does not contain true BoLs because it omits information about vessel names and ports, but it does date back to 2011. I deduplicated Panjiva and Import Yeti records based on arrival date and container weight, winding up with a total of 83 unique BoLs from 2011 to 2023. The scripts used for deduplication and the CSV file tracking manual deduplication decisions are all available via my GitHub; a simplified sketch of the approach follows below. For this project, I am especially interested in the HS code, the supplier name and location, and the weight of goods shipped.
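The sketch below simplifies the deduplication logic described above; the column names (arrival_date, weight_kg) are hypothetical stand-ins for the vendors’ actual export formats, and the repository holds the scripts and the manual decision log actually used.

```python
# Simplified sketch of merging Import Yeti and Panjiva records and flagging
# probable duplicates on arrival date + container weight (column names here
# are hypothetical placeholders, not the vendors' real field names).
import pandas as pd

def merge_bols(importyeti_csv: str, panjiva_csv: str) -> pd.DataFrame:
    iy = pd.read_csv(importyeti_csv)
    pj = pd.read_csv(panjiva_csv)
    iy["source"], pj["source"] = "importyeti", "panjiva"
    bols = pd.concat([iy, pj], ignore_index=True)

    # Normalize the matching keys before comparing records.
    bols["arrival_date"] = pd.to_datetime(bols["arrival_date"])
    bols["weight_kg"] = bols["weight_kg"].round(0)

    # Flag rather than drop, so ambiguous cases can be reviewed by hand.
    bols["possible_duplicate"] = bols.duplicated(
        subset=["arrival_date", "weight_kg"], keep="first"
    )
    return bols
```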
To excavate the relationships between the scanning dataset and the BoL dataset, I conduct a speculative bibliographic data visualization. Speculative bibliography, a term coined by Ryan Cordell, is a computational method for leveraging digital archives’ affordances while making visible the connections their architecture may obscure. Broadly, bibliography is the study of texts: their creation, dissemination, preservation, and loss. Bibliographers use classification methods to elucidate connections between texts. According to Cordell, “speculative bibliography comprises computational and probabilistic methods that map relationships among documents, that sort and organise the digital archive into related sets.” In this case, I am interested in a relationship between ‘texts’ (bills of lading and scanning records) that bear witness to the human beings who actually transported the books to the scanning center, scanned them, and then returned them to where they came from. The relationship between texts that I am interested in is merely a proxy for a material-spatial relationship between human workers and the scanning infrastructure in which they play a crucial role. To realize these probabilistic connections, I rely on visualizing the datasets together. My data visualization programs can be understood as speculative bibliography in the sense that “a program becomes speculative bibliography when its action operationalises a theory of textual relationship” (Cordell, “Speculative Bibliography” 524). To operationalize my theory of texts, I rely on data visualizations instead of the text-based algorithms Cordell uses in his article. This involves plotting seemingly disparate datasets together to visualize potential connections between shipping and scanning activities, as in the sketch below.
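The following is a minimal sketch of that plotting move, assuming the monthly counts and merged BoL table from the earlier sketches: scan counts drawn as a line, shipment arrivals overlaid as vertical lines scaled by weight. The interactive visualizations embedded in this essay were built separately; this only illustrates the principle.

```python
# Sketch: overlay BoL arrival dates (vertical lines, thickness ~ weight) on a
# center's monthly scan counts to surface possible shipping/scanning rhythms.
import matplotlib.pyplot as plt
import pandas as pd

def plot_center(counts: pd.Series, bols: pd.DataFrame, center: str) -> None:
    """`counts` is a PeriodIndex series of books scanned per month;
    `bols` has hypothetical arrival_date and weight_kg columns."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(counts.index.to_timestamp(), counts.values,
            label=f"{center}: books scanned per month")
    for _, row in bols.iterrows():
        ax.axvline(row["arrival_date"], alpha=0.4,
                   linewidth=max(0.5, row["weight_kg"] / 20000))
    ax.set_xlabel("Month")
    ax.set_ylabel("Books scanned")
    ax.legend()
    plt.show()
```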
Through visualizing probabilistic connections between texts and imagining the material-spatial relationships they may represent, I was able to discover four separate scanning centers that the Internet Archive has operated directly (or via contracted labor) overseas from 2011 to the present. Through analyzing HS codes, I found that IA has received shipments of books from five locations and four suppliers. One of these suppliers is Better World Books, which sells used books at a low cost. I speculate that the Better World Books location is not a scanning center and is, therefore, out of scope. The other four locations correspond with three suppliers that operated scanning centers at four separate sites over time: Datum Data Co., responsible for the Shenzhen and China scanning centers; Innodata Knowledge Services, responsible for the Cebu center; and Internet Archive China, a scanning center that the Internet Archive set up without a BPO partner and that maps to the Hong Kong scanning center. In what follows, I use visualizations to further explore these centers.
Shipments Between IA and its Suppliers
Click on items in the legend to visualize each connection individually
Outsourced Scanning Centers
Datum Data Co. Ltd.
On July 22, 2011, an 8,200 kg shipping container carrying 420 boxes of books arrived in San Francisco, California. For the thousands of books within the container, this was at least their second journey across the Pacific Ocean. Months earlier, they had probably left this port for China to be digitized. Their destination: a business process outsourcing firm, Datum Data Co., in building number 2 of Wanli Industrial Park in Shenzhen. Workers there turned every single page of every single book and photographed them. After carefully re-processing the photographs of the pages to ensure they all appeared straight and flat, workers uploaded a digital facsimile of each book from a computer in Shenzhen to a server in San Francisco. Now, back in California, the codices were reunited with their digital counterparts in a former church at 300 Funston Ave. These 420 boxes of books were likely not the first to return to IA’s headquarters from Datum Data’s Shenzhen scanning center. However, due to the limited nature of historic bills of lading data, this is the first shipment between the two parties of which I am aware. Likewise, as far as I know based on BoLs, scanning center names, and IA’s 990s, the Datum Data Co. scanning center in Shenzhen represents the Internet Archive’s first attempt at outsourcing book scanning labor.
Datum Data Co. Scanning Center Scans per Month & Shipments
Datum Data Co. was founded in Shenzhen, China in 1996. Initially, the data processing company offered image scanning and OCR services (“Image Capture”). The company’s partnership with the Internet Archive probably began in July 2009, when a scanning center in IA’s data called ‘shenzhen’ uploaded seven books over the course of the month. Later records from the shenzhen center list “Datum Data” in the partner field. Moreover, videos tagged with the ‘shenzhen’ scanning center show a scanning center run in partnership with Datum Data Co. (Miller, “shenzhenscann2011”). Scanning activity at the center picked up in June 2010, after the Internet Archive rented an industrial warehouse facility near the Port of Oakland. Robert Miller, IA’s director of global book digitization, uploaded a video touring an Oakland warehouse at 7001 San Leandro Blvd. in April 2010. In it, he discusses storing shipping containers of books weighing more than 5 tons in the facility (Miller, “archivewarehouse”). Equipped with an overseas partner and industrial space to store hundreds of thousands of books, IA began digitizing books en masse.
From 2010 through 2016, the vast majority of books added to IA were digitized at the Shenzhen center. A 2011 video tour of the facility reveals that it was furnished with 12 scribe machines and 8 reloading stations, all housed in two small rooms (Miller, “shenzhenscann2011”). Scanning activity reached an all-time high at the center in July 2012, when workers scanned 24,487 books in a single month. After this, scanning activity continued but never surpassed that pace. The last month in which scans from the center were recorded is December 2015, with 3,243 books. In total, the Shenzhen center located in the Wanli Industrial Park building scanned 402,417 books. After 2016, as scanning activity declined in Shenzhen, it was displaced by the Hong Kong center. While Datum Data still scans books on behalf of the Internet Archive at another Shenzhen location, the vast majority of these are from the Chinese popular books project. These books seem to be sourced from China, not the US, so shipments between Datum Data and IA never picked back up to pre-2016 levels. The partnership with Datum Data at this separate location is ongoing, and to date workers there have scanned 81,532 books.
A tour of the Shenzhen center in 2011
A tour of the Oakland warehouse in 2010
Why the Shenzhen center shut down in December 2015 is not immediately clear, but it may correspond with the departure of Robert Miller, the director of the book scanning program. From 2005 to 2015, Miller worked at the Internet Archive as the director of the book program, joining IA “to help create a mass movement of libraries bringing themselves digital by scanning books, microfilm, and other media” (Kahle). During his tenure, Miller created partnerships with over 30 libraries across the world, and as a result IA and its partners digitized 2.5 million books and texts (Kahle).
The partnerships that Miller created hinged upon personal connections and, as such, represent nonscalable forms that IA could not replicate in his absence. Unlike scalable forms, nonscalable forms cannot simply be multiplied without change. IA could not merely open another scanning center using the same model as Miller’s because the success of these programs, Shenzhen included, was tied to Miller’s personal connections. Miller uploaded photographs and videos documenting his many trips to China to set up, train workers at, and check in on the Shenzhen center. Among these are a series of videos of a woman teaching Miller Mandarin to improve his communication with his Datum Data partners, videos of a celebratory dinner with Datum Data’s CEO and staff, footage of Miller touring the city of Shenzhen and trying local cuisine, and images and videos of the scanning center itself. In these videos of the scanning center, Miller refers to scanning workers by name. While he refers to the center as the ‘Shenzhen center,’ more often than not in his narration he calls it “Ken’s center,” in reference to the Datum Data Co. staff member overseeing the center’s day-to-day operations.
Nonscalable systems, of course, are not necessarily equitable. The Shenzhen center that Miller set up is no exception. IA’s scribe machines were still non-ergonomic, labor in Shenzhen was still cheaper than in the US (which probably led to the decision to outsource in the first place), and the environmental toll of shipping hundreds of thousands of books across the Pacific merely to be able to pay someone less than a living wage is inexcusable. But perhaps the Shenzhen center, IA’s first attempt at outsourcing, is not quite representative of the scalable system that its scanning program would eventually become. Indeed, the 402,417 books digitized in Shenzhen over five years, and the 2.5 million digitized under Miller’s ten-year tenure, are extraordinary figures, but they pale in comparison to the 5 million digitized in Cebu in just five years.
Hong Kong Scanning Center
After scanning activity at the Shenzhen center declined in 2015 and Robert Miller moved on, activity at a scanning center called “hongkong” surpassed that at all others. A center called ‘hongkong’ appears in the Internet Archive’s data well before the Shenzhen center’s closure: books associated with that center were scanned first in August 2009 (one book) and then again in summer 2013 (300 books). But May 2016 marked the start of sustained scanning at the center. Nine months later, the first recorded shipment of books from a scanning center at the Veristrong Industrial Center in Hong Kong entered the US via the Port of Oakland on February 12, 2017. The 10,390 kg shipping container carried 480 boxes of books loaded on 24 pallets, all of which would be transported to a Richmond, California warehouse for storage.
Hong Kong Scanning Center Scans per Month & Shipments
Unlike its Shenzhen predecessor, the Hong Kong scanning center was not affiliated with any business process outsourcing firm. No business process outsourcing firm appears on any of IA’s 990s from the period (aside from the already accounted-for Datum Data), and the bills of lading associated with this Hong Kong location list IA as both the supplier and the receiver of goods. While this could be a coincidence, anomalies in the metadata suggest that IA operated this scanning center on its own behalf. IA included a shipping_container field for some of the books scanned at this center during the first year and a half of sustained scanning activity (May 2016 through October 2017). In addition to tracking the shipping container, IA also tracked the donors of the books during the same period.
Hong Kong Scanning Center Scans per Month & Known Donors
Boston Public Library donated 62,692 books scanned at the center, accounting for the vast majority of the donations from this period. By 2016, Boston Public Library already had a longstanding relationship with the Internet Archive and an onsite scanning center. Other book donors include libraries that, like Boston, already had in-house IA scanning centers, such as Allen County and the University of North Carolina. Other donors, such as the Marin County Free Library, had also worked with IA to digitize books before the Hong Kong center’s creation, but their books were mostly scanned offsite. In the case of Marin County, workers scanned some of its books at the Shenzhen center and at the San Francisco downtown regional scanning center. Aside from public libraries (which make up the vast majority of donors), the used book vendors Better World Books and Alibris donated some books for digitization. The Internet Archive also accepted donations from individuals, which it shipped to the Hong Kong center for digitization.
The decisions of libraries that already had in-house centers to donate books for digitization overseas are especially perplexing. Through conversations with Internet Archive staff at the University of Illinois in 2022, we know that in-house IA scanning centers charge partner libraries 10 cents per book digitized. Assuming this rate was the same in 2016, Boston Public Library could have used its in-house center to digitize the 62,692 books for $6,269.20. However, these books, likely weeded from their collections, were probably not deemed worth that price of digitization. Instead, librarians decided to donate the books to the Internet Archive for digitization over 12,000 miles away.
After October 2017, IA ceased tracking donor information as consistently, but scanning activity continued and even increased for another year. Then the Hong Kong center stopped scanning suddenly. Workers scanned the last books at the location in December 2018; in total, workers at the Hong Kong center digitized 480,786 books over its brief three years in existence. In February 2019, the final shipment from the location arrived in the US. It contained scanning equipment as well as books, indicating that the center had shut down. Why the Hong Kong center was started in the first place, when IA already had an established relationship with Datum Data and held partnerships with donating libraries, and why it shut down so suddenly remain unknown. The bills of lading and scanning data alone simply cannot tell us this story. However, by connecting the two data sources, I was able to uncover an organization of scanning labor that the Scanning Labor project had previously neglected and to identify new avenues of analysis.
Innodata Knowledge Services
The Internet Archive replaced the Hong Kong center with a new scanning center located in Mandaue City, Philippines. Operated in partnership with a US-owned business process outsourcing firm, Innodata Knowledge Services, the Philippines-based scanning center is responsible for digitizing the vast majority of the Internet Archive’s text archive today. Both the weight of shipments from the Innodata center and the rate of books scanned far outpace those of the Hong Kong and Shenzhen centers. From the bills of lading and scan date data alone, it is not possible to glean much more about this scanning center. The Scanning Labor team has already identified this center and researched it extensively. From that work, we know that the average productivity of workers at this center is far greater than that of workers at any other center. Likewise, turnover is higher on average, and the Internet Archive rarely acknowledges this center.
Innodata Scanning Center Scans per Month & Shipments
However, the bills of lading data do suggest that the Internet Archive had to reorganize its warehousing infrastructure to cope with the rate and volume of books scanned at the center. The first shipment from the Innodata center entered the US via the Port of Oakland on March 25, 2019: a single shipping container weighing 19,990 kg. Subsequent shipments are of a similar weight and occur on a monthly, sometimes fortnightly, basis. About a year after the center opened, in May 2020, the Internet Archive stopped importing books from the Philippines via the Port of Oakland. Instead, shipments arrive via Baltimore, Savannah, and Newark/New York. At the same time, the Internet Archive address listed on its bills of lading changes from the Richmond, CA warehouse on Carlson Street to a Latrobe, Pennsylvania warehouse later called “Open Library Headquarters.” I am not sure why the Internet Archive started storing books in Pennsylvania instead of Richmond. Perhaps it was cheaper to rent a facility in Pennsylvania than anywhere in California. Or perhaps the relocation was in part informed by the COVID-19 pandemic and its supply chain disruptions.
The rate and weight of shipments shifted in 2021. There are no bills of lading recording shipments between IA and the Innodata-run scanning center from August 28, 2021 to December 2022, even though workers continued to scan at rapid rates during this period. The shipments from 2023 are significantly heavier than their earlier counterparts, ranging from 50,000 to 100,000 kg. This may indicate that Innodata or the Internet Archive rented a warehouse for storage in Mandaue City to reduce the frequency and cost of shipping.
The scanning center in the Philippines remains in operation, and so far workers have scanned over 5 million books.
Next Steps
There is much work to do on this project going forward. Thanks to a small grant from, and the opportunity to publish with, the Manifest project, the next major phase of this project will be to create a geospatial visualization of the bills of lading data. As part of this, I anticipate scraping more metadata, analyzing donor and genre data, supplementing the bills of lading with export information, and, most importantly, finding ways to center the human and environmental costs of mass digitization in the visualization we produce.
Currently, my IA data from the Shenzhen and Cebu centers consists exclusively of scan dates. For Hong Kong, I have both the scan date and donor information for every book. However, I have not scraped any book metadata, sponsor information, or data about worker names and scribe machines, due to the large size of the datasets and IA’s API throwing errors. Going forward, I will scrape all the IA ids of the books in each scanning center’s collection and develop a Python program to automatically query each record based on its id field. Handling each book separately should reduce timeout errors; a sketch of this approach follows below. Developing a local SQL database may be necessary to access and manage this amount of data. From this scraped data, I would like to analyze the donor fields of all scraped books to see whether any books from the other centers have donor information recorded. Following that, I will geocode all donor data. Where donor data is not available, studying genre and publication information may roughly indicate the kinds of books digitized at a center and where they may have come from.
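A sketch of what that per-record scraper might look like, assuming IA’s public metadata endpoint (archive.org/metadata/{identifier}) and a local SQLite store; the identifiers would come from a prior scrape of each center’s collection, and the real implementation will need to respect IA’s rate limits.

```python
# Sketch: fetch one item at a time from IA's public metadata endpoint, retry
# on timeouts, and persist results to SQLite so interrupted runs can resume.
import json
import sqlite3
import time
import requests

def scrape_items(identifiers: list[str], db_path: str = "ia_metadata.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS items (identifier TEXT PRIMARY KEY, metadata TEXT)")
    for ia_id in identifiers:
        # Skip records already fetched in an earlier run.
        if conn.execute("SELECT 1 FROM items WHERE identifier = ?", (ia_id,)).fetchone():
            continue
        for attempt in range(3):
            try:
                resp = requests.get(f"https://archive.org/metadata/{ia_id}", timeout=60)
                resp.raise_for_status()
                conn.execute(
                    "INSERT INTO items VALUES (?, ?)",
                    (ia_id, json.dumps(resp.json().get("metadata", {}))),
                )
                conn.commit()
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)  # back off before retrying
    conn.close()
```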
A major flaw of the current project is that I am only using import data. I would like first to look into accessing S&P Global’s PIERS database, which has some international import data. Second, if I cannot find import records for China, Hong Kong, and the Philippines, I will develop a logistic regression model to predict when a shipment of books might have arrived at a scanning center based on the weight of the previous shipment and the rate of scanning activity (sketched below). Finally, using the supplemented metadata, I would like to calculate the number of workers who worked at each scanning center, the average turnover rate, and related measures using the methods Lucian developed for the Scanning Labor project. We will also consider translating the survey and distributing it via social media websites commonly used in China and the Philippines. Also using the bills of lading, I would like to calculate the monetary and environmental costs of IA’s mass digitization projects.
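A hedged sketch of how such a model might be set up with scikit-learn, treating each center-month as an observation; the features and label here (prev_shipment_kg, books_scanned, shipment_arrived) are hypothetical stand-ins for data I have not yet assembled.

```python
# Sketch: predict whether a shipment arrived in a given center-month from the
# previous shipment's weight and that month's scanning rate. Column names are
# hypothetical; the actual feature construction is future work.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_shipment_model(months: pd.DataFrame) -> LogisticRegression:
    """`months` is expected to have columns: prev_shipment_kg,
    books_scanned, and a 0/1 label shipment_arrived."""
    X = months[["prev_shipment_kg", "books_scanned"]]
    y = months["shipment_arrived"]
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

# Example: model.predict_proba(new_months[["prev_shipment_kg", "books_scanned"]])
```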
Conclusion
At the beginning of this semester, David, Lucian, and I met to discuss future directions for the Scanning Labor project and how we hoped our individual projects for Culture at Scale might complement them. We wanted to create something similar to Kate Crawford and Vladan Joler’s “Anatomy of an AI System” but for digital humanities projects that use IA-digitized book data. Lucian would focus on identifying which DH projects used IA data, how they used it, and how to identify the books those projects used. I would use bills of lading data to sketch out the infrastructure undergirding the outsourced scanning centers we had identified the previous semester. David planned to analyze IA’s blog, as well as how academics described IA in articles, in order to situate our Anatomy of a DH Project within its discursive context.
I cannot speak for David and Lucian, but my part of this project has, in many ways, failed. It is simply not possible, with the two datasets I have, to trace the creation of a particular IA digitization from the library where the codex was selected, to IA’s warehouse, to shipping container, to port, to scanning center, to scanning worker, to scribe machine, and back to an IA warehouse for storage. My data sources obscure these connections and, with them, the human workers who created the digitizations, their working conditions, and the environmental implications of digitizing en masse. I have tried to make visible the spatial and material realities I can account for. I name contractors, cities, the weights of cargo containers, and the locations of scanning centers. Even still, I cannot say much about the connections between these, the human workers, and the books themselves with any degree of certainty.
And yet I believe that speculation, and the uncertainty it produces, is in and of itself valuable. Oftentimes we as digital humanists consider something a valuable contribution to scholarship when an author is able to make a claim to certainty, or at least achieve a high level of confidence. Speculation precludes certainty and holds space for absence, while acknowledging that absence within our data is not coincidental; it is produced. The transformative relationships I wanted to recover between worker, machine, book, and scanning center are inevitably probabilistic and, ultimately, unknowable with data alone. And yet acknowledging the gaps that culture at scale has produced, gaps that data analysis alone cannot fill, is worthwhile scholarship. Indeed, telling only the stories we can be certain of risks reifying the hegemony of the dataset and further marginalizing the people whose stories it did not, could not, capture. Speculative data analysis displaces the valuation of confidence intervals with care: care for human workers, care for our planet, and care for our communities.
Works Cited
Archivewarehouseoakland. Directed by Robert Miller, 2010. Internet Archive, http://archive.org/details/archivewarehouseoakland.
Bode, Katherine. “Abstraction, Singularity, Textuality: The Equivalence of ‘Close’ and ‘Distant’ Reading.” A World of Fiction: Digital Collections and the Future of Literary History, University of Michigan Press, 2018, pp. 17–36, https://www.jstor.org/stable/j.ctvdtpj1d.5.
Cordell, Ryan. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History, vol. 20, 2017, pp. 188–225, https://doi.org/10.1353/bh.2017.0006.
Cordell, Ryan. “Speculative Bibliography.” Anglia, vol. 138, no. 3, Sept. 2020, pp. 519–31, https://doi.org/10.1515/ang-2020-0041.
Coyle, Karen. “Mass Digitization of Books.” The Journal of Academic Librarianship, vol. 32, no. 6, Nov. 2006, pp. 641–45. https://doi.org/10.1016/j.acalib.2006.08.002.
Crawford, Kate, and Vladan Joler. “Anatomy of an AI System.” Anatomy of an AI System, 2018, http://www.anatomyof.ai.
Fyfe, Paul. “An Archaeology of Victorian Newspapers.” Victorian Periodicals Review, vol. 49, no. 4, 2016, pp. 546–77.
Guthrie, Kevin M. “JSTOR: Large Scale Digitization of Journals in the United States.” LIBER Quarterly: The Journal of the Association of European Research Libraries, vol. 9, no. 3, May 1999, pp. 291–97, https://doi.org/10.18352/lq.7546.
Hockey, Susan. “The History of Humanities Computing.” A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, Blackwell, 2004, https://companions.digitalhumanities.org/DH/.
“Image Processing.” Datum Data Co., Ltd: Data Entry, Data Capture, Data Processing, http://www.datumdata.com/ImageCapture.aspx. Accessed 15 May 2023.
Kahle, Brewster. “Thank You, Robert Miller, for 2.5 Million Books for Free Public Access.” Internet Archive Blogs, 8 May 2015, https://blog.archive.org/2015/05/08/thank-you-robert-miller-for-2-5-million-books-for-free-public-access/.
Kaplan, Jeff. “Books Scanning to Be Publicly Funded.” Internet Archive Blogs, 26 May 2008, https://blog.archive.org/2008/05/26/books-scanning-to-be-publicly-funded/.
Kaplan, Jeff. “Open-Access Text Archives.” Internet Archive Blogs, 15 Dec. 2004, https://blog.archive.org/2004/12/15/open-access-text-archives/.
Kenscanningcenter201210. Directed by Robert Miller, 2012. Internet Archive, http://archive.org/details/kenscanningcenter201210.
Klein, Lauren. “Are Large Language Models Our Limit Case?” Startwords, no. 3, Aug. 2022. Zenodo, https://doi.org/10.5281/zenodo.6567985.
“Launch of HathiTrust - October 13, 2008.” HathiTrust Digital Library, 13 Oct. 2008, https://www.hathitrust.org/press_10-13-2008.
Lee, Benjamin. “Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset.” Digital Humanities Quarterly, vol. 15, no. 4, 2021, http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html.
Loewenstein, Joseph, et al. The Spenser Archive Prototype. https://talus.artsci.wustl.edu/spenserArchivePrototype/. Accessed 15 May 2023.
Mak, Bonnie. “Archaeology of a Digitization.” Journal of the Association for Information Science and Technology, vol. 65, no. 8, 2014, pp. 1515–26. Wiley Online Library, https://doi.org/10.1002/asi.23061.
Noble, Safiya Umoja. “Toward a Critical Black Digital Humanities.” Debates in the Digital Humanities, University of Minnesota Press, 2019, https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/5aafe7fe-db7e-4ec1-935f-09d8028a2687.
“Plan of the Archive.” The William Blake Archive, https://www.blakearchive.org/staticpage/archiveataglance?p=planNEW. Accessed 15 May 2023.
Posner, Miriam. “See No Evil.” Logic Magazine, 1 Apr. 2018, https://logicmag.io/scale/see-no-evil/.
Quint, Barbara. “Microsoft Launches Book Digitization Project—MSN Book Search.” Info Today, 31 Oct. 2005, https://newsbreaks.infotoday.com/NewsBreaks/Microsoft-Launches-Book-Digitization-Project-MSN-Book-Search-16090.asp.
Reddy, Raj, and Gloriana St. Clair. “The Million Book Digital Library Project.” Raj Reddy, 2001, http://www.rr.cs.cmu.edu/mbdl.htm.
Schiffman, Betsy. “Microsoft Gives Up on Book Search.” Wired, May 2008, https://www.wired.com/2008/05/microsoft-cans/. Accessed 13 May 2023.
Shenzhenscann2011. Directed by Robert Miller, 2011. Internet Archive, http://archive.org/details/shenzhenscann2011.
Tsing, Anna Lowenhaupt. “On Nonscalability: The Living World Is Not Amenable to Precision-Nested Scales.” Common Knowledge, vol. 18, no. 3, Aug. 2012, pp. 505–24. Silverchair, https://doi.org/10.1215/0961754X-1630424.
Underwood, Ted. “A Genealogy of Distant Reading.” Digital Humanities Quarterly, vol. 11, no. 2, 2017, http://www.digitalhumanities.org/dhq/vol/11/2/000317/000317.html.
Underwood, Ted. Distant Horizons: Digital Evidence and Literary Change. The University of Chicago Press, 2019.
“The Valley of the Shadow: The Story Behind the Valley Project.” The Valley of the Shadow, https://valley.lib.virginia.edu/VoS/usingvalley/valleystory.html. Accessed 15 May 2023.