Notes on analytical engineering
Software and method for researching AI and Lifelong Learning
This content was originally posted on the Oxford Internet Institute blog. I’ve lightly edited it here to streamline for readability and reflect changing terminology in the field.
Beginning with ends in mind
Following our introductory post about AI and lifelong learning, we wanted to focus on the technology stack we’re using to conduct our research. We thought this topic made sense now for two reasons:
- to provide a single point of reference for anyone who wishes to engage with the software and code described in future publications; and
- to be transparent about, and reflect upon, the role of software in shaping the research process itself.
For example, we chose early on to incorporate machine learning techniques into our research: to address a research challenge we encountered, but also to gain firsthand experience using AI for augmenting human intelligence. If we’re going to assess claims about how AI can be used for learning, after all, it seems sensible for us to gain experience applying AI in this space ourselves!
Roadmapping the data analysis process
Again, we’re interested in mapping not only the breadth of discourse about AI and lifelong learning, but especially the underexplored relationships between these subjects. This makes for a lot of material to potentially review, much of it from related but distinct communities.
As a result, the object of our analysis came into focus quickly: documents such as journal articles, press releases, social media, and unstructured web content. Tools and methods for analyzing documents took more time and experimentation to develop. As of today, we identify three phases of analytical foci and corresponding tools:
- “Out of the box” tools for scientometric analysis of structured document metadata;
- a common Python natural language processing (NLP) pipeline for analyzing semi-structured document data; and
- a bespoke graph database platform (knowledge graph) for network analysis of unstructured document data.
We’ll share the output of these analyses per se in future posts. For the rest of this post, we focus on what tools we chose in each phase, how, and with what lessons learned.
To help organize each phase, I’ve found it helpful to distinguish between two sets of data management tasks:
- Data collection and storage: How are data gathered? Once downloaded or captured, how are they stored over time to support visualization and analysis? For example, should they be loaded into a database system, or held directly in memory as a Python object? How will I share them, protect them, and back them up?
- Data visualization and analysis: Anscombe’s quartet teaches us that data visualization belongs squarely in the middle of analysis efforts, not just at the end as a tool for communicating findings. So, what capabilities are available for visualizing and analyzing data throughout the process? How can these capabilities be adapted, modified, and recombined?
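To make the Anscombe’s quartet point concrete, here is a small sketch using only the Python standard library. It computes the same summary statistics for each of the four published datasets; the function and variable names are my own. The statistics come out nearly identical, even though the four scatterplots look nothing alike, which is exactly why plotting belongs inside the analysis loop.

```python
from statistics import mean

# Anscombe's quartet: four datasets of 11 (x, y) points each.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(quartet):
    """Mean of x, mean of y, and correlation for each dataset."""
    return {
        name: {"mean_x": mean(xs), "mean_y": mean(ys), "r": pearson_r(xs, ys)}
        for name, (xs, ys) in quartet.items()
    }


SUMMARY = summarize(QUARTET)
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```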
For each phase, I first present tools by sub-task, then discuss how the tools fit together within the phase. Every tool we used is free and/or open-source software and, with a little patience, can run on Windows, Mac, and Linux. I encourage you to check them out if you haven’t already!
Phase 1: Bibliometric data analysis of academic publications
We started with academic publications, which benefit from readily available data as well as interpretive standards established through scientometrics, the quantitative study of research.
Phase 1 data collection and storage tools
- Harzing’s Publish or Perish: Bulk data collection tool created by Anne-Wil Harzing, now in its 6th release, with support for Google Scholar, Microsoft Academic, and Crossref. Also helpful for calculating standard bibliometric scores, saving and comparing searches.
- JabRef: Bibliographic data management tool focused on managing entries as a BibTeX file. Features for deduplicating entries.
Phase 1 data visualization and analysis tools
- VOSviewer: Visualization tool for bibliometric networks as well as keyword/term network visualization. Makes it easy to start finding topical patterns based on abstracts. Some features dependent on file format of data input files.
Phase 1 tool discussion
We first targeted bulk data collection of bibliometric metadata, such as article and journal titles, authors, keywords, etc., using Harzing’s Publish or Perish, a bulk bibliographic metadata collection tool. The bibliographic data manager JabRef allowed us to merge data files from different sources into a single dataset using the BibTeX standard. We also used JabRef to de-duplicate entries as much as possible—a significant challenge given sometimes thousands of overlapping entries from multiple databases. Finally, VOSviewer allowed us to visualize patterns of usage of terms found in article abstracts.
Though these “out of the box” tools were quick and easy to get working, the approach presented three main drawbacks:
- Duplicate records. JabRef’s deduplication feature didn’t quite scale to tens of thousands of documents.
- Limited support for citation network analysis. The representation of citation data was too inconsistent and sparse across the dataset to allow us to use VOSviewer the way we hoped.
- Limited support for subsetting data. We wanted to be able to visualize subsets of data, but the process was labor-intensive with JabRef. We also wished for better preservation of provenance within the dataset, such as database and search term of origin.
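The deduplication and provenance concerns can be illustrated with a short sketch. This is not how JabRef works, and not the code we ran. It is a hypothetical, standard-library-only illustration in which the record fields (`title`, `doi`, `database`, `search_term`) and the sample titles are all made up. Duplicates are collapsed on DOI when present, otherwise on a normalized title, and every copy contributes its database and search term of origin.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_records(records):
    """Collapse duplicates, keeping the first copy plus the union of all provenance."""
    merged = {}
    for record in records:
        # Prefer the DOI as identity key; fall back to the normalized title.
        # (A copy with a DOI will not match a copy without one in this sketch.)
        doi = record.get("doi")
        key = doi.lower() if doi else normalize_title(record["title"])
        if key not in merged:
            kept = {k: v for k, v in record.items()
                    if k not in ("database", "search_term")}
            merged[key] = {**kept, "sources": set()}
        merged[key]["sources"].add((record["database"], record["search_term"]))
    return list(merged.values())


# Made-up records standing in for exports from two databases.
RECORDS = [
    {"title": "AI and Lifelong Learning: A Review",
     "database": "Google Scholar", "search_term": "AI lifelong learning"},
    {"title": "AI and lifelong learning - a review.",
     "database": "Crossref", "search_term": "artificial intelligence adult education"},
    {"title": "Machine Learning in Adult Education",
     "database": "Crossref", "search_term": "artificial intelligence adult education"},
]

DEDUPED = merge_records(RECORDS)
for entry in DEDUPED:
    print(entry["title"], sorted(entry["sources"]))
```

Because the lookup is a dictionary keyed on the identity string, the pass is linear in the number of records, and the `sources` set makes it possible to subset later by database or search term.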
In short, while Phase 1 proved out our basic approach for triangulating discourse, the actual process of downloading from multiple data sources, compiling into a monolithic file in JabRef, then exporting to VOSviewer was too error-prone and labor-intensive to be sustainable given our goals. For example, it left follow-up questions hard to answer: if distinct topic “networks” are emerging, what disciplines do the source articles and journals belong to? Do journal disciplines correspond to topic networks?
For Phase 2, we attempted to streamline these steps, moving from a process designed around a monolithic dataset and a single analysis step to a more iterative search process involving permutations of both the dataset itself and the analytical techniques applied to it.
Phase 2: Natural language analysis of academic publications
We turned to topic modeling and text classification techniques, paired with specialized visualizations and a bespoke “report generation” approach, to develop a more qualitative and expressive analysis of the topic space.
While we did reuse the BibTeX data standard from Phase 1, we shifted to using the Python 3 Anaconda Distribution to take advantage of multiple community packages dedicated to various sub-tasks in the phase. The tools below correspond to these packages.
Phase 2 data collection and storage tools
- Python – bibtexparser: Adds helper functions for importing and exporting BibTeX files.
- Python – Pandas: De facto standard for adding capabilities for working with tabular data using a “dataframe” concept. Includes helper functions for import, export, and manipulation, including filtering/subsetting of data.
- Python – NLTK: Natural Language Toolkit (NLTK) implements tasks in natural language processing at a fine level of granularity and control.
- Python – scikit-learn: Most popular “SciPy Toolkit” collecting production-class implementations of machine learning algorithms, wrapped in an elegant “pipeline” framework that allows for easy experimentation and configuration of ML workflows.
Phase 2 data visualization and analysis tools
- Python – seaborn: Data visualization library designed to streamline and extend the features of matplotlib, a common data visualization library for Python inspired by MATLAB.
- Python – python-docx: Toolkit for creating and editing Office Open XML (.docx) documents, AKA Microsoft Word documents.
- Python – pyLDAvis: Generates interactive HTML visualizations of latent Dirichlet allocation (LDA) topic models. A Python port of the R package LDAvis.
Phase 2 tool discussion
In our shift from article metrics to article abstracts, we wanted to ensure we could rapidly explore the parameter space of multiple dimensions in combination with each other:
- Article provenance, such as search term used to obtain the article
- NLP tasks (such as topic modeling) and approaches (such as distinct algorithms for performing topic modeling)
- Parameters of specific approaches (such as number of topics the topic modeling approach should create)
To achieve this, we designed a pipeline of processing steps in Python that allowed us to tweak parameters at multiple points in the pipeline and quickly assess impact using interactive and static outputs. We did this by first bringing in bibliographic data using bibtexparser, then converting it to a Pandas dataframe for further processing. For example, Pandas allowed us to apply regular expression matching to filter and subset data.
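As a rough illustration of the filtering step, the sketch below builds a dataframe from a list of dictionaries and subsets it with a regular expression. The entries are made up, and shaped like the list of dicts that bibtexparser 1.x exposes as `BibDatabase.entries`; the `filter_abstracts` helper and the search pattern are likewise hypothetical, not our actual code.

```python
import pandas as pd

# Made-up entries, shaped like bibtexparser 1.x output (`BibDatabase.entries`).
ENTRIES = [
    {"ENTRYTYPE": "article", "ID": "a1", "title": "Tutoring at scale",
     "abstract": "We survey machine learning tools for Lifelong Learning platforms."},
    {"ENTRYTYPE": "article", "ID": "a2", "title": "Policy perspectives",
     "abstract": "A study of adult education policy."},
    {"ENTRYTYPE": "article", "ID": "a3", "title": "Vision models",
     "abstract": "Deep networks for image classification."},
    {"ENTRYTYPE": "article", "ID": "a4", "title": "No abstract on file"},
]

df = pd.DataFrame(ENTRIES)  # the missing abstract becomes NaN


def filter_abstracts(frame, pattern):
    """Return the rows whose abstract matches a regular expression (case-insensitive)."""
    mask = frame["abstract"].str.contains(pattern, case=False, regex=True, na=False)
    return frame[mask]


SUBSET = filter_abstracts(df, r"lifelong\s+learning|adult\s+education")
print(SUBSET[["ID", "title"]])
```

Passing `na=False` treats entries without an abstract as non-matches rather than propagating missing values into the filter.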
After this filtering step, a “create topic report” function applies a series of transformations to the data before outputting an interactive topic model visualization using pyLDAvis, as well as a detailed Word document report (created with python-docx) for each topic containing the following:
- List of topics with descriptive statistics such as number of articles per topic, and distribution of topics per year, visualized using seaborn
- For each topic, listing of the top n articles within that topic group, including title, authors, journal, and abstract
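The two report sections above boil down to a pair of group-by operations. The following sketch shows one way to compute them with Pandas, assuming a hypothetical dataframe in which each article already carries its best-fit topic and a fit score. The column names and sample rows are invented, and rendering the results into Word (python-docx) or charts (seaborn) is left out.

```python
import pandas as pd

# Made-up topic model output: one row per article.
ARTICLES = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C",
              "Article D", "Article E", "Article F"],
    "year": [2016, 2017, 2017, 2018, 2018, 2018],
    "topic": [0, 0, 1, 1, 1, 0],
    "score": [0.91, 0.55, 0.80, 0.62, 0.97, 0.73],
})


def topic_summary(frame):
    """Number of articles per topic, and a year-by-topic table of article counts."""
    per_topic = frame.groupby("topic").size()
    per_year = pd.crosstab(frame["year"], frame["topic"])
    return per_topic, per_year


def top_articles(frame, n):
    """The n best-scoring articles within each topic, best first."""
    ranked = frame.sort_values(["topic", "score"], ascending=[True, False])
    return ranked.groupby("topic").head(n)


PER_TOPIC, PER_YEAR = topic_summary(ARTICLES)
TOP = top_articles(ARTICLES, n=2)
print(PER_TOPIC)
print(PER_YEAR)
print(TOP[["topic", "title", "score"]])
```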
To arrive at these report outputs, a number of NLP tasks are chained together. NLTK removes stopwords and lemmatizes article abstracts. Next, data is copied across two parallel processing flows using scikit-learn. For each flow, text data is converted into a matrix representation suitable for quantitative and statistical analysis by topic modeling algorithms. These representations included term frequency-inverse document frequency (tf–idf) to support latent Dirichlet allocation (LDA), as well as term frequency to support non-negative matrix factorization (NMF).
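A minimal sketch of the two parallel flows in scikit-learn might look like the following. Several things here are assumptions rather than our actual pipeline: the toy abstracts are invented, scikit-learn’s built-in English stop word list stands in for the NLTK preprocessing, and lemmatization is skipped. Note also that this sketch pairs raw term counts with LDA and tf–idf weights with NMF, as in scikit-learn’s own topic extraction example; swapping the two vectorizers is a one-line change if you prefer the pairing described above.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up abstracts standing in for the real corpus.
ABSTRACTS = [
    "Machine learning tutors adapt online courses to adult learners.",
    "Adult learners use online courses for workplace training.",
    "Workplace training programs adopt adaptive learning software.",
    "Neural networks classify images with convolutional layers.",
    "Convolutional networks improve image recognition accuracy.",
    "Image recognition models rely on deep neural networks.",
]


def fit_topic_models(texts, n_topics):
    """Run both flows and return one document-by-topic score matrix per model."""
    # Flow 1: term counts feeding LDA.
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda_scores = lda.fit_transform(counts)

    # Flow 2: tf-idf weights feeding NMF.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
    nmf_scores = nmf.fit_transform(tfidf)
    return lda_scores, nmf_scores


LDA_SCORES, NMF_SCORES = fit_topic_models(ABSTRACTS, n_topics=2)
print(LDA_SCORES.round(2))
print(NMF_SCORES.round(2))
```

Each row of either matrix holds one document’s scores across the topics, which is the input the report step needs to rank articles within a topic.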
For both LDA and NMF data flows, each document receives a score describing how well it fits within each of a given number of topic groups. Rather than settle on a single, “correct” number of topics, we wanted to explore the effect of varying the topic number to see what patterns emerged. Therefore, each time the pipeline is run with a given data input, it iterates across multiple values for topic number, creating a distinct report file for each topic count, e.g. reports for 3, 5, 7, 10, 20, and 40 topics. By reviewing reports across the parameter space of topic numbers, we could then isolate topic groups that were especially distinctive; persistent across topic numbers; or else irrelevant to our cause. This last category enabled us to prune irrelevant articles from our dataset, then iterate to analyze again.
This qualitative, exploratory approach to data analysis would not have been possible with out of the box tools. At the same time, while it helped us grasp the breadth of topical foci within the space, it was less clear how to understand the social situatedness of topics, articles, or journals. For this, we turned to a more ambitious software pipeline in our third phase.
Phase 3: Network and natural language analysis of social media
In Phase 3, we wanted to take advantage of recent advances on an NLP task known as “entity recognition” to build out a network analysis of activities occurring in social media related to AI and lifelong learning. That is, by collecting news articles, blogs and microblogs, and possibly the academic articles we had collected in Phases 1-2, we wanted to develop a semi-automated way of identifying what was being discussed in the various articles, as well as which social actors we could therefore infer were collaborating in some way. We envisioned a graph database system to serve as a knowledge base for tracking these insights, with a data collection and analysis system on top of this database to help populate it.
This phase is a work in progress, so this list is subject to change!
Phase 3 data collection and storage tools
- Neo4j (Community Edition): Graph database management system with a mature ecosystem of development tools, such as a Python driver (Py2neo); a dedicated query language (Cypher); and helper tools (Awesome Procedures On Cypher, AKA APOC). Readily available documentation.
- Graphileon Interactor (Community Edition): Visual interface for Neo4j that provides ad-hoc querying and data editing within a visual web interface.
- Python – feedparser: Extracts structured data from RSS, Atom, and other syndication feeds.
- Python – Newspaper3k: Extracts structured data from websites containing serialized data in a “news article” format. Complements feedparser.
Phase 3 data visualization and analysis tools
- Python – spaCy: Newer NLP framework with more streamlined and more advanced functionality than our previous combination of NLTK + scikit-learn. Ships with trained models that achieve state of the art performance in multiple NLP tasks. Good documentation.
- Cytoscape: Graph/Network visualization software. Interfaces with Neo4j.
Phase 3 tool discussion
For Phase 3, the vision is to use Python tools like feedparser and Newspaper3k to collect data from the open web. This will be stored within a Neo4j database in such a way as to preserve provenance (data source), represented as graph connections between data sources and documents. Using spaCy, we can then apply named entity recognition (NER) to identify named entities such as companies, products, and locations in article texts. These can also be represented as distinct “entity” graph nodes connected to document nodes. By analyzing within- as well as across-document mentions of specific products, companies, etc., we can identify a “collaboration network” within the space. Challenges related to pruning, enhancing, or creating data entries in the database are met by providing Graphileon Interactor to the general research team, since this tool provides ad-hoc querying as well as data creation, deletion, and editing capabilities. Finally, Cytoscape provides more advanced capabilities around visualizing and analyzing the graph itself.
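Since this phase is still in progress, the following is only a sketch of the “collaboration network” idea, using the standard library rather than Neo4j. It assumes entities have already been extracted from each document, for example from the `doc.ents` attribute that spaCy provides, and stored as (text, label) pairs. The document IDs, organization names, and the choice of which labels to keep are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up NER output: document id -> set of (entity text, entity label) pairs.
DOC_ENTITIES = {
    "doc-1": {("Acme Learning", "ORG"), ("Globex", "ORG"), ("London", "GPE")},
    "doc-2": {("Acme Learning", "ORG"), ("Globex", "ORG")},
    "doc-3": {("Globex", "ORG"), ("Initech", "ORG")},
}


def co_mention_edges(doc_entities, labels=("ORG", "PRODUCT")):
    """Count, for each pair of entities, how many documents mention both."""
    edges = Counter()
    for entities in doc_entities.values():
        # Keep only the entity types of interest; sort so each pair has one key.
        names = sorted({text for text, label in entities if label in labels})
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


EDGES = co_mention_edges(DOC_ENTITIES)
for (left, right), weight in EDGES.most_common():
    print(f"{left} -- {right}: {weight} shared document(s)")
```

In a graph database, each document and entity would be a node and each mention a relationship; the weighted pairs computed here correspond to the entity-to-entity edges one could derive by traversing shared documents.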
We are actively working on this phase and hope to dedicate a future blog post to its progress.
Reflecting on analytical engineering
Over the course of these three phases of research, it has struck me that although our choice of tools has always been led by research questions, so too have our questions been led by technical capabilities. This give and take of course also characterizes the application of AI for lifelong learning: certain tasks are becoming increasingly efficient and effective for computers to perform, but what are the “right” applications of AI techniques, on balance? (“Where is the knowledge we have lost in information?”)
For us, an important part of navigating this question has been to ensure we maintain a “human in the middle” approach to our use of machine learning and other computational techniques. By this, we mean more than just the “art” or pragmatic dimension of applying unsupervised learning techniques like topic clustering. Rather, we mean that qualitative checks like our Phase 2 topic reports were important to ensure, regardless of the technical performance of processing steps per se, that we had opportunities to leverage (and develop) our own intuition, creativity, and expertise in the space to make further decisions and conclusions. Though I risk anthropomorphizing AI by saying so, I’m inclined to characterize this as a “partnering with” relation to AI technologies, rather than a “hand off work to” relation.
As the project has evolved, we’ve also answered the concerns raised in this post by pursuing two other branches of activity: a more conventional literature review and synthesis of policy related to AI and lifelong learning, and a plan for case studies applying a more ethnographic approach to studying the use of AI for lifelong learning in situ. We hope you’ll join us as we present these and more over coming weeks and months.