Research Series: Publishing a report about doctor web

As we come to the end of the first phase of doctor web (data retrieval web engine: https://pypi.org/project/dr-web-engine/), the project is fully open source and published for end users on Docker Hub and the Python Package Index (PyPI). We want to share the motivations behind the project, the advantages and novelty it brings to the community, and how it compares with similar projects. For that purpose, we set out to write a report, which has been submitted as a pre-print on arxiv.org. We will iteratively improve it and submit it for peer review to suitable journals.

doctor-web-preprint

Literature review

We reviewed the literature from two angles: web queryability and web scraping.

The two concepts are fundamentally different. The former deals with the structure of web pages and with attempts to move gradually from the unstructured nature of hypertext to web pages designed on top of structures like RDF or its more modern evolution, JSON-LD. These technologies would solve the problem of web data queryability, since the structures are well defined and addressable with existing query languages (e.g. RDF query languages such as SPARQL, JSONPath, JSON Query). In practice, however, the problem remains largely unsolved: adoption of these technologies has stalled and has fallen short of the initial predictions of a move to structured data for web pages. Technologies aside, the content of proprietary web pages is seen as a valuable asset, and there is little motivation to make it more easily accessible at the cost of losing control of the data. On the other hand, offering access via APIs allows more control over the data and better monetisation opportunities, detaching the web content (what users see) from the data.
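To make the queryability point concrete, here is a minimal Python sketch of what happens when a page does embed JSON-LD: the data can be pulled out and addressed by path with no page-specific scraping logic. The sample page, the `JsonLdCollector` class and the `extract_json_ld` and `get_path` helpers are all hypothetical names made up for this illustration, and the dotted-path lookup is only a simple stand-in for a real query language such as JSONPath.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the product data is embedded as JSON-LD
# next to the HTML that end users actually see.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "GBP"}}
</script>
</head><body><h1>Example Widget</h1><p>Only 19.99!</p></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the parsed content of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_json_ld(html):
    """Return all JSON-LD documents embedded in an HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.documents


def get_path(document, path):
    """Follow a dotted path such as 'offers.price' through nested dicts."""
    current = document
    for key in path.split("."):
        current = current[key]
    return current


products = extract_json_ld(SAMPLE_HTML)
price = get_path(products[0], "offers.price")
```

Without the embedded JSON-LD block, the same price would have to be recovered from the free text in the `<p>` element, which is exactly the situation the rest of this review is concerned with.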

This leads to the second aspect: being able to retrieve the information that end users see directly from the web pages. The concept is not new, and in the literature review we present some of the metadata analysis of different web scraping methods and strategies. Here again, our focus is on queryability. Rather than analysing the many packages and tools (e.g. Scrapy) that allow programmers to build web scraping scripts, we searched the literature for efforts to provide a higher-level way of querying unstructured web pages and extracting data into a structured or semi-structured representation (e.g. JSON). In fact, I had already dealt with this challenge during my doctoral research and found that OXPath addressed exactly this problem. However, OXPath had become a legacy project, built on old technologies and lacking community engagement and improvements.

Hence, we set out to build upon the OXPath research with the aim to (a) modernise the query language and support more modern technologies (i.e. rather than a non-backward-compatible extension of XPath, build on JSON5 and YAML while maintaining language backward compatibility [see footnote 1]), (b) modernise extensibility and focus on open source and community contribution, and (c) address the dated technologies and support multiple underlying extraction engines, focussing on improving speed, memory usage and processing power usage.

While we have covered the building of doctor web elsewhere in this (b)log, in the research paper included above we took some steps to benchmark and compare execution time, memory usage and compute power usage between doctor web and OXPath.