Research Series: Publishing a report about doctor web

Ylli P, February 19, 2025 | April 16, 2025

As we come to the end of the first phase of doctor web (data retrieval web engine: https://pypi.org/project/dr-web-engine/), with the project fully open-sourced and published for end users on the Docker Hub repository and the Python Package repository, we want to share the motivations behind the project, the advantages and novelty it brings to the community, and how it compares with similar projects. For that purpose, we wrote a report that has been submitted as a pre-print on arxiv.org, and we will be looking to improve it iteratively and submit it for peer review to suitable journals.

doctor-web-preprint

Literature review

We reviewed the literature from two aspects: web queryability and web scraping. The two concepts are fundamentally different.

The former deals with the structure of web pages and with attempts to move gradually from the unstructured nature of hypertext to web pages designed on top of structures like RDF or its more modern evolution, JSON-LD. These technologies would solve the problem of web data queryability, since the structures are well defined and addressable with existing query languages such as RDF Query Language, JSON Path and JSON Query. Yet the problem remains largely unsolved: adoption of these technologies has stalled and has fallen short of the initial predictions of a move to structured data for web pages. Technologies aside, the content of proprietary web pages is seen as a valuable asset, and there is little motivation to make it more easily accessible if that means losing control of the data. Offering access via APIs, on the other hand, allows more control over the data and better monetisation opportunities, while detaching the web content (what users see) from the data.

This leads to the second aspect: being able to retrieve the information that end users see directly from the web pages.
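To make the contrast between the two aspects concrete, here is a minimal sketch using only the Python standard library. It is not how doctor web or OXPath work internally; the page, the class names (JsonLdExtractor, FieldExtractor) and the {field: (tag, css_class)} query format are all hypothetical and made up for illustration. The first half reads data the page already publishes as JSON-LD; the second half recovers the same data from the markup the end user sees.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the same product exposed twice, once as JSON-LD
# (structured) and once as plain markup (what the end user sees).
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Kettle",
 "offers": {"price": "24.99", "priceCurrency": "GBP"}}
</script>
</head><body>
<div class="product"><h1 class="title">Example Kettle</h1>
<span class="price">24.99</span></div>
</body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._buffer = None  # list of text chunks while inside a JSON-LD script
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.documents.append(json.loads("".join(self._buffer)))
            self._buffer = None


class FieldExtractor(HTMLParser):
    """Fills fields from a hypothetical {field: (tag, css_class)} query.

    Deliberately naive: it assumes the matched elements contain only text.
    """

    def __init__(self, query):
        super().__init__()
        self._query = query
        self._current = None  # field whose element we are currently inside
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self._query.items():
            if tag == want_tag and want_class in classes:
                self._current = field

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.result.setdefault(self._current, data.strip())

    def handle_endtag(self, tag):
        self._current = None


def query_structured(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


def query_unstructured(html, query):
    parser = FieldExtractor(query)
    parser.feed(html)
    parser.close()
    return parser.result


# Web queryability: the data is already structured, so plain key access works.
structured = query_structured(PAGE)[0]
print(structured["offers"]["price"])

# Web scraping: the structure has to be imposed by the query itself.
scraped = query_unstructured(
    PAGE, {"name": ("h1", "title"), "price": ("span", "price")}
)
print(json.dumps(scraped))
```

In the first case the publisher has done the structuring and any JSON query tool can take over; in the second, the query author has to describe where each field lives in the visible page, which is the kind of work a declarative query language is meant to simplify.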
Web scraping is not a new concept, and in the literature review we include some of the metadata analysis of different web scraping methods and strategies. Here again, our focus is on queryability. Rather than analysing the many packages and tools (e.g. Scrapy) that allow programmers to build web scraping scripts, we searched the literature for efforts to provide a higher-level way of querying unstructured web pages and extracting data into a structured or semi-structured representation (e.g. JSON).

In fact, I had already dealt with this challenge during my doctoral research and found that OXPath was addressing exactly this problem. However, OXPath had become a legacy project, relying on old technologies and lacking community engagement and improvements. Hence, we set out to build upon the OXPath research with the aim to (a) modernise the query language and support more modern technologies (i.e. rather than an extension of XPath that is not backward compatible, build on JSON5 and YAML while maintaining language backward compatibility [see footnote 1]), (b) modernise extensibility and focus on open source and community contribution, and (c) address the dated technologies and support multiple underlying engines for extraction, focusing on improving speed, memory usage and processing power usage.

We have covered the building of doctor web elsewhere on this (b)log; in the research paper included above, we took some steps to benchmark and compare execution times, memory usage and compute power usage between doctor web and OXPath.

We look forward to hearing your thoughts and critique of the project and the research paper shared here. Feel free to drop a line in the comments.