Queryable Web Series – Introduction

Ylli P, January 29, 2025

The web is an ocean of information, yet efficiently extracting structured data from it remains a challenge. Traditional approaches like web scraping often rely on brittle, ad-hoc scripts or centralized APIs with limited flexibility. In this new blog series, Queryable Web, I will explore an alternative approach: a structured, queryable interface for web data extraction that extends beyond the limitations of existing tools.

From OXPath to a Modern Solution

This series builds on my PhD research, where I explored scalable web scraping architectures using OXPath, an extension of XPath designed to query and scrape web content effectively. While OXPath provided a structured mechanism for extracting web data, I encountered significant roadblocks due to its reliance on legacy libraries and its limited extensibility. To overcome these limitations, I developed dr-web-engine, a lightweight implementation of OXPath that replaces XPath-based querying with a JSON-based approach, making it more flexible and developer-friendly.

While dr-web-engine proved useful in my research, it was never fully developed into a sustainable open-source project. This series aims to change that.

What to Expect in This Series

Over the coming posts, I will document the journey of rewriting and extending dr-web-engine into a modern, community-driven open-source project. Topics will include:

- The Challenges of Web Scraping Today – A look at the limitations of traditional scraping techniques and why a queryable web matters.
- Understanding OXPath – How OXPath works, its strengths, and why it struggled to gain traction.
- Introducing dr-web-engine – A deep dive into its architecture, key features, and why JSON-based queries offer an advantage over XPath.
- Rewriting dr-web-engine for Sustainability – Modernizing the codebase, improving performance, and making it extensible.
- Handling JavaScript-Heavy Websites – Strategies for integrating headless browsers and dealing with dynamic content.
- Building a Queryable Web API – Turning dr-web-engine into a web-accessible service with a RESTful or GraphQL API.
- Scaling Web Scraping – Distributed scraping, concurrency management, and avoiding common pitfalls.
- Security and Ethics in Web Scraping – Respecting terms of service, handling CAPTCHAs, and ethical considerations.
- Getting Community Involvement – Strategies for fostering an open-source community around the project.

By the end of this series, I hope to establish dr-web-engine as a practical and sustainable tool that empowers developers and researchers to extract web data in a more structured and scalable way.

If you’re interested in contributing, stay tuned for the next post, where I’ll break down the challenges of web scraping today and why we need a more robust solution.