Ready for release: Making the dr-web-engine public

Ylli P, February 13, 2025 (updated February 15, 2025)

While we continue to add features to our data retrieval web engine (aka dr-web-engine, or "doctor web" for short), we are going to explore some of the steps needed to (a) make the new engine available to anyone who wants to use it, (b) make the source code available to anyone who wants to fork it, and (c) last but not least, have good unit test coverage, build and publishing pipelines, a branching strategy and documentation to invite open source contributions. By the end of this article we will have improved some of the fundamental features, be able to pip install doctor web and run queries, and finally fork and contribute a change to our public, open source GitHub repository. Oh, and if you are curious what prompt I used to generate the featured image, jump to the first comment of this post.

Improving testing

As with any project heading towards production, we need to start increasing test coverage and addressing the tech debt that prevents testability. At this stage we need good unit test coverage and some integration tests. Our code, however, isn't in great shape for unit testing, since Playwright is used all around the engine and would be difficult to mock. The Playwright browser itself is instantiated in cli.py, and that is going to be a problem when running the tests under any CI configuration. As a first step, we created an abstract browser class that is injected into the engine, and removed any direct references to Playwright:

```python
from abc import ABC, abstractmethod
from typing import Any


class BrowserClient(ABC):
    """Abstract base class for browser clients."""

    @abstractmethod
    def navigate(self, url: str) -> None:
        """Navigate to a URL."""
        pass

    @abstractmethod
    def query_selector(self, selector: str) -> Any:
        """Query the DOM for an element matching the selector."""
        pass

    @abstractmethod
    def close(self) -> None:
        """Close the browser."""
        pass
```

The concrete implementation uses Playwright, as shown in the first snippet below, while the second snippet implements the same abstract class with mocks to help with our testing.

```python
from playwright.sync_api import sync_playwright, Playwright

...

class PlaywrightClient(BrowserClient, ABC):
    """Concrete implementation of BrowserClient using Playwright."""

    def __init__(self, xvfb: bool = False):
        self.playwright: Playwright | None = None
        self.browser = None
        self.page = None
        self.xvfb = xvfb

    ...

    def navigate(self, url: str) -> None:
        """Navigate to the given URL."""
        self.page.goto(url)

    def query_selector(self, selector: str) -> Any:
        """Return an element matching the given selector."""
        return self.page.query_selector(selector)

    ...
```

```python
from unittest.mock import MagicMock


class MockBrowserClient(BrowserClient):
    """Mock implementation of BrowserClient for unit testing."""

    def __init__(self):
        self.page = MagicMock()
        self.browser = MagicMock()
        self.url = "https://example.com"

    ...
```
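The mock's query_selector (continued below) returns instances of a small MockElement helper that isn't shown in this post. As a minimal sketch only, assuming the extractor merely needs the matched element's text (the real helper in the repository may differ), it could look something like this:

```python
class MockElement:
    """Hypothetical stand-in for a matched DOM element.

    Assumption: the extractor only reads the element's text. The real
    helper in the repository may expose more of Playwright's
    ElementHandle API.
    """

    def __init__(self, value: str):
        self.value = value

    def text_content(self) -> str:
        # Mirrors the shape of Playwright's ElementHandle.text_content().
        return self.value
```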
The remaining methods of the mock client stub out navigation and map the test selectors to canned elements:

```python
    def navigate(self, url: str) -> None:
        self.url = url

    def query_selector(self, selector: str) -> Any:
        print(f"Querying selector: {selector}")  # Debugging
        if selector == ".//div[contains(@class, 'items-baseline')]/div[1]/span[1]/text()":
            return MockElement("Test Name")
        elif selector == ".//div[contains(@class, 'rating')]/span[1]/text()":
            return MockElement("5 stars")
        elif selector == ".//div[contains(@class, 'profile-image')]//img[1]/@src":
            return MockElement("https://example.com/image.jpg")
        print(f"No match for selector: {selector}")
        return None
```

At the end of this process we had written about 16 unit tests covering the engine, the extraction and the parsers.

Folder structure and public repository configuration

Another aspect to address is the folder structure. Some refactoring is needed so that the layout is easy and intuitive to read for anyone new to the project. After some thought and conversations with ChatGPT/DeepSeek (and, non-discriminatorily, lots of other models we have set up on https://chat.prifti.us, but that is another story), we settled on the following file structure:

```
.github/
├── ISSUE_TEMPLATE/
├── workflows/
cli/
├── __init__.py
├── cli.py
engine/
├── data/
│   ├── data.json
│   ├── query.json5
│   ├── query.yaml
│   ├── sitters.json
├── tests/
│   ├── integration/
│   ├── unit/
│   ├── __init__.py
│   ├── conftest.py
web_engine/
├── base/
├── parsers/
├── utils/
│   ├── __init__.py
│   ├── engine.py
│   ├── extractor.py
│   ├── models.py
│   ├── utils.py
.gitignore
CODE_OF_CONDUCT.md
CONTRIBUTING.md
LICENSE
```

At this point the code is ready to be uploaded to an open repository, for which (as with any other open source code discussed on this starlit (b)log) we are going to use GitHub, and more precisely for 'doctor web': https://github.com/starlitlog/dr-web-engine

In addition to the changes above, we set up the repository for community contribution. Some of these configurations are:

- Include a code of conduct
- Include a contributing README file
- Set up the repository to automatically run tests with every commit
- Require approval before commits to the main branch
- Suggest that contributors fork the repository and raise PRs
- Test that the configuration works (I forked, raised PRs, checked CI was triggered, approved and merged the PR)

Publication on the Python Package Index https://pypi.org/

We want to make our new open source tool available on the public Python Package Index and make sure that it is easily accessible and executable for extracting structured data from the web, as stated in the goal of this series. And in fact we have done just that. It is as easy as running:

```
pip install dr-web-engine
```

Let's rewind the steps we took to re-publish the newest version of doctor web. Firstly, as stated in the introduction of this series, I wrote and published dr-web-engine as a side project during my doctoral research. I was collecting data from multiple web sources and had written a K8s-based scalable scraping engine (see chapters 5 and 6 of my thesis: https://eprints.bbk.ac.uk/id/eprint/52517/). I had written many queries for scraping using OXPath; however, OXPath was showing some of its limits because of a lack of updates. To overcome some of these, I set out to re-write parts of OXPath in Python (rather than Java) and to change the query structure from an XPath basis to JSON. That version was published on pypi.org as a pre-release, and its last update was on 6 September 2020: https://pypi.org/project/dr-web-engine/0.3.2.2b0/.
The current engine is a complete re-write to make community contributions easier, while holding on to some of the same concepts. With that said, we published the latest build to pypi.org (https://pypi.org/project/dr-web-engine/) and, more importantly, set up the continuous delivery configuration for publishing new versions from GitHub. In addition to the CI action described above, we set up an action that is triggered every time a new commit is tagged with release_v*; it tests, builds and publishes the new version to pypi.org. The image below shows the latest build and release, which is currently the latest version on pypi.org.

Conclusions and next steps

We have now achieved the goal we set with this series of articles. Users can now (a) install the tool and run queries to extract structured data from web pages, and (b) collaborate and contribute to extend doctor web. While this article is the last of the series, there are some important next steps that will likely lead to further series of articles, building on top of what we have done and shared so far:

- Write up a research article and publish a pre-print on https://arxiv.org/
- Create a roadmap of upcoming needed features
- Create a set of slides with a recap for easy consumption
- Socialize with the community for further usage and contribution

While doctor web is fully usable and can get quite a bit done, the current version clearly misses some fundamental features that are part of OXPath. Probably the most notable missing feature is actions. For broader use-case coverage, the next set of improvements is likely to be around supporting actions like click, submit or other form interactions such as data entry; a rough sketch of what that could look like closes out this post. We hope that dynamic growth and feature requests will help shape the future (if any) of the tool, and we look forward to feature requests on these starlit log discussion pages or as part of GitHub discussions. With that, this is a wrap. Thanks all for following along.
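To make the action idea a little more concrete, here is a minimal sketch of how such support could extend the BrowserClient abstraction from earlier in this post. Everything in it (the class name, the methods, their signatures) is hypothetical and not part of the current release:

```python
from abc import ABC, abstractmethod


class ActionCapableClient(ABC):
    """Hypothetical sketch only: illustrates how click/submit/data-entry
    actions could be layered on top of the BrowserClient interface shown
    earlier. None of these methods exist in the current dr-web-engine."""

    @abstractmethod
    def click(self, selector: str) -> None:
        """Click the element matching the selector."""
        pass

    @abstractmethod
    def fill(self, selector: str, value: str) -> None:
        """Enter a value into the element matching the selector."""
        pass

    @abstractmethod
    def submit(self, selector: str) -> None:
        """Submit the form matching the selector."""
        pass
```

How such actions would then be expressed in the JSON query language is a design discussion of its own, and a good candidate for the GitHub discussions mentioned above.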