Ready for release: Making the dr-web-engine public

Ylli P, February 13, 2025 (updated February 15, 2025)

While we continue to add features to our data retrieval web engine (aka dr-web-engine, or "doctor web" for short), we are going to explore some of the steps needed to (a) make the new engine available to anyone who wants to use it, (b) make the source code available to anyone who wants to fork it, and (c) last but not least, have good unit test coverage, build and publishing pipelines, a branching strategy and documentation to invite open source contributions. By the end of this article we will have improved some of the fundamental features, be able to pip install doctor web and run queries, and finally fork and contribute a change to our public, open source GitHub repository. Oh, and if you are curious what prompt I used to generate the featured image, jump to the first comment of this post.

Improving testing

As with any project heading towards production, we need to start increasing test coverage and addressing the tech debt that prevents testability. At this stage we need good unit test coverage and some integration tests. Our code, however, isn't in great shape for unit testing, since Playwright is used all around the engine and would be difficult to mock. The Playwright browser itself is instantiated in cli.py, and that is going to be a problem when running the tests under any CI configuration. As a first step, we created an abstract browser class that is injected into the engine, and removed any direct references to Playwright:

```python
from abc import ABC, abstractmethod
from typing import Any


class BrowserClient(ABC):
    """Abstract base class for browser clients."""

    @abstractmethod
    def navigate(self, url: str) -> None:
        """Navigate to a URL."""
        pass

    @abstractmethod
    def query_selector(self, selector: str) -> Any:
        """Query the DOM for an element matching the selector."""
        pass

    @abstractmethod
    def close(self) -> None:
        """Close the browser."""
        pass
```

The concrete implementation uses Playwright, as shown in the first snippet below, while the second snippet implements the same abstract class with mocks to help with our testing.

```python
from playwright.sync_api import sync_playwright, Playwright

...

class PlaywrightClient(BrowserClient, ABC):
    """Concrete implementation of BrowserClient using Playwright."""

    def __init__(self, xvfb: bool = False):
        self.playwright: Playwright | None = None
        self.browser = None
        self.page = None
        self.xvfb = xvfb

    ...

    def navigate(self, url: str) -> None:
        """Navigate to the given URL."""
        self.page.goto(url)

    def query_selector(self, selector: str) -> Any:
        """Return an element matching the given selector."""
        return self.page.query_selector(selector)

    ...
```

```python
from unittest.mock import MagicMock


class MockBrowserClient(BrowserClient):
    """Mock implementation of BrowserClient for unit testing."""

    def __init__(self):
        self.page = MagicMock()
        self.browser = MagicMock()
        self.url = "https://example.com"

    ...
```
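The mock's query_selector (continued below) returns instances of a small MockElement helper that isn't shown in this post. As a minimal sketch only, assuming the extractor merely needs the matched element's text (the real helper in the repository may differ), it could look something like this:

```python
class MockElement:
    """Hypothetical stand-in for a matched DOM element.

    Assumption: the extractor only reads the element's text. The real
    helper in the repository may expose more of Playwright's
    ElementHandle API.
    """

    def __init__(self, value: str):
        self.value = value

    def text_content(self) -> str:
        # Mirrors the shape of Playwright's ElementHandle.text_content().
        return self.value
```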
The remaining methods of the mock client stub out navigation and map the test selectors to canned elements:

```python
    def navigate(self, url: str) -> None:
        self.url = url

    def query_selector(self, selector: str) -> Any:
        print(f"Querying selector: {selector}")  # Debugging
        if selector == ".//div[contains(@class, 'items-baseline')]/div[1]/span[1]/text()":
            return MockElement("Test Name")
        elif selector == ".//div[contains(@class, 'rating')]/span[1]/text()":
            return MockElement("5 stars")
        elif selector == ".//div[contains(@class, 'profile-image')]//img[1]/@src":
            return MockElement("https://example.com/image.jpg")
        print(f"No match for selector: {selector}")
        return None
```

At the end of this process we had written about 16 unit tests covering the engine, the extraction and the parsers.

Folder structure and public repository configuration

Another aspect to address is the folder structure. Some refactoring is needed so that the layout is easy and intuitive to read for anyone new to the project. After some thought and conversations with ChatGPT/DeepSeek (and, non-discriminatorily, lots of other models we have set up on https://chat.prifti.us, but that is another story), we settled on the following file structure:

```
.github/
├── ISSUE_TEMPLATE/
├── workflows/
cli/
├── __init__.py
├── cli.py
engine/
├── data/
│   ├── data.json
│   ├── query.json5
│   ├── query.yaml
│   ├── sitters.json
├── tests/
│   ├── integration/
│   ├── unit/
│   ├── __init__.py
│   ├── conftest.py
web_engine/
├── base/
├── parsers/
├── utils/
│   ├── __init__.py
│   ├── engine.py
│   ├── extractor.py
│   ├── models.py
│   ├── utils.py
.gitignore
CODE_OF_CONDUCT.md
CONTRIBUTING.md
LICENSE
```

At this point the code is ready to be uploaded to an open repository, for which (as with any other open source code discussed on this starlit (b)log) we are going to use GitHub, and more precisely for 'doctor web': https://github.com/starlitlog/dr-web-engine

In addition to the changes above, we set up the repository for community contribution. Some of these configurations are:

- Include a code of conduct
- Include a contributing README file
- Set up the repository to automatically run tests with every commit
- Require approval before commits to the main branch
- Suggest that contributors fork the repository and raise PRs
- Test that the configuration works (I forked, raised PRs, checked CI was triggered, approved and merged the PR)

Publication on the Python Package Index https://pypi.org/

We want to make our new open source tool available on the public Python Package Index and make sure that it is easily accessible and executable for extracting structured data from the web, as stated in the goal of this series. And in fact we have done just that. It is as easy as running:

```
pip install dr-web-engine
```

Let's rewind the steps we took to re-publish the newest version of doctor web. Firstly, as stated in the introduction of this series, I wrote and published dr-web-engine as a side project during my doctoral research. I was collecting data from multiple web sources and had written a K8s-based scalable scraping engine (see chapters 5 and 6 of my thesis: https://eprints.bbk.ac.uk/id/eprint/52517/). I had written many queries for scraping using OXPath; however, OXPath was showing some of its limits because of a lack of updates. To overcome some of these, I set out to re-write parts of OXPath in Python (rather than Java) and to change the query structure from an XPath basis to JSON. That version was published on pypi.org as a pre-release, and its last update was on 6 September 2020: https://pypi.org/project/dr-web-engine/0.3.2.2b0/.
The current engine is a complete re-write to make community contributions easier, while holding on to some of the same concepts. With that said, we published the latest build to pypi.org (https://pypi.org/project/dr-web-engine/) and, more importantly, set up the continuous delivery configuration for publishing new versions from GitHub. In addition to the CI action described above, we set up an action that is triggered every time a new commit is tagged with release_v*; it tests, builds and publishes the new version to pypi.org. The image below shows the latest build and release, which is currently the latest version on pypi.org.

Conclusions and next steps

We have now achieved the goal we set with this series of articles. Users can now (a) install the tool and run queries to extract structured data from web pages, and (b) collaborate and contribute to extend doctor web. While this article is the last of the series, there are some important next steps that will likely lead to further series of articles, building on top of what we have done and shared so far:

- Write up a research article and publish a pre-print on https://arxiv.org/
- Create a roadmap of upcoming needed features
- Create a set of slides with a recap for easy consumption
- Socialize with the community for further usage and contribution

While doctor web is fully usable and can get quite a bit done, the current version clearly misses some fundamental features that are part of OXPath. Probably the most notable missing feature is actions. For broader use-case coverage, the next set of improvements is likely to be around supporting actions like click, submit or other form interactions such as data entry; a rough sketch of what that could look like closes out this post. We hope that dynamic growth and feature requests will help shape the future (if any) of the tool, and we look forward to feature requests on these starlit log discussion pages or as part of GitHub discussions. With that, this is a wrap. Thanks all for following along.
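To make the action idea a little more concrete, here is a minimal sketch of how such support could extend the BrowserClient abstraction from earlier in this post. Everything in it (the class name, the methods, their signatures) is hypothetical and not part of the current release:

```python
from abc import ABC, abstractmethod


class ActionCapableClient(ABC):
    """Hypothetical sketch only: illustrates how click/submit/data-entry
    actions could be layered on top of the BrowserClient interface shown
    earlier. None of these methods exist in the current dr-web-engine."""

    @abstractmethod
    def click(self, selector: str) -> None:
        """Click the element matching the selector."""
        pass

    @abstractmethod
    def fill(self, selector: str, value: str) -> None:
        """Enter a value into the element matching the selector."""
        pass

    @abstractmethod
    def submit(self, selector: str) -> None:
        """Submit the form matching the selector."""
        pass
```

How such actions would then be expressed in the JSON query language is a design discussion of its own, and a good candidate for the GitHub discussions mentioned above.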