Building a Data Retrieval Web Engine: A Step-by-Step Journey

Ylli P · February 2, 2025 (updated February 10, 2025)

Introduction

In the Queryable Web Series, we’ve been exploring how to make web data as accessible and queryable as a database. The ultimate goal is to create tools that allow users to extract semi-structured data from websites using simple, declarative queries, without needing to write custom scripts for every use case. In this post, we’ll dive into the development of a Data Retrieval Engine (aka dr-web-engine) that brings us one step closer to this vision. By the end of this article, we will have built a tool that can extract data from search results, handle pagination, and even follow links to extract nested data (like reviews for each sitter). Let’s get started!

1. The Vision: A Queryable Web

The Queryable Web is about making web data accessible to everyone. Imagine being able to write a query like this:

```sql
SELECT title, url, description FROM google.com WHERE search = 'Hello World';
```

And getting structured results from a website, just like you would from a database. While we’re not quite there yet, the Data Retrieval Engine is a step in that direction. It allows users to define extraction rules in a simple, JSON-based query language, making it easier to extract semi-structured web data.

2. The Problem: Why Build a Data Retrieval Engine?

Traditional web scraping often involves writing custom scripts for each website. This approach is time-consuming, brittle, and hard to maintain. Websites change their structure frequently, and keeping up with these changes can be a nightmare. Moreover, extracting nested data (like reviews for each sitter) or handling pagination requires additional logic, making the scripts even more complex.

The Data Retrieval Engine solves these problems by providing a declarative query language that abstracts away the underlying implementation details. Instead of writing code, users define what data to extract and how to extract it, using a simple JSON or YAML format.

3. Building the Query Language

The first step is to design a query language that can express extraction rules clearly. For example, the following query defines the extraction of full names, ratings, images and distance from a sitters search:

```json5
{
  "url": "https://www.childcare.co.uk/search/Babysitters/BR6+9AA",
  "steps": [
    {
      "xpath": "//div[contains(@class, 'search-result')]",
      "fields": {
        "full_name": ".//div[contains(@class, 'items-baseline')]/div[1]/span[1]/text()",
        "rating": ".//div[contains(@class, 'rating')]/span[1]/text()",
        "distance": ".//span[contains(@class, 'distance')]/span[2]/normalize-space()",
        "image_url": ".//div[contains(@class, 'profile-image')]//img[1]/@src"
      }
    }
  ]
}
```

This query tells the engine to:

- Go to the specified URL.
- Find all elements matching the XPath //div[contains(@class, 'search-result')].
- Extract the full_name, rating, distance, and image_url for each result.
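To make those three steps concrete, here is a minimal sketch of how such a step could be executed: render the page with Playwright, then evaluate the step and field XPaths with lxml. This is only an illustration of the general approach under those assumptions, not the actual dr-web-engine implementation, and only two of the fields from the query above are shown.

```python
# Minimal sketch: render a page with Playwright and evaluate the step/field XPaths with lxml.
# Illustrative only; not the actual dr-web-engine implementation.
from lxml import html
from playwright.sync_api import sync_playwright

step_xpath = "//div[contains(@class, 'search-result')]"
field_xpaths = {
    "full_name": ".//div[contains(@class, 'items-baseline')]/div[1]/span[1]/text()",
    "rating": ".//div[contains(@class, 'rating')]/span[1]/text()",
}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.childcare.co.uk/search/Babysitters/BR6+9AA")  # step 1: go to the URL
    tree = html.fromstring(page.content())  # rendered DOM, after JavaScript has run
    browser.close()

results = []
for element in tree.xpath(step_xpath):       # step 2: find all matching elements
    record = {}
    for key, expr in field_xpaths.items():   # step 3: extract each field
        values = element.xpath(expr)
        record[key] = values[0].strip() if values else None
    results.append(record)

print(results)
```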
The problem is not new, and in fact we are building on top of OXPath, a web retrieval engine proposed and implemented at the University of Oxford. OXPath is a web data extraction language that extends XPath. It is specifically designed for extracting structured data from web pages, which are represented in HTML (an XML-like markup). It allows users to specify patterns and paths to locate and extract data from web pages, making it useful for web scraping and data mining tasks.

Key features of OXPath:

1. Navigation and Extraction: OXPath extends XPath by adding capabilities for interacting with web pages, such as clicking buttons, filling out forms, and navigating through multiple pages.
2. Stateful Interaction: Unlike XPath, which is stateless, OXPath can handle stateful interactions with web pages, such as logging in, navigating through pagination, or interacting with dynamic content.
3. Declarative Syntax: OXPath uses a declarative syntax, allowing users to specify what data to extract without needing to write complex procedural code.
4. Integration with Web Browsers: OXPath can be integrated with web browsers to simulate user interactions, making it suitable for extracting data from modern, JavaScript-heavy websites.

OXPath is particularly useful in scenarios where data extraction requires interaction with the web page, such as scraping data from behind login forms, extracting data from multi-step processes, or dealing with dynamic content loaded via AJAX.

These are some of the nice features of the current implementation of OXPath, as described by its authors. However, the current implementation has not seen much involvement from the community and has become legacy software, relying on old frameworks and versions that make it prohibitive to contribute to and use today. For example, the web driver and Selenium framework are pinned to versions that are now almost 10 years old, fixing the browser to roughly Firefox v50. This is a big issue, as modern web pages at best won’t work with old browsers and at worst will completely block access. Additionally, considering the state of the art in web technologies, XPath on its own is less widely used, and an augmented syntax on top of JSON or YAML is likely to be more natural and less complex. Let’s dive into it!

4. Web data retrieval CLI

The current implementation of OXPath has a CLI that uses OXPath queries to extract data as follows:

```bash
java -jar bin/oxpath-cli.jar -q queryfile.oxpath -f json -o data.json -xvfb
```

and a typical OXPath query looks like the following:

```
doc('https://www.aclweb.org/anthology/K/K16/')
  //div[@id='content']/p:<article>
    [./i:<title=string(.)>]
    [./b:<authors=normalize-space(.)>]
    [./a:<pdf=qualify-url(@href)>]
    [./preceding::h1[1]:<publication=string(.)>]
```

It makes sense to match the existing OXPath CLI, and we are going to gradually support most of the existing parameters that are related to data retrieval. However, the OXPath CLI also supports output into a MySQL database by providing the connection parameters; we are unlikely to need or support those parameters. The following code snippet shows some of the parameter configuration in the CLI. The end state of the code developed as part of this article can be found here: https://github.com/starlitlog/dr-web-engine/tree/4d371403702c7569e568d5d9885001f004e5f466/web_cli

```python
parser = argparse.ArgumentParser(description="OXPath-like JSON Query CLI")
parser.add_argument("-q", "--query", required=True, help="Path to the query file")
parser.add_argument("-o", "--output", required=True, help="Output file name")
parser.add_argument(
    "-f", "--format", default="json5", choices=["json5", "yaml"],
    help="Query language format (default: json5)"
)
parser.add_argument(
    "-l", "--log-level", default="info",
    choices=["error", "warning", "info", "debug"],
    help="Logging level (default: info)"
)
parser.add_argument("--log-file", help="Path to the log file (default: stdout)")
args = parser.parse_args()
```
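With the arguments in place, the CLI needs to load the query file in the selected format before handing it to the engine. Below is a minimal sketch of how that dispatch could look, assuming the json5 and PyYAML packages are installed; the load_query helper name is illustrative and not necessarily part of dr-web-engine.

```python
# Minimal sketch: load the query file as JSON5 or YAML based on the -f/--format flag.
# load_query is an illustrative helper name, not necessarily the engine's own API.
import json5  # pip install json5
import yaml   # pip install pyyaml


def load_query(path: str, fmt: str) -> dict:
    """Decode a query file into a plain dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        if fmt == "yaml":
            return yaml.safe_load(f)
        return json5.load(f)  # JSON5 is a superset of JSON, so plain JSON queries also work


query = load_query(args.query, args.format)  # args come from the parser shown above
```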
5. The query language format

The current implementation of OXPath is a superset of XPath with support for additional keywords and syntax to define the output keys, types and value transformations. OXPath is no longer valid XPath, and the added syntax can make queries quite hard to understand. Below is one of the complex queries I used for my doctoral research; as you can see, it can get both big and complex. Maintaining it over time, as page structures naturally change, can be cumbersome. Additionally, the query doesn’t resemble or give any indication of the structure of the output, which can make it difficult to refine, requiring multiple executions to get to the final result.

```
doc('https://www.childcare.co.uk/profile/257528')
/.:<result>
  [ .:<attributes>
    [ .[? //*[@id="center"]/div/h1[text()="Profile Unavailable"]:<closed=string(@title)> ]
    [? //*[@id="center"]/div/h1[text()="Page not found!"]:<deleted=string(.)> ]
  ]
  :<data>
  [ .[? //script[@type="application/ld+json" and not(contains(text(), "linkedin"))]:<rdf_payload=string(.)> ]
    //div[@id="center"]:<profile>
      [.:<profile_id=current-url()>]
      [? .//div[@class="profile-image"]/a/img:<image=qualify-url(@src)>]
      [? .//div[contains(@class, "profile-header")]//img[contains(@class, "star-rating")]/@alt:<rating=number(normalize-space(substring(string(.),0,2)))>]
      ( ... code truncated ... )
      [? .//h3[text()="My Qualifications"]/../p:<my_qualifications=normalize-space(.)>]
      [? .//h3[text()="My Availability"]/..:<my_availability=normalize-space(.)>]
      [? .//h3[text()="My Fees"]/..:<my_fees=normalize-space(.)>]
      [? .//h3[text()="My Local Schools"]/..:<my_local_schools=normalize-space(.)>]
      [? .//h3[text()="My Documents"]/../ul/li:<my_documents>
        [ .:<document_description=normalize-space(.)>
          /small:<document_date=string(.)>
        ]]
      ( ... code truncated ... )
      //div[@class="profile-footer actions cf"]//a[contains(.,'See all Reviews')]/{click /}
      /(//a[contains(@class,"next")]/{click /})*
      //div[@class="reviews"]//div[@class="review"]:<reviews>
        [ ./div[@class="title cf"]:<title=normalize-space(.)>
          /div[@class="rating"]/img/@alt:<rating_average=number(normalize-space(substring(string(.),0,2)))>
          /../../../../p[@class="rText"]:<body=normalize-space(.)>
          /../p[2]//a[1]:<author_url=qualify-url(@href)>
          ( ... code truncated ... )
          /ancestor::ul/li[3]//img/@alt:<rating_cleanliness=number(normalize-space(substring(string(.),0,2)))>
          /ancestor::ul/li[4]//img/@alt:<rating_food=number(normalize-space(substring(string(.),0,2)))>
          /ancestor::ul/li[5]//img/@alt:<rating_communication=number(normalize-space(substring(string(.),0,2)))>
        ]
  ]
]
```

We evaluated a number of options to improve the query language. Using XPath to target elements of the page DOM is still the best option, since it works well with XML-based markup (like HTML) and allows targeting elements by class, id or any other attribute. XPath is also supported by most of the potential engines that use web drivers to scan and extract data from the web (e.g. Selenium), as well as newer technologies with built-in web drivers like Playwright. We considered YAML, JSON, TOML, a custom DSL, and OXPath, and while all are valid options with their own pros and cons, in this first iteration we landed on building the query language as an extension of JSON5. JSON5 is backward compatible with JSON (i.e. you can always write the query in plain JSON and it will work) but offers some extra capabilities that are useful when writing queries (e.g. comments).
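Once the format is fixed, a decoded JSON5 query (like the sitters example from section 3) can be mapped onto a small set of typed objects inside the engine. Here is a minimal sketch using dataclasses; the class and attribute names (Query, Step, parse_query) are illustrative assumptions, not the actual dr-web-engine internals.

```python
# Minimal sketch of how a parsed query might be represented as typed Python objects.
# Query, Step and parse_query are illustrative names, not the engine's actual model.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Step:
    xpath: str              # XPath selecting the repeated elements (e.g. each search result)
    fields: dict[str, str]  # output key -> relative XPath expression


@dataclass
class Query:
    url: str                # page to open
    steps: list[Step] = field(default_factory=list)


def parse_query(raw: dict) -> Query:
    """Turn a decoded JSON5/YAML dictionary into typed objects."""
    steps = [Step(xpath=s["xpath"], fields=s.get("fields", {})) for s in raw.get("steps", [])]
    return Query(url=raw["url"], steps=steps)
```

Keeping the internal model this close to the query format also makes it easy to see how the output will mirror the query, which the next example illustrates.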
Additionally, queries written in JSON have the property that the output structure is a subset of the query. For example, the following JSON5 query and its output show some of these features:

```json5
{
  "url": "https://www.childcare.co.uk/search/Babysitters/BR6+9AA",
  "steps": [
    {
      "xpath": "//div[contains(@class, 'search-result')]", // Target each sitter listing
      "fields": {
        "full_name": ".//div[contains(@class, 'items-baseline')]/div[1]/span[1]/text()",
        "rating": ".//div[contains(@class, 'rating')]/span[1]/text()",
        "distance": ".//div[contains(@class, 'distance')]/span[2]/normalize-space()",
        "image_url": ".//div[contains(@class, 'profile-image')]//img[1]/@src"
      }
    }
  ]
}
```

```json
[
  {
    "full_name": "John Doe",
    "rating": "5.0",
    "distance": "1.2 miles",
    "image_url": "https://example.com/john-doe.jpg"
  },
  {
    "full_name": "Jane Smith",
    "rating": "4.5",
    "distance": "2.5 miles",
    "image_url": "https://example.com/jane-smith.jpg"
  }
]
```

While our focus will be on JSON5 as the main query language for dr-web-engine (aka data retrieval web engine), the code and CLI also support YAML as an additional language. This sets up the basic pattern for additional community-supported languages.

6. Adding Logging and Debugging

To make the engine more user-friendly, we added logging and debugging capabilities. The CLI supports specifying the logging level (error, warning, info, debug) and writing logs to a file. This makes it easier to troubleshoot issues and understand what’s happening under the hood. For example, if an XPath expression doesn’t match any elements, the engine logs a warning and continues execution.

- Error: for critical issues (e.g., exceptions).
- Warning: for non-blocking issues (e.g., missing fields).
- Info: for step progress (e.g., starting a method).
- Debug: for detailed tracking of every step.

```python
parser.add_argument(
    "-l", "--log-level", default="info",
    choices=["error", "warning", "info", "debug"],
    help="Logging level (default: info)"
)
parser.add_argument("--log-file", help="Path to the log file (default: stdout)")
```

Example of using the CLI to change the log level and the log output channel:

```bash
python cli.py -q query.json5 -f json5 -l debug --log-file=debug-output.log -o sitters.json
```

with the log file looking as follows:

```
2025-02-09 21:33:47,432 - DEBUG - Using selector: KqueueSelector
2025-02-09 21:33:47,996 - INFO - Launching browser...
2025-02-09 21:33:50,624 - INFO - Navigating to URL: https://www.childcare.co.uk/search/Babysitters/BR6+9AA
2025-02-09 21:33:52,178 - INFO - Executing step with XPath: //div[contains(@class, 'search-result')]
2025-02-09 21:33:52,245 - DEBUG - Found 20 elements with XPath: //div[contains(@class, 'search-result')]
2025-02-09 21:33:52,919 - INFO - Pagination limit reached or no pagination specified.
2025-02-09 21:33:52,919 - INFO - Closing browser...
2025-02-09 21:33:53,597 - INFO - Results saved to sitters.json
```
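These two CLI options map fairly directly onto Python's standard logging module. The following is a minimal sketch of how that wiring could look; setup_logging is an illustrative helper name, and the format string simply mimics the log lines shown above rather than quoting the engine's actual configuration.

```python
# Minimal sketch: wire the -l/--log-level and --log-file CLI options into Python logging.
# setup_logging is an illustrative helper name, not necessarily the engine's own API.
import logging
import sys
from typing import Optional


def setup_logging(level: str, log_file: Optional[str] = None) -> None:
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler(sys.stdout)
    logging.basicConfig(
        level=getattr(logging, level.upper()),  # "debug" -> logging.DEBUG, etc.
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[handler],
    )


setup_logging(args.log_level, args.log_file)  # args come from the argparse setup in section 4
logging.info("Launching browser...")
```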
7. The Road Ahead

While we’ve made great progress, there’s still a lot to do. In the next phase, we’ll implement pagination and the Kleene star, allowing the engine to scrape multiple pages and extract nested data. We’ll also explore advanced features like handling authentication, dealing with CAPTCHAs, and optimizing performance. Lastly, we are going to add more support for community contribution (readme, GitHub structure) and update the outdated version of dr-web-engine in the PyPI repository. Stay tuned for the next post, where we’ll dive deeper into these challenges!

Conclusion

Building the DR Web Engine has been an exciting journey so far. We’ve created a flexible, modular tool that can extract structured data from web pages using a simple query language. Along the way, we’ve tackled challenges like dynamic content, inconsistent class names, and messy data. But this is just the beginning; there’s still so much more to explore!

If you’re interested in trying out the DR Web Engine or contributing to the project, feel free to reach out. I’d love to hear your thoughts and feedback. Until next time, happy scraping!

Share your experience

What challenges have you faced with web scraping? Do you have any tips or tools to share? Let me know in the comments below! And if you found this post helpful, don’t forget to share it with your network.