DR Web Open Source Evolution — From Research Prototype to Community Project • Ylli Prifti

Remember DR Web Engine?

In early 2025, I wrote about The Queryable Web - a vision for making web data extraction feel more like querying a database. Born from my PhD research on OXPath, I rebuilt the concept into dr-web-engine, a JSON-based tool for structured web scraping.

That article captured the early stage: a working prototype with solid concepts but rough edges. Fast-forward to today, and DR Web Engine has undergone a complete transformation - not just technically, but in how it’s structured, tested, documented, and opened to community contribution.

This is the story of that evolution.

The Research Prototype Problem

The original DR Web Engine worked for my research needs, but had classic academic software issues:

Minimal testing: “It works on my machine” was the test suite
Poor documentation: README files that assumed too much context
Monolithic architecture: Adding features meant modifying core code
No contribution path: How would someone else even begin to help?
Legacy dependencies: Stuck with Firefox 50 limitations from OXPath heritage

Academic software often stays in this state - functional enough for research, too brittle for broader use. The gap between “research that works” and “production-ready software” is wider than it appears.

Architectural Transformation

Plugin System Design

The biggest architectural change was building extensibility from the ground up. Instead of cramming features into the core engine, we created a plugin ecosystem:

# Plugin interface enables community contributions
class ExtractorPlugin:
    def extract(self, page, config):
        """Plugin extraction logic"""
        pass
    
    def validate_config(self, config):
        """Plugin config validation"""
        pass

This enables features like:

AI-powered selection: Natural language element targeting via OpenAI/Ollama
JSON-LD extraction: Structured data from schema.org markup
API interception: Capturing AJAX/REST calls during page interaction

Modern Browser Integration

Moving from Firefox 50 constraints to Playwright opened up possibilities:

{
  "@url": "https://dynamic-site.com",
  "@actions": [
    {"@type": "wait", "@until": "element", "@selector": ".loaded"},
    {"@type": "javascript", "@code": "window.loadMoreContent();"},
    {"@type": "scroll", "@direction": "down", "@pixels": 500}
  ],
  "@steps": [
    {
      "@ai-select": "product prices with discounts",
      "@name": "prices"
    }
  ]
}

The engine now handles dynamic content, JavaScript execution, and complex interactions that were impossible with the original architecture.

DR Web Engine Evolution

Building for Community

Testing as Foundation

Academic code often ships without tests. Making DR Web Engine community-ready meant comprehensive test coverage:

200+ tests across unit, integration, and end-to-end scenarios
Mock browser client for CI environments without display servers
Plugin testing framework for community-contributed extensions
Automated testing on every pull request and release

# Example test structure enabling confident contributions
class TestActionProcessor:
    def test_click_action_execution(self):
        """Test click actions work correctly"""
        
    def test_javascript_error_handling(self):
        """Test JavaScript execution error paths"""
        
    def test_conditional_logic_branching(self):
        """Test if/then/else query logic"""

Documentation Strategy

Research documentation often assumes domain expertise. Community documentation needs to work for newcomers:

Getting Started Guide: 20+ examples from basic to advanced
Comprehensive examples directory: Organized by use case (e-commerce, news, social media)
Plugin development guides: Enable community extension development
API reference: Complete keyword and configuration documentation

The examples evolved from research-focused demos to practical patterns:

// Before: Research demo
{"@url": "test-site.com", "@steps": [...]}

// After: Production pattern
{
  "@url": "https://real-ecommerce.com/products",
  "@actions": [
    {"@type": "wait", "@until": "network-idle", "@timeout": 10000}
  ],
  "@steps": [
    {
      "@xpath": "//div[@class='product-card']",
      "@fields": {
        "name": ".//h3[@class='product-title']/text()",
        "price": ".//span[@class='price']/text()",
        "rating": ".//div[@class='rating']/@data-rating"
      }
    }
  ],
  "@pagination": {
    "@xpath": "//a[contains(@class, 'next-page')]",
    "@limit": 5
  }
}

AI Integration and Accessibility

One unexpected direction was AI integration - not for the sake of jumping on trends, but for solving a real usability problem: XPath barrier to entry.

Natural Language Selection

Instead of requiring XPath expertise:

{
  "@url": "https://shopping-site.com",
  "@steps": [
    {
      "@ai-select": "customer review text and ratings",
      "@name": "reviews",
      "@max-results": 10
    }
  ]
}

This works with both OpenAI’s API and local Ollama models, making the tool accessible to users who understand data extraction but not XPath syntax.

LLM-Optimized Outputs

The engine now produces formats optimized for AI training workflows:

JSONL streaming: Perfect for large dataset processing
OpenAI Chat format: Ready for fine-tuning pipelines
Anthropic format: Compatible with Claude training data

This bridges web scraping with modern AI development workflows.

Academic Validation and Community Building

arXiv Publication

Transitioning from prototype to community project required academic credibility. We published a comprehensive paper comparing DR Web Engine with existing tools, including performance benchmarks and technical innovations.

The paper serves multiple purposes:

Academic legitimacy for researchers considering the tool
Technical documentation of design decisions and innovations
Benchmark baseline for future improvements and comparisons

Early Community Response

While adoption is still growing, early feedback has been encouraging:

Plugin contributions: Community members extending functionality
Use case diversity: Applications beyond my original research scope
Technical feedback: Issues and improvements from real-world usage

The most valuable feedback has been about use cases I hadn’t considered - from monitoring competitor pricing to extracting structured data for AI training datasets.

Lessons in Research-to-Production Evolution

What Works

Incremental public development: Publishing early and iterating based on feedback proved more effective than trying to build everything privately first.

Plugin architecture from day one: Even simple features benefit from modular design when you’re building for community contribution.

Comprehensive documentation: Time invested in examples and guides pays dividends in community engagement and reduced support burden.

What’s Challenging

Balancing simplicity with power: Advanced features can complicate the simple use cases that attract new users.

Managing backwards compatibility: Early API decisions become difficult to change once people depend on them.

Community governance: Establishing contribution standards and review processes without discouraging participation.

Academic Software Transition Points

The journey highlighted specific moments where academic code needs professional software practices:

Testing: When you want others to confidently modify code
Documentation: When you want users beyond yourself
Architecture: When you want features beyond your original vision
Governance: When you want sustainable community contribution

Current State and Future Direction

DR Web Engine has evolved from a research tool solving my data collection problems to a platform enabling broader web data accessibility. The plugin system allows community-driven feature development, while AI integration reduces technical barriers.

Key technical capabilities now include:

Advanced interaction handling: JavaScript execution, form automation, dynamic content
Conditional extraction logic: Smart branching based on page states
Recursive navigation: Following link chains with cycle detection
Multiple output formats: From JSON to AI training datasets

But the more significant evolution is cultural: from research prototype to community project with sustainable development practices, comprehensive documentation, and pathways for contribution.

What’s Next

The roadmap focuses on community enablement rather than just feature addition:

Plugin marketplace: Making community extensions discoverable
Educational content: Workshops and tutorials for web scraping newcomers
Integration examples: Connecting with popular data processing workflows
Performance optimization: Scaling to larger extraction tasks

Reflections on Open Source Evolution

This evolution taught me that successful open source isn’t just about releasing code - it’s about building sustainable systems for collaboration. The technical transformation from prototype to production was significant, but the cultural transformation from personal tool to community project was equally important.

Academic software often stays locked in institutional silos not because the research isn’t valuable, but because the transition costs are underestimated. Making research code community-ready requires rethinking architecture, documentation, testing, and governance from scratch.

DR Web Engine’s journey from PhD research tool to open source platform demonstrates that academic software can successfully evolve into community projects - but it requires intentional investment in the infrastructure of collaboration, not just the technology itself.

The result is software that serves broader needs while maintaining the research insights that sparked its creation. Sometimes the best way to advance research is to build the tools that enable others to explore alongside you.