DR Web Open Source Evolution — From Research Prototype to Community Project
Remember DR Web Engine?
In early 2025, I wrote about The Queryable Web - a vision for making web data extraction feel more like querying a database. Born from my PhD research on OXPath, I rebuilt the concept into dr-web-engine, a JSON-based tool for structured web scraping.
That article captured the early stage: a working prototype with solid concepts but rough edges. Fast-forward to today, and DR Web Engine has undergone a complete transformation - not just technically, but in how it’s structured, tested, documented, and opened to community contribution.
This is the story of that evolution.
The Research Prototype Problem
The original DR Web Engine worked for my research needs, but had classic academic software issues:
- Minimal testing: “It works on my machine” was the test suite
- Poor documentation: README files that assumed too much context
- Monolithic architecture: Adding features meant modifying core code
- No contribution path: How would someone else even begin to help?
- Legacy dependencies: Stuck with Firefox 50 limitations from OXPath heritage
Academic software often stays in this state - functional enough for research, too brittle for broader use. The gap between “research that works” and “production-ready software” is wider than it appears.
Architectural Transformation
Plugin System Design
The biggest architectural change was building extensibility from the ground up. Instead of cramming features into the core engine, we created a plugin ecosystem:
# Plugin interface enables community contributions
class ExtractorPlugin:
def extract(self, page, config):
"""Plugin extraction logic"""
pass
def validate_config(self, config):
"""Plugin config validation"""
pass
This enables features like:
- AI-powered selection: Natural language element targeting via OpenAI/Ollama
- JSON-LD extraction: Structured data from schema.org markup
- API interception: Capturing AJAX/REST calls during page interaction
Modern Browser Integration
Moving from Firefox 50 constraints to Playwright opened up possibilities:
{
"@url": "https://dynamic-site.com",
"@actions": [
{"@type": "wait", "@until": "element", "@selector": ".loaded"},
{"@type": "javascript", "@code": "window.loadMoreContent();"},
{"@type": "scroll", "@direction": "down", "@pixels": 500}
],
"@steps": [
{
"@ai-select": "product prices with discounts",
"@name": "prices"
}
]
}
The engine now handles dynamic content, JavaScript execution, and complex interactions that were impossible with the original architecture.

Building for Community
Testing as Foundation
Academic code often ships without tests. Making DR Web Engine community-ready meant comprehensive test coverage:
- 200+ tests across unit, integration, and end-to-end scenarios
- Mock browser client for CI environments without display servers
- Plugin testing framework for community-contributed extensions
- Automated testing on every pull request and release
# Example test structure enabling confident contributions
class TestActionProcessor:
def test_click_action_execution(self):
"""Test click actions work correctly"""
def test_javascript_error_handling(self):
"""Test JavaScript execution error paths"""
def test_conditional_logic_branching(self):
"""Test if/then/else query logic"""
Documentation Strategy
Research documentation often assumes domain expertise. Community documentation needs to work for newcomers:
- Getting Started Guide: 20+ examples from basic to advanced
- Comprehensive examples directory: Organized by use case (e-commerce, news, social media)
- Plugin development guides: Enable community extension development
- API reference: Complete keyword and configuration documentation
The examples evolved from research-focused demos to practical patterns:
// Before: Research demo
{"@url": "test-site.com", "@steps": [...]}
// After: Production pattern
{
"@url": "https://real-ecommerce.com/products",
"@actions": [
{"@type": "wait", "@until": "network-idle", "@timeout": 10000}
],
"@steps": [
{
"@xpath": "//div[@class='product-card']",
"@fields": {
"name": ".//h3[@class='product-title']/text()",
"price": ".//span[@class='price']/text()",
"rating": ".//div[@class='rating']/@data-rating"
}
}
],
"@pagination": {
"@xpath": "//a[contains(@class, 'next-page')]",
"@limit": 5
}
}
AI Integration and Accessibility
One unexpected direction was AI integration - not for the sake of jumping on trends, but for solving a real usability problem: XPath barrier to entry.
Natural Language Selection
Instead of requiring XPath expertise:
{
"@url": "https://shopping-site.com",
"@steps": [
{
"@ai-select": "customer review text and ratings",
"@name": "reviews",
"@max-results": 10
}
]
}
This works with both OpenAI’s API and local Ollama models, making the tool accessible to users who understand data extraction but not XPath syntax.
LLM-Optimized Outputs
The engine now produces formats optimized for AI training workflows:
- JSONL streaming: Perfect for large dataset processing
- OpenAI Chat format: Ready for fine-tuning pipelines
- Anthropic format: Compatible with Claude training data
This bridges web scraping with modern AI development workflows.
Academic Validation and Community Building
arXiv Publication
Transitioning from prototype to community project required academic credibility. We published a comprehensive paper comparing DR Web Engine with existing tools, including performance benchmarks and technical innovations.
The paper serves multiple purposes:
- Academic legitimacy for researchers considering the tool
- Technical documentation of design decisions and innovations
- Benchmark baseline for future improvements and comparisons
Early Community Response
While adoption is still growing, early feedback has been encouraging:
- Plugin contributions: Community members extending functionality
- Use case diversity: Applications beyond my original research scope
- Technical feedback: Issues and improvements from real-world usage
The most valuable feedback has been about use cases I hadn’t considered - from monitoring competitor pricing to extracting structured data for AI training datasets.
Lessons in Research-to-Production Evolution
What Works
Incremental public development: Publishing early and iterating based on feedback proved more effective than trying to build everything privately first.
Plugin architecture from day one: Even simple features benefit from modular design when you’re building for community contribution.
Comprehensive documentation: Time invested in examples and guides pays dividends in community engagement and reduced support burden.
What’s Challenging
Balancing simplicity with power: Advanced features can complicate the simple use cases that attract new users.
Managing backwards compatibility: Early API decisions become difficult to change once people depend on them.
Community governance: Establishing contribution standards and review processes without discouraging participation.
Academic Software Transition Points
The journey highlighted specific moments where academic code needs professional software practices:
- Testing: When you want others to confidently modify code
- Documentation: When you want users beyond yourself
- Architecture: When you want features beyond your original vision
- Governance: When you want sustainable community contribution
Current State and Future Direction
DR Web Engine has evolved from a research tool solving my data collection problems to a platform enabling broader web data accessibility. The plugin system allows community-driven feature development, while AI integration reduces technical barriers.
Key technical capabilities now include:
- Advanced interaction handling: JavaScript execution, form automation, dynamic content
- Conditional extraction logic: Smart branching based on page states
- Recursive navigation: Following link chains with cycle detection
- Multiple output formats: From JSON to AI training datasets
But the more significant evolution is cultural: from research prototype to community project with sustainable development practices, comprehensive documentation, and pathways for contribution.
What’s Next
The roadmap focuses on community enablement rather than just feature addition:
- Plugin marketplace: Making community extensions discoverable
- Educational content: Workshops and tutorials for web scraping newcomers
- Integration examples: Connecting with popular data processing workflows
- Performance optimization: Scaling to larger extraction tasks
Reflections on Open Source Evolution
This evolution taught me that successful open source isn’t just about releasing code - it’s about building sustainable systems for collaboration. The technical transformation from prototype to production was significant, but the cultural transformation from personal tool to community project was equally important.
Academic software often stays locked in institutional silos not because the research isn’t valuable, but because the transition costs are underestimated. Making research code community-ready requires rethinking architecture, documentation, testing, and governance from scratch.
DR Web Engine’s journey from PhD research tool to open source platform demonstrates that academic software can successfully evolve into community projects - but it requires intentional investment in the infrastructure of collaboration, not just the technology itself.
The result is software that serves broader needs while maintaining the research insights that sparked its creation. Sometimes the best way to advance research is to build the tools that enable others to explore alongside you.