JSON Schema Generation and Validation #50

Open
@butschster

This proposal introduces a class-based system for JSON schema generation and validation in the Context Generator. Rather than duplicating information in separate schema files, we'll use PHP class definitions as the single source of truth, extracting schema information directly from class structure, type hints, and targeted attributes.

2. Design Goals

  • Establish a single source of truth for both PHP objects and JSON schema
  • Leverage PHP 8.3's native features (type hints, constructor property promotion, etc.) whenever possible
  • Use attributes strategically only where native PHP features are insufficient
  • Generate comprehensive JSON schema for validation and IDE integration
  • Enable automatic mapping between JSON and PHP objects
  • Provide a consistent approach for all schema components (documents, sources, modifiers)

3. Key Components

3.1 Class Hierarchy

The schema structure will be represented by a hierarchy of PHP classes:

JsonSchema (root)
├── Documents[]
│   ├── Sources[]
│   │   ├── FileSource
│   │   ├── GithubSource
│   │   ├── GitDiffSource
│   │   ├── UrlSource
│   │   └── TextSource
│   └── Modifiers[]
│       ├── PhpSignatureModifier
│       ├── PhpContentFilterModifier
│       ├── PhpDocsModifier
│       └── SanitizerModifier
└── Settings
    └── ModifierAliases{}

3.2 Core Attributes

We'll define a set of attributes to enhance the schema information that can't be expressed through PHP's type system alone:

// Marks classes that should be included in schema generation
#[Attribute(Attribute::TARGET_CLASS)]
class SchemaType {
    public function __construct(
        public readonly string $name,
        public readonly string $description = '',
        public readonly bool $isRoot = false,
    ) {}
}

// Excludes a property from schema generation
#[Attribute(Attribute::TARGET_PROPERTY)]
class Skip {}

// Override property details for schema
#[Attribute(Attribute::TARGET_PROPERTY)]
class Property {
    public function __construct(
        public readonly ?string $name = null,
        public readonly ?string $description = null,
        public readonly ?string $format = null,
    ) {}
}

// Mark property as required in schema
#[Attribute(Attribute::TARGET_PROPERTY)]
class Required {}

// Add pattern validation for string properties
#[Attribute(Attribute::TARGET_PROPERTY)]
class Pattern {
    public function __construct(
        public readonly string $pattern,
        public readonly ?string $description = null
    ) {}
}

// Define enumerated values
#[Attribute(Attribute::TARGET_PROPERTY)]
class Enum {
    public function __construct(public readonly array $values) {}
}

// Define the structure of array items
#[Attribute(Attribute::TARGET_PROPERTY)]
class Items {
    public function __construct(
        public readonly string $type,
        public readonly ?string $ref = null,
        public readonly ?array $enum = null,
    ) {}
}

// Define a property as reference to another schema type
#[Attribute(Attribute::TARGET_PROPERTY)]
class Reference {
    public function __construct(
        public readonly string $type,
    ) {}
}
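
The attributes above only carry metadata; the generator has to read them back through reflection. A minimal sketch of that lookup, assuming trimmed copies of SchemaType and Property and a hypothetical DemoSource class and firstAttribute() helper that are not part of the proposal:

```php
<?php

#[Attribute(Attribute::TARGET_CLASS)]
final class SchemaType
{
    public function __construct(
        public readonly string $name,
        public readonly string $description = '',
    ) {}
}

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Property
{
    public function __construct(
        public readonly ?string $name = null,
        public readonly ?string $description = null,
    ) {}
}

// Hypothetical class used only for this illustration.
#[SchemaType(name: 'demoSource', description: 'Demo source')]
final class DemoSource
{
    public function __construct(
        #[Property(description: 'Text content')]
        public readonly string $content = '',
    ) {}
}

/** Returns the first attribute instance of the given class, or null if absent. */
function firstAttribute(ReflectionClass|ReflectionProperty $target, string $attributeClass): ?object
{
    $attributes = $target->getAttributes($attributeClass);

    return $attributes === [] ? null : $attributes[0]->newInstance();
}

$class = new ReflectionClass(DemoSource::class);
$schemaType = firstAttribute($class, SchemaType::class);
$property = firstAttribute($class->getProperty('content'), Property::class);

echo $schemaType->name, ': ', $property->description, PHP_EOL;
```

Attributes placed on promoted constructor parameters are attached to both the parameter and the property, so the sketch reads them from the ReflectionProperty, which matches the TARGET_PROPERTY restriction.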

4. Schema Generation Process

The schema generation process will involve:

  1. Finding all classes marked with #[SchemaType] attribute
  2. Using reflection to analyze class properties, types, and attributes
  3. Mapping PHP types to JSON schema types
  4. Building a complete schema structure from the class hierarchy
  5. Serializing the schema to JSON
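
The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the proposed implementation: DemoTextSource, findSchemaClasses() and describeClass() are hypothetical, discovery relies on get_declared_classes() (which only sees classes that are already loaded, so a real project would more likely scan directories or a class map), and only scalar types are mapped.

```php
<?php

#[Attribute(Attribute::TARGET_CLASS)]
final class SchemaType
{
    public function __construct(public readonly string $name) {}
}

// Hypothetical classes used only for this illustration.
#[SchemaType(name: 'textSource')]
final class DemoTextSource
{
    public function __construct(
        public readonly string $content = '',
        public readonly bool $trim = true,
    ) {}
}

final class NotPartOfSchema {}

/** Step 1: find loaded user classes that carry #[SchemaType]. */
function findSchemaClasses(): array
{
    $found = [];
    foreach (get_declared_classes() as $className) {
        $reflection = new ReflectionClass($className);
        if ($reflection->isUserDefined() && $reflection->getAttributes(SchemaType::class) !== []) {
            $found[] = $className;
        }
    }

    return $found;
}

/** Steps 2-4: reflect on public properties and map scalar PHP types to schema types. */
function describeClass(string $className): array
{
    $scalarMap = ['int' => 'number', 'float' => 'number', 'string' => 'string', 'bool' => 'boolean', 'array' => 'array'];
    $properties = [];
    foreach ((new ReflectionClass($className))->getProperties(ReflectionProperty::IS_PUBLIC) as $property) {
        $type = $property->getType();
        // Anything other than a mapped scalar is skipped in this sketch.
        if ($type instanceof ReflectionNamedType && isset($scalarMap[$type->getName()])) {
            $properties[$property->getName()] = ['type' => $scalarMap[$type->getName()]];
        }
    }

    return ['type' => 'object', 'properties' => $properties];
}

$definitions = [];
foreach (findSchemaClasses() as $className) {
    $attribute = (new ReflectionClass($className))->getAttributes(SchemaType::class)[0]->newInstance();
    $definitions[$attribute->name] = describeClass($className);
}

// Step 5: serialize the collected definitions.
echo json_encode(['definitions' => $definitions], JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), PHP_EOL;
```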

4.1 Type Mapping

PHP types will be mapped to JSON schema types:

  • int, float → number
  • string → string
  • bool → boolean
  • array → array
  • object → object
  • Class types → References to defined schemas
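
One way this mapping could look in code is sketched below. The function names, the use of anyOf for union and nullable types, and the convention of referencing classes by their lcfirst short name are assumptions made for illustration. The sketch follows the table above in mapping int to number, although JSON Schema also offers a separate integer type.

```php
<?php

/** Maps a single PHP type name to a JSON Schema fragment. */
function mapTypeName(string $name): array
{
    return match ($name) {
        'int', 'float' => ['type' => 'number'],
        'string' => ['type' => 'string'],
        'bool' => ['type' => 'boolean'],
        'array' => ['type' => 'array'],
        'object' => ['type' => 'object'],
        'null' => ['type' => 'null'],
        // Assumption: class types are referenced by their lcfirst short name.
        default => ['$ref' => '#/definitions/' . lcfirst(substr(strrchr('\\' . $name, '\\'), 1))],
    };
}

/** Maps a reflected type, including nullable and union types, to a schema fragment. */
function mapType(?ReflectionType $type): array
{
    if ($type === null) {
        return []; // untyped property: no constraint
    }
    if ($type instanceof ReflectionUnionType) {
        // anyOf rather than oneOf, because int|float would map to two identical branches.
        return ['anyOf' => array_map(mapType(...), $type->getTypes())];
    }
    if ($type instanceof ReflectionNamedType) {
        $schema = mapTypeName($type->getName());
        if ($type->allowsNull() && $type->getName() !== 'null') {
            return ['anyOf' => [$schema, ['type' => 'null']]];
        }

        return $schema;
    }

    throw new LogicException('Intersection types are not handled in this sketch');
}

// Hypothetical class used only for this illustration.
final class MappingExample
{
    public function __construct(
        public readonly int $count = 0,
        public readonly string|array $paths = [],
        public readonly ?string $label = null,
    ) {}
}

foreach ((new ReflectionClass(MappingExample::class))->getProperties() as $property) {
    echo $property->getName(), ' => ', json_encode(mapType($property->getType())), PHP_EOL;
}
```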

4.2 Schema Generator Implementation

class SchemaGenerator
{
    private array $definitions = [];
    private array $processedClasses = [];
    
    public function generateSchema(): array
    {
        // Find all classes with SchemaType attribute
        $schemaClasses = $this->findSchemaClasses();
        
        // Process root schema class first
        $rootClass = $this->findRootSchemaClass($schemaClasses);
        $this->processClass(new \ReflectionClass($rootClass));
        
        // Process all remaining classes
        foreach ($schemaClasses as $class) {
            if ($class !== $rootClass) {
                $this->processClass(new \ReflectionClass($class));
            }
        }
        
        return [
            '$schema' => 'http://json-schema.org/draft-07/schema#',
            'fileMatch' => ['context.json'],
            'title' => 'Context Generator Configuration',
            'description' => 'Configuration schema for Context Generator',
            'type' => 'object',
            'required' => ['documents'],
            'properties' => [
                'documents' => [
                    'type' => 'array',
                    'description' => 'List of documents to generate',
                    'items' => ['$ref' => '#/definitions/document']
                ],
                'settings' => [
                    // In draft-07, keywords next to $ref are ignored, so the type and
                    // description belong in the referenced definition.
                    '$ref' => '#/definitions/settings'
                ]
            ],
            'definitions' => $this->definitions
        ];
    }
    
    private function processClass(\ReflectionClass $class): void
    {
        // Implementation details...
    }
    
    private function processProperty(\ReflectionProperty $property, array &$schema): void
    {
        // Implementation details...
    }
    
    // Additional helper methods...
}
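
The body of processClass is left open above. A possible shape is sketched here as a free function, assuming the Skip, Required and Property attributes from section 3.2; the DemoDocument class and the scalar-only type map are hypothetical simplifications.

```php
<?php

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Skip {}

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Required {}

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Property
{
    public function __construct(
        public readonly ?string $name = null,
        public readonly ?string $description = null,
    ) {}
}

// Hypothetical class used only for this illustration.
final class DemoDocument
{
    public function __construct(
        #[Required]
        #[Property(description: 'Output file path')]
        public readonly string $outputPath,
        #[Property(name: 'overwrite', description: 'Replace an existing file')]
        public readonly bool $allowOverwrite = true,
        #[Skip]
        public readonly int $internalCounter = 0,
    ) {}
}

/** Builds an object schema from a class, honoring Skip, Property and Required. */
function processClass(ReflectionClass $class): array
{
    $typeMap = ['int' => 'number', 'float' => 'number', 'string' => 'string', 'bool' => 'boolean', 'array' => 'array'];
    $schema = ['type' => 'object', 'properties' => [], 'required' => []];

    foreach ($class->getProperties(ReflectionProperty::IS_PUBLIC) as $property) {
        if ($property->getAttributes(Skip::class) !== []) {
            continue; // excluded from the schema
        }

        $meta = ($property->getAttributes(Property::class)[0] ?? null)?->newInstance();
        $name = $meta?->name ?? $property->getName();

        $definition = [];
        $type = $property->getType();
        if ($type instanceof ReflectionNamedType && isset($typeMap[$type->getName()])) {
            $definition['type'] = $typeMap[$type->getName()];
        }
        if ($meta?->description !== null) {
            $definition['description'] = $meta->description;
        }

        $schema['properties'][$name] = $definition;
        if ($property->getAttributes(Required::class) !== []) {
            $schema['required'][] = $name;
        }
    }

    return $schema;
}

$schema = processClass(new ReflectionClass(DemoDocument::class));
echo json_encode($schema, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), PHP_EOL;
```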

5. Implementation Example

5.1 Source Type Definition

Here's how a source type would be defined using this approach:

#[SchemaType(name: 'fileSource', description: 'File source - includes content from local filesystem')]
final class FileSource extends AbstractSourceWithModifiers implements FilterableSourceInterface
{
    public function __construct(
        #[Required]
        #[Property(description: 'Path(s) to directory or files to include')]
        public readonly string|array $sourcePaths,
        
        string $description = '',
        
        #[Property(description: 'File pattern(s) to match')]
        public readonly string|array $filePattern = '*.*',
        
        #[Property(description: 'Patterns to exclude files')]
        public readonly array $notPath = [],
        
        #[Property(description: 'Patterns to include only files in specific paths')]
        public readonly string|array $path = [],
        
        #[Property(description: 'Patterns to include only files containing specific content')]
        public readonly string|array $contains = [],
        
        #[Property(description: 'Patterns to exclude files containing specific content')]
        public readonly string|array $notContains = [],
        
        #[Pattern(pattern: '^[<>]=?\\s+\\d+[KMG]i?$|^since\\s+.+$', 
                 description: 'Size constraints (e.g., "> 10K", "< 1M")')]
        #[Property(description: 'Size constraints for files')]
        public readonly string|array $size = [],
        
        #[Property(description: 'Whether to display a directory tree visualization')]
        public readonly bool $showTreeView = true,
        
        array $modifiers = [],
    ) {
        parent::__construct($description, $modifiers);
    }
    
    // Method implementations...
}
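
To see which values the #[Pattern] on $size would accept, the expression can be exercised directly. The snippet below is only a quick check: matchesPattern() is a hypothetical helper, and it uses PCRE via preg_match, whereas a JSON Schema validator applies its own regular expression dialect (the constructs used here behave the same in both).

```php
<?php

// Pattern copied from the #[Pattern] attribute on $size above.
$pattern = '^[<>]=?\\s+\\d+[KMG]i?$|^since\\s+.+$';

/** Wraps the bare schema pattern in PCRE delimiters and tests a candidate value. */
function matchesPattern(string $pattern, string $value): bool
{
    // '~' is a safe delimiter here because the pattern does not contain it.
    return preg_match('~' . $pattern . '~', $value) === 1;
}

foreach (['> 10K', '< 1M', '>= 5Mi', '>10K', '10K', 'since yesterday'] as $candidate) {
    printf("%-16s %s\n", $candidate, matchesPattern($pattern, $candidate) ? 'accepted' : 'rejected');
}
```

Note that the whitespace after the operator is mandatory, and that the second alternative also accepts values such as "since yesterday", which may or may not be intended for a size constraint.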

5.2 Modifier Definition

Similarly, here's how a modifier would be defined:

#[SchemaType(name: 'phpContentFilterModifier', description: 'PHP content filter modifier - filter PHP class elements based on criteria')]
final class PhpContentFilterModifier implements ModifierInterface
{
    public function __construct(
        #[Items(type: 'string')]
        #[Property(description: 'Method names to include (empty means include all unless exclude_methods is set)')]
        public readonly array $includeMethods = [],
        
        #[Items(type: 'string')]
        #[Property(description: 'Method names to exclude')]
        public readonly array $excludeMethods = [],
        
        #[Items(type: 'string', enum: ['public', 'protected', 'private'])]
        #[Property(description: 'Method visibilities to include')]
        public readonly array $methodVisibility = ['public', 'protected', 'private'],
        
        #[Property(description: 'Whether to keep method bodies or replace with placeholders')]
        public readonly bool $keepMethodBodies = false,
        
        // Additional properties...
    ) {}
    
    // Method implementations...
}
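
How #[Items] might translate into an "items" sub-schema is sketched below. DemoModifier, itemsToSchema() and describeArrayProperties() are hypothetical, and treating $ref as the name of an entry under "definitions" is an assumption. Because the properties are promoted, their default values are available on the constructor parameters rather than on the properties themselves.

```php
<?php

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Items
{
    public function __construct(
        public readonly string $type,
        public readonly ?string $ref = null,
        public readonly ?array $enum = null,
    ) {}
}

// Hypothetical class used only for this illustration.
final class DemoModifier
{
    public function __construct(
        #[Items(type: 'string', enum: ['public', 'protected', 'private'])]
        public readonly array $methodVisibility = ['public', 'protected', 'private'],
        #[Items(type: 'string')]
        public readonly array $includeMethods = [],
    ) {}
}

/** Converts an Items attribute into the "items" sub-schema. */
function itemsToSchema(Items $items): array
{
    if ($items->ref !== null) {
        // Assumption: $ref holds the name of an entry under "definitions".
        return ['$ref' => '#/definitions/' . $items->ref];
    }
    $schema = ['type' => $items->type];
    if ($items->enum !== null) {
        $schema['enum'] = $items->enum;
    }

    return $schema;
}

/** Builds array-property schemas, taking defaults from promoted constructor parameters. */
function describeArrayProperties(string $className): array
{
    $class = new ReflectionClass($className);
    $properties = [];
    foreach ($class->getConstructor()->getParameters() as $parameter) {
        if (!$parameter->isPromoted()) {
            continue;
        }
        // Read the attribute from the property: Items targets properties only.
        $property = $class->getProperty($parameter->getName());
        $attributes = $property->getAttributes(Items::class);
        if ($attributes === []) {
            continue;
        }
        $schema = ['type' => 'array', 'items' => itemsToSchema($attributes[0]->newInstance())];
        if ($parameter->isDefaultValueAvailable()) {
            $schema['default'] = $parameter->getDefaultValue();
        }
        $properties[$property->getName()] = $schema;
    }

    return $properties;
}

$properties = describeArrayProperties(DemoModifier::class);
echo json_encode($properties, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), PHP_EOL;
```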

6. Integration with Existing Code

6.1 Loading Configuration

The existing configuration loader will be enhanced to use the new object-oriented model:

class JsonConfigDocumentsLoader implements DocumentsLoaderInterface
{
    // Existing code...
    
    public function load(): DocumentRegistry
    {
        // Read and parse JSON as before
        $jsonContent = $this->files->read($this->configPath);
        $config = \json_decode($jsonContent, true, flags: JSON_THROW_ON_ERROR);
        
        // Use new object mapper to create schema objects
        $objectMapper = new ObjectMapper();
        $jsonSchema = $objectMapper->mapToObject($config, JsonSchema::class);
        
        // Convert to DocumentRegistry for backward compatibility
        return $this->convertToDocumentRegistry($jsonSchema);
    }
    
    // Additional methods...
}
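
ObjectMapper is not specified further in this proposal. A minimal sketch of what mapToObject could do, assuming JSON keys match constructor parameter names exactly, is shown below; DemoConfig and DemoSettings are hypothetical, and arrays of objects (such as documents[]) are not handled here and would need the Items or Reference metadata.

```php
<?php

final class ObjectMapper
{
    /** Instantiates $className from $data by matching keys to constructor parameter names. */
    public function mapToObject(array $data, string $className): object
    {
        $class = new ReflectionClass($className);
        $arguments = [];

        foreach ($class->getConstructor()?->getParameters() ?? [] as $parameter) {
            $name = $parameter->getName();
            if (!array_key_exists($name, $data)) {
                if (!$parameter->isDefaultValueAvailable()) {
                    throw new InvalidArgumentException("Missing required key '{$name}' for {$className}");
                }
                continue; // let the constructor default apply
            }

            $value = $data[$name];
            $type = $parameter->getType();
            // Recurse into nested objects when the parameter is typed with a class.
            if (is_array($value) && $type instanceof ReflectionNamedType && !$type->isBuiltin()) {
                $value = $this->mapToObject($value, $type->getName());
            }
            $arguments[$name] = $value;
        }

        // String keys are passed as named arguments (PHP 8.0+).
        return $class->newInstanceArgs($arguments);
    }
}

// Hypothetical classes used only for this illustration.
final class DemoSettings
{
    public function __construct(public readonly bool $verbose = false) {}
}

final class DemoConfig
{
    public function __construct(
        public readonly string $name,
        public readonly ?DemoSettings $settings = null,
    ) {}
}

$mapper = new ObjectMapper();
$config = $mapper->mapToObject(['name' => 'docs', 'settings' => ['verbose' => true]], DemoConfig::class);

var_dump($config->name, $config->settings?->verbose);
```

Passing the arguments by name means the order of keys in the JSON file does not matter, and a missing required key fails early with a message that names the offending class.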

6.2 Validation

The schema objects can be used for validation:

class JsonValidator
{
    private SchemaGenerator $schemaGenerator;
    
    public function __construct(SchemaGenerator $schemaGenerator)
    {
        $this->schemaGenerator = $schemaGenerator;
    }
    
    public function validate(array $data): array
    {
        $schema = $this->schemaGenerator->generateSchema();
        
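        // Use a JSON Schema validation library (assumed here: justinrainbow/json-schema).
        // CHECK_MODE_TYPE_CAST lets associative arrays be validated as objects.
        $validator = new \JsonSchema\Validator();
        $validator->validate($data, $schema, \JsonSchema\Constraints\Constraint::CHECK_MODE_TYPE_CAST);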
        
        return $validator->getErrors();
    }
}

7. Benefits of This Approach

  1. Single Source of Truth: PHP classes define both runtime behavior and schema constraints
  2. Self-documenting: Class structure and attributes make the schema requirements clear
  3. Type Safety: PHP's type system rejects values of the wrong type when configuration is mapped onto the classes
  4. IDE Support: Class structure enables autocomplete and refactoring support
  5. Extensibility: Easy to add new source types or modifiers
  6. Maintainability: Changes to properties are reflected automatically in the schema

Labels: config:loader (ConfigLoader component for parsing and validating config files), config:parser (for JSON, PHP, YAML config formats), type:enhancement (New feature or request)
Status: Backlog