nabello/grabbit

 
 

Purpose

The purpose of Grabbit is to provide a fast and reliable way of copying content from one Sling (specifically Adobe CQ/AEM) instance to another.

Existing solutions have been tried and found insufficient for very large data sets (gigabytes to terabytes), especially over a network. CQ's .zip packages are extremely space-inefficient, causing a lot of extra I/O. vlt rcp and Mark Adamcin's recap use essentially the same mechanism: WebDAV using XML, with an HTTP handshake for every node and for many sets of properties, so any latency whatsoever on the network hurts performance enormously.

Grabbit creates a stream of data using Google's Protocol Buffers, aka "ProtoBuf". Protocol Buffers is an extremely efficient (in terms of CPU, memory and wire size) binary serialization format with a compact encoding.

Moreover, by sending a continuous stream, we avoid the latency issues. Depending on the size and nature of the data, as well as network latency, we have so far seen speed improvements ranging from 2 to 10 times that of Recap/vlt.

"Grabbit" obviously refers to "grabbing" content from one CQ instance and copying it to another. However, it also refers to "Jackrabbit," the reference JCR implementation that the content is being copied to and from.

Runtime Dependencies

  • AEM/CQ v5.6.1
  • To run Grabbit in your AEM/CQ instance, you need to install a Fragment Bundle once per instance. It can be found here

Building

Full Clean Build & Install

gradlew clean build install refreshAllBundles

Pre-requisites

Installing Protocol Buffers Compiler

For Windows

Install the binary from https://github.com/google/protobuf/releases/download/v2.4.1/protoc-2.4.1-win32.zip and then add the directory that contains the extracted protoc binary to your PATH variable.

For Macs


brew tap homebrew/versions

brew install protobuf241

brew link --force --overwrite protobuf241

For both Windows and Mac: to verify that the installation was successful, run protoc --version; it should report version 2.4.1.

Running Grabbit

The grabbit.sh shell script can be used to initiate Grabbit jobs for a given Grabbit configuration and to check the status of those jobs.

  1. Create a configuration file with the path(s) you wish to copy (see the "Config Format" section below; you'll also set the server-side URL and credentials in this step)
  2. Execute the grabbit.sh script (it comes with the codebase and by default lives in the root of the project)
  3. Enter the full address for the client instance, including port (details for the server instance should be in the config file you created in step 1)
  4. Enter the username for the client instance
  5. Enter the password for the client instance
  6. Enter the path to the Grabbit config you created in step 1 [Screenshot: grabbit.sh prompts]
  7. Once the Grabbit content sync is kicked off, you will get a confirmation of the kicked-off jobs [Screenshot: job confirmation]

Config Format

A JSON configuration file in the following format is used to configure Grabbit.

{
    "serverUsername" : "<username>",
    "serverPassword" : "<password>",
    "serverHost" : "some.other.server",
    "serverPort" : "4502",
    "pathConfigurations" :  [
        {
            "path" : "/content/someContent"
        },
        {
            "path" : "/content/dam/someDamContent",
            "workflowConfigIds" :
                [
                    "/etc/workflow/launcher/config/update_asset_mod",
                    "/etc/workflow/launcher/config/update_asset_create",
                    "/etc/workflow/launcher/config/dam_xmp_nested_writeback",
                    "/etc/workflow/launcher/config/dam_xmp_writeback"
                ]
        }
    ]
}
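When many paths are involved, a configuration like the one above can also be generated rather than written by hand. The following is a minimal Python sketch and not part of Grabbit: the helper name `build_grabbit_config` is hypothetical, and it only emits the JSON keys that appear in the sample above.

```python
import json


def build_grabbit_config(username, password, host, port, paths):
    """Build a dict using the JSON keys from the sample config above.

    `paths` maps each content path to a list of workflow config IDs;
    an empty list means the "workflowConfigIds" key is omitted.
    This helper is hypothetical and not part of Grabbit itself.
    """
    path_configurations = []
    for path, workflow_ids in paths.items():
        entry = {"path": path}
        if workflow_ids:
            entry["workflowConfigIds"] = list(workflow_ids)
        path_configurations.append(entry)
    return {
        "serverUsername": username,
        "serverPassword": password,
        "serverHost": host,
        "serverPort": str(port),  # the sample config stores the port as a string
        "pathConfigurations": path_configurations,
    }


config = build_grabbit_config(
    "<username>", "<password>", "some.other.server", 4502,
    {
        "/content/someContent": [],
        "/content/dam/someDamContent": [
            "/etc/workflow/launcher/config/update_asset_mod",
        ],
    },
)

# Serialize the config; save this text to a file and give its path to grabbit.sh.
config_json = json.dumps(config, indent=4)
print(config_json)
```

Generating the file this way keeps the structure consistent across environments; the values themselves (host, credentials, paths) remain specific to your setup.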

Monitoring / Validating the Content Sync

You can validate / monitor sync by going to the following URI for each job on your Grabbit Client:

/grabbit/job/<jobId>.json

This will give you the status of a particular job, in the following format:

{
    "endTime": "Timestamp",
    "exitStatus": {
        "exitCode": "Code",
        "exitDescription": "",
        "running": "true/false"
    },
    "jcrNodesWritten": "#OfNodes",
    "jobExecutionId": "JobExecutionId",
    "path": "currentPath",
    "startTime": "TimeStamp",
    "timeTaken": "TimeInMilliSeconds"
}

A couple of points worth noting here:

  • "exitCode" : This can have 3 states: UNKNOWN, COMPLETED, FAILED
      • UNKNOWN : Job is still running
      • COMPLETED : Job was completed successfully
      • FAILED : Job failed
  • "jcrNodesWritten" : This indicates how many nodes are currently written (increments by 1000)
  • "timeTaken" : This will indicate the total time taken to complete content grab for currentPath

If exitCode returns as UNKNOWN, that means the job is still running and you should check for its status again.
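The "check again while UNKNOWN" rule lends itself to a small polling loop. The sketch below is illustrative only and not part of Grabbit: the function names are hypothetical, the only endpoint used is the `/grabbit/job/<jobId>.json` URI described above, and any authentication headers your instance requires are an assumption you would need to supply.

```python
import json
import time
import urllib.request


def job_state(status):
    """Map a job status dict to 'running', 'completed' or 'failed'."""
    exit_code = status["exitStatus"]["exitCode"]
    if exit_code == "UNKNOWN":
        return "running"
    if exit_code == "COMPLETED":
        return "completed"
    if exit_code == "FAILED":
        return "failed"
    raise ValueError("unexpected exitCode: %r" % exit_code)


def fetch_status(base_url, job_id, headers=None):
    """GET /grabbit/job/<jobId>.json from the Grabbit client instance.

    `headers` is an assumption: pass whatever authentication headers
    your instance needs.
    """
    request = urllib.request.Request(
        "%s/grabbit/job/%s.json" % (base_url, job_id),
        headers=headers or {},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(fetch, poll_seconds=5.0, sleep=time.sleep):
    """Call `fetch()` until the job leaves the UNKNOWN (running) state."""
    while True:
        status = fetch()
        state = job_state(status)
        if state != "running":
            return state, status
        sleep(poll_seconds)
```

A call might look like `wait_for_job(lambda: fetch_status(client_url, job_id))`, where `client_url` and `job_id` come from your own environment. Passing the fetch step in as a callable keeps the loop easy to exercise without a running instance.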

Sample of a real Grabbit Job status

[Screenshot: sample Grabbit job status]

General Layout

There are two primary components to Grabbit: a client and a server, which run in the CQ instances that you want to copy to and from, respectively. The server has a GET Servlet to handle requests to grab data, and the client has a POST Servlet to manage starting jobs. Normally the client Servlet is used via a form, but it can also be called with "curl" or the like for scripted remote management.

A recommended systems layout style is to have all content from a production publisher copied down to a staging "data warehouse" (DW) server to which all lower environments (beta, continuous integration, developer workstations, etc.) will connect. That way minimal load is placed on Production, and additional DW machines can be added to scale out if needed, each of which can grab from the "main" DW.

The client sends an HTTP GET request with a content path and "last grab time" to the server and receives a ProtoBuf stream of all the content below it that has changed. The client's BasicAuth credentials are used to create the JCR Session, so the client can never see content they don't have explicit access to.

There are a number of ways to tune how the client works, including specifying multiple focused paths, parallel or serial execution, JCR Session batch size (the number of nodes to cache before flushing to disk), etc.
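To illustrate what a session batch size means in practice, here is a minimal Python sketch of the general "write N nodes, then flush" pattern. It is not Grabbit's implementation: `write_node` and `save` are hypothetical callables standing in for "add a node to the session" and "flush the session to disk", and the batch size of 1000 is only an example value.

```python
def write_in_batches(nodes, write_node, save, batch_size):
    """Write nodes one by one, calling `save()` after every `batch_size` nodes.

    `write_node` and `save` are hypothetical stand-ins for adding a node
    to a session and flushing that session to disk.
    Returns the total number of nodes written.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    written = 0
    for node in nodes:
        write_node(node)
        written += 1
        if written % batch_size == 0:
            save()  # a full batch has accumulated, flush it
    if written % batch_size != 0:
        save()  # flush the final, partially filled batch
    return written


saves = []
total = write_in_batches(
    nodes=range(2500),
    write_node=lambda node: None,
    save=lambda: saves.append("save"),
    batch_size=1000,
)
print(total, len(saves))  # prints "2500 3"
```

A larger batch size means fewer flushes but more nodes held in memory between them, which is the trade-off this tuning option exposes.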

Library Attribution

LICENSE

Copyright 2015 Time Warner Cable, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
