Allow splitting the data portion into multiple files. #101

Nodja · 2024-01-04T01:57:23Z

I wanted to host a chat analytics page on GitHub Pages, since it seems ideal for hosting a static file like this, but the Discord server I'm using produces almost 400MB of data and there's a 100MB file size limit. I could use LFS, but I've run into bandwidth issues in the past.

The solution would be to split the data into separate files.

I've managed to do this manually, but it would be ideal if this was supported natively, since my method is not optimal. Here's how I achieved it:

Generate a report.html file like normal.
Extract the contents of the script tag with the 'data' id to a separate report.txt file. Leave the <script> tag itself in place but empty, as we're gonna fill it in later.
Split the file into 25MB chunks. I used a Python script:

split.py

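# Size of each chunk: 25MB, well under the 100MB file size limit
chunk_size = 25 * 1024 * 1024

# Base name of the input file (report.txt) and of the chunks (report-001.txt, ...)
prefix = "report"

with open(prefix + ".txt", "rb") as input_file:
    chunk_number = 1
    while True:
        chunk_data = input_file.read(chunk_size)
        if not chunk_data:
            # Reached the end of the input file
            break

        # Zero-padded numbers keep the chunk names in order
        output_file_name = f"{prefix}-{chunk_number:03d}.txt"

        with open(output_file_name, "wb") as output_file:
            output_file.write(chunk_data)

        chunk_number += 1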

Load the split files using a new script tag. It needs to be in the <head> tag so it loads before anything else. I used an external file like so: <script src="loader.js"></script>, but inline is fine.

loader.js

var files = [
  "report-001.txt",
  "report-002.txt",
  "report-003.txt",
  "report-004.txt",
  "report-005.txt"
  // etc.
];

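// Concatenated contents of all chunks, in order
var data = "";

for (let i = 0; i < files.length; i++) {
  let file = files[i];
  let rawFile = new XMLHttpRequest();
  // Third argument false = synchronous request: the page is blocked
  // until this chunk has finished downloading
  rawFile.open("GET", file, false);
  rawFile.onreadystatechange = function () {
    // readyState 4 means the request is done
    if (rawFile.readyState === 4) {
      // Status 0 typically shows up when the page is opened from a local file
      if (rawFile.status === 200 || rawFile.status === 0) {
        data += rawFile.responseText;
      }
    }
  };
  rawFile.send(null);
}

// Put the reassembled data back into the emptied <script id="data"> tag
var data_script = document.getElementById("data");
data_script.innerHTML = data;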

That's it. The only issue is that rendering is completely locked until all the chunks are downloaded. It would be best if loading was supported natively instead of a hack like this.
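Before uploading, the split step can be sanity-checked. Below is a minimal sketch, assuming the same report.txt input and naming scheme as split.py above: it returns the chunk paths, checks that the chunks hash to the same value as the original when read in order, and renders the `files` array for loader.js. The function names (`split_file`, `chunks_match_source`, `loader_file_list`) are made up for illustration and are not part of chat-analytics.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List


def split_file(source: Path, chunk_size: int) -> List[Path]:
    """Split `source` into numbered chunks next to it and return their paths."""
    chunks = []
    with source.open("rb") as input_file:
        number = 1
        while True:
            data = input_file.read(chunk_size)
            if not data:
                break
            # Same naming scheme as split.py: report-001.txt, report-002.txt, ...
            chunk_path = source.with_name(f"{source.stem}-{number:03d}{source.suffix}")
            chunk_path.write_bytes(data)
            chunks.append(chunk_path)
            number += 1
    return chunks


def sha256_of(paths: Iterable[Path]) -> str:
    """Hash the given files as if they were one concatenated stream."""
    digest = hashlib.sha256()
    for path in paths:
        with path.open("rb") as f:
            # Read in 1MB blocks so large files are never fully held in memory
            for block in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(block)
    return digest.hexdigest()


def chunks_match_source(source: Path, chunks: List[Path]) -> bool:
    """Return True if the chunks, read in order, hash the same as the source."""
    return sha256_of(chunks) == sha256_of([source])


def loader_file_list(chunks: List[Path]) -> str:
    """Render the chunk names as the `files` array used by loader.js."""
    names = [chunk.name for chunk in chunks]
    return "var files = " + json.dumps(names, indent=2) + ";"
```

Calling `split_file(Path("report.txt"), 25 * 1024 * 1024)` would write the same chunks as split.py, and printing `loader_file_list(...)` on the result avoids typing the file names into loader.js by hand.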


hopperelec · 2024-01-04T02:03:03Z

I think it would be good if logical partitions were used, too. That way, only the data needed for whichever tab a user is viewing would be loaded. This would reduce bandwidth, since you can assume most viewers (at least of a publicly hosted report, which is the only use case I can think of for this) aren't going to look at every tab.

mlomb · 2024-01-06T19:05:21Z

I think it's a good feature. We can add an option like "split data files into [X MB] parts" or something. We can do it when I eventually get to the configuration UI, so I'll leave this open 😄

> That way, only the data needed for whichever tab a user is viewing would be loaded

Every card analysis is generated on the fly using the full database, so we can't "split it by tab".

hopperelec · 2024-01-07T03:44:32Z

> Every card analysis is generated on the fly using the full database, we can't "split it by tab"

I didn't mean splitting by tab; I meant splitting by data type. One file could hold all the basic info, such as the following, which is used in just about every tab:

chat-analytics/pipeline/process/Types.ts

Lines 12 to 26 in 055c68c

    
    config: Config;
    generatedAt: string;
    title: string;
    langs: Language[];
    time: {
        minDate: DateKey;
        maxDate: DateKey;
        // useful to know bucket sizes in advance
        numDays: number;
        numMonths: number;
        numYears: number;
    };

chat-analytics/pipeline/process/Types.ts

Lines 37 to 40 in 055c68c

    
    /** Messages are serialized and must be read using `readMessage` or the `MessageView` class */
    messages: Uint8Array;
    numMessages: number;
    bitConfig: MessageBitConfig;

but then another file could store words, which I believe is only used by the "Language" tab, and another file could store domains, which I believe is only used by the "Links" tab. Although, I do now realise that the majority of the data is probably just going to be words lol. However, I do still think it could be a good idea to store words in separate file(s) since most tabs don't require them, and only load those file(s) if the user visits a tab which does require them.
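To make the idea concrete, here is a rough sketch of the bookkeeping that loading by data type would need. It is written in Python to match split.py, although a real implementation would live in the report's own frontend code. It assumes a hypothetical "core" group (the basic info and serialized messages shown above) plus "words" and "domains" groups; only the words/"Language" and domains/"Links" associations come from this thread, and everything else, including the function names, is an assumption. Whether the card analyses can actually run on a partially loaded database is an open question this sketch does not answer.

```python
from typing import Dict, List, Set

# Hypothetical mapping from tab name to the extra data groups it needs.
# Only these two associations are mentioned in the discussion.
EXTRA_GROUPS_BY_TAB: Dict[str, List[str]] = {
    "Language": ["words"],
    "Links": ["domains"],
}


def groups_needed(tab: str) -> List[str]:
    """Return the data groups a tab needs, always starting with "core"."""
    return ["core"] + EXTRA_GROUPS_BY_TAB.get(tab, [])


def groups_to_fetch(tab: str, loaded: Set[str]) -> List[str]:
    """Return the groups still missing for `tab` and mark them as loaded."""
    missing = [group for group in groups_needed(tab) if group not in loaded]
    loaded.update(missing)
    return missing
```

With an empty `loaded` set, the first tab opened fetches "core" (plus its own extras, if any), and later visits to the "Language" or "Links" tab only fetch "words" or "domains" once.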
