Allow splitting the data portion into multiple files. #101

Open
Nodja opened this issue Jan 4, 2024 · 3 comments

Comments

@Nodja

Nodja commented Jan 4, 2024

I wanted to host a chat analytics page on GitHub Pages, since it seems ideal for hosting a static file like this, but the Discord server I'm using is almost 400MB of data and there's a 100MB file size limit. I can use LFS, but I've run into bandwidth issues in the past.

The solution would be to split the data into separate files.

I've managed to do this manually, but it would be ideal if this was supported natively, since my method is not optimal. Here's how I achieved it:

  • Generate a report.html file like normal
  • Extract the contents of the script tag with the 'data' id to a separate report.txt file. Leave the <script> tag there empty, as we're gonna update it later.
  • Split the file into 25MB chunks. I used a Python script:
split.py
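# Size of each chunk in bytes (25 MB stays under the 100 MB file size limit).
chunk_size = 25 * 1024 * 1024

prefix = "report"

# Read in binary mode so each chunk is an exact byte slice of the original.
with open(prefix + ".txt", "rb") as input_file:
    chunk_number = 1
    while True:
        chunk_data = input_file.read(chunk_size)
        if not chunk_data:
            break

        # Zero-padded names (report-001.txt, report-002.txt, ...) sort correctly.
        output_file_name = f"{prefix}-{chunk_number:03d}.txt"

        with open(output_file_name, "wb") as output_file:
            output_file.write(chunk_data)

        chunk_number += 1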
  • Load the split files using a new script tag. The script tag needs to be in the <head> tag so it loads before anything else. I used an external file like so: <script src="loader.js"></script>, but inline is fine.
loader.js
var files = [
  "report-001.txt",
  "report-002.txt",
  "report-003.txt",
  "report-004.txt",
  "report-005.txt"
  // etc.
];

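var data = "";

// Download the chunks in order. The third argument to open() is `false`,
// so each request is synchronous and blocks the page until it finishes.
for (let i = 0; i < files.length; i++) {
  let file = files[i];
  let rawFile = new XMLHttpRequest();
  rawFile.open("GET", file, false);
  rawFile.onreadystatechange = function () {
    if (rawFile.readyState === 4) {
      // Status 0 is what you get when the page is opened from file://
      if (rawFile.status === 200 || rawFile.status === 0) {
        data += rawFile.responseText;
      }
    }
  };
  rawFile.send(null);
}

// Put the reassembled data back into the emptied script tag.
var data_script = document.getElementById("data");
data_script.innerHTML = data;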
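
The extraction step in the list above (pulling the contents of the 'data' script tag out of report.html) could be scripted as well. The following is only a sketch: the name `extractDataScript` is made up for illustration, and it assumes the tag appears literally as `<script ... id="data" ...>` and that the data itself never contains the text `</script>`.

```javascript
// Hypothetical helper: separate the data payload from the rest of the report.
// Returns the payload and the remaining HTML with the tag left in place, empty.
function extractDataScript(html) {
  // Locate the opening tag; attribute order and other attributes are not assumed.
  const openMatch = /<script\b[^>]*\bid=["']data["'][^>]*>/i.exec(html);
  if (!openMatch) {
    throw new Error('no <script id="data"> tag found');
  }
  const start = openMatch.index + openMatch[0].length;

  // indexOf is used instead of a regex so very large payloads are scanned cheaply.
  const end = html.indexOf("</script>", start);
  if (end === -1) {
    throw new Error("data script tag is not closed");
  }

  return {
    data: html.slice(start, end), // goes into report.txt
    shell: html.slice(0, start) + html.slice(end), // report.html with an empty tag
  };
}
```

In Node, the two returned strings could then be written out with `fs.writeFileSync`, with `data` going to report.txt and `shell` replacing report.html.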

That's it. The only issue is that rendering is completely locked until all the chunks are downloaded. It would be best if loading was supported natively instead of a hack like this.
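
For comparison, here is a rough sketch of a loader that does not lock rendering, using `fetch` and `Promise.all` instead of synchronous requests. The name `loadChunks` and the injectable `fetchFn` parameter are assumptions made for this example, not part of the report. It joins the raw bytes before decoding, so a multi-byte character that happens to be cut by a chunk boundary still decodes correctly.

```javascript
// Sketch: download all chunks in parallel and return the reassembled text.
// `fetchFn` defaults to the browser's fetch and can be swapped out for testing.
async function loadChunks(urls, fetchFn = fetch) {
  const buffers = await Promise.all(
    urls.map(async (url) => {
      const response = await fetchFn(url);
      if (!response.ok) {
        throw new Error(`Failed to load ${url}: HTTP ${response.status}`);
      }
      return new Uint8Array(await response.arrayBuffer());
    })
  );

  // Promise.all keeps the results in the same order as `urls`,
  // so the chunks can simply be copied one after another.
  const total = buffers.reduce((sum, b) => sum + b.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    joined.set(b, offset);
    offset += b.length;
  }

  // Decode once, after joining, so chunk boundaries cannot split a character.
  return new TextDecoder("utf-8").decode(joined);
}
```

The catch is that whatever reads the 'data' script tag would have to wait for the returned promise before starting, which is why this probably needs native support rather than a drop-in script.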

@hopperelec
Contributor

I think it would be good if logical partitions were used, too. That way, only the data needed for whichever tab a user is viewing would be loaded, which would reduce bandwidth, since you can assume most viewers (at least of a publicly hosted report, which is the only use case I can think of for this) aren't going to be looking in every tab.

@mlomb
Owner

mlomb commented Jan 6, 2024

I think it's a good feature. We can add an option like "split data files into [X MB] parts" or something. We can do it when I eventually get to the configuration UI. I'll leave this open 😄
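
As a rough illustration of what a "split data files into [X MB] parts" option might do internally, here is a minimal sketch. Both function names and the `report-001.txt` naming scheme (borrowed from the manual approach earlier in the thread) are assumptions, not anything the project defines.

```javascript
// Hypothetical helper: split a Uint8Array into parts of at most maxPartBytes.
function splitIntoParts(bytes, maxPartBytes) {
  if (!Number.isInteger(maxPartBytes) || maxPartBytes <= 0) {
    throw new RangeError("maxPartBytes must be a positive integer");
  }
  const parts = [];
  for (let start = 0; start < bytes.length; start += maxPartBytes) {
    // subarray returns a view over the same memory, so nothing is copied.
    parts.push(bytes.subarray(start, start + maxPartBytes));
  }
  return parts;
}

// Hypothetical naming scheme: zero-based index 0 becomes "<prefix>-001.txt".
function partFileName(prefix, index) {
  return `${prefix}-${String(index + 1).padStart(3, "0")}.txt`;
}
```

With a user-supplied size in MB, the call would look like `splitIntoParts(data, sizeMB * 1024 * 1024)`, and every part except possibly the last would have exactly that size.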


> That way, only the data needed for whichever tab a user is viewing would be loaded

Every card analysis is generated on the fly using the full database, we can't "split it by tab"

@hopperelec
Contributor

> Every card analysis is generated on the fly using the full database, we can't "split it by tab"

I didn't mean splitting by tab; I meant splitting by data type. One file could have all the basic info, such as the fields below, which are used in just about all tabs:

config: Config;
generatedAt: string;
title: string;
langs: Language[];
time: {
    minDate: DateKey;
    maxDate: DateKey;
    // useful to know bucket sizes in advance
    numDays: number;
    numMonths: number;
    numYears: number;
};

/** Messages are serialized and must be read using `readMessage` or the `MessageView` class */
messages: Uint8Array;
numMessages: number;
bitConfig: MessageBitConfig;

but then another file could store words, which I believe is only used by the "Language" tab, and another file could store domains, which I believe is only used by the "Links" tab. Although, I do now realise that the majority of the data is probably just going to be words lol. However, I do still think it could be a good idea to store words in separate file(s), since most tabs don't require them, and only load those file(s) if the user visits a tab that does require them.
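
The "only load those file(s) if the user visits a tab" idea could look roughly like the sketch below. Everything named here is hypothetical: `createLazyParts`, the `fetchPart` callback, and part names such as "words" or "domains" are illustrations, not how the report works today.

```javascript
// Sketch: load each data part at most once, the first time something asks for it.
// `fetchPart(name)` must return a Promise that resolves to that part's data.
function createLazyParts(fetchPart) {
  const cache = new Map(); // part name -> Promise of its data

  return function getPart(name) {
    if (!cache.has(name)) {
      const promise = fetchPart(name).catch((err) => {
        // Forget failed downloads so a later call can retry.
        cache.delete(name);
        throw err;
      });
      cache.set(name, promise);
    }
    return cache.get(name);
  };
}
```

A tab that needs words would then call something like `getPart("words")` when it is opened, while tabs that never ask for it would never trigger the download. Caching the promise rather than the result means two tabs asking at the same time still share a single request.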
