A versatile, high-performance Natural Language Processing (NLP) toolkit written entirely in Go (Golang). The project provides a command-line utility for training and utilizing foundational NLP models, including Word2Vec embeddings, a sophisticated Mixture of Experts (MoE) model, and a practical Intent Classifier.
Note: This project is currently in a beta stage and is under active development. The API and functionality are subject to change. Accuracy is not the primary focus at this stage, as the main goal is to explore and implement these NLP models in Go.
- 🌐 Project Website
- ✨ Key Features
- 🚀 Getting Started
- 🛠️ Usage
- ⚙️ Project Structure
- 📊 Data & Configuration
- 🗺️ Roadmap
- 🤝 Contributing
- 📜 License
- 🙏 Special Thanks
- Why Go?
The application is structured as a dispatcher that runs specialized modules for various NLP tasks:
- Word2Vec Training: Generate high-quality distributed word representations (embeddings) from a text corpus.
- Mixture of Experts (MoE) Architecture: Train a powerful MoE model, designed for improved performance, scalability, and handling of complex sequential or structural data.
- Intent Classification: Develop a model for accurately categorizing user queries into predefined semantic intents.
- Efficient Execution: Built in Go, leveraging its performance and concurrency features for fast training and inference.
You need a working Go environment (version 1.25 or higher is recommended) installed on your system.
-
Clone the repository:
git clone https://github.com/golangast/nlptagger.git cd nlptagger
You can build the executable from the root of the project directory:
go build .
This will create an nlptagger
executable in the current directory.
The main executable (nlptagger
or main.go
) controls all operations using specific command-line flags. All commands should be run from the root directory of the project.
Use the respective flags to initiate the training process. Each flag executes a separate module located in the cmd/
directory.
Model | Flag | Command |
---|---|---|
Word2Vec | --train-word2vec |
go run main.go --train-word2vec |
Mixture of Experts (MoE) | --train-moe |
go run main.go --train-moe |
Intent Classifier | --train-intent-classifier |
go run main.go --train-intent-classifier |
To run predictions using a previously trained MoE model, use the --moe_inference
flag and pass the input query string.
Action | Flag | Command Example |
---|---|---|
MoE Inference | --moe_inference |
go run main.go --moe_inference "schedule a meeting with John for tomorrow at 2pm" |
To run the command generation logic based on intent predictions and NER/POS tags, use the example/main.go
with a query:
go run ./example/main.go -query "create a file named jack in folder named jill"
Expected Output:
--- Top 3 Parent Intent Predictions ---
- Word: webserver_creation (Confidence: 29.17%)
- Word: version_control (Confidence: 20.43%)
- Word: deployment (Confidence: 10.19%)
------------------------------------
--- Top 3 Child Intent Predictions ---
- Word: create (Confidence: 17.68%)
- Word: create_data_structure (Confidence: 14.66%)
- Word: create_server_and_handler (Confidence: 11.70%)
-----------------------------------
Description: The model's top prediction is an action related to webserver_creation, specifically to create.
Intent Sentence: Not found in training data.
--- POS Tagging Results ---
Tokens: [create a file named jack in folder named jill]
POS Tags: [CODE_VB DET IN NN NN IN NN VBN NN]
--- NER Tagging Results ---
Tokens: [create a file named jack in folder named jill]
NER Tags: [COMMAND DETERMINER OBJECT_TYPE NAME_PREFIX NAME PREPOSITION OBJECT_TYPE NAME_PREFIX NAME]
--- Generating Command ---
Generated Command: mkdir -p jill && touch jill/jack
This project is more than just command-line tools. It's a collection of Go packages. You can use these packages in your own Go projects.
Example usage is in the /example folder.
package main
import (
"encoding/json"
"flag"
"fmt"
"io"
"log"
"math" // Keep math for softmax
"math/rand"
"os"
"os/exec"
"sort"
"nlptagger/neural/moe"
mainvocab "nlptagger/neural/nnu/vocab"
"nlptagger/neural/tensor"
"nlptagger/neural/tokenizer"
"nlptagger/tagger/nertagger"
"nlptagger/tagger/postagger"
"nlptagger/tagger/tag"
)
// IntentTrainingExample represents a single training example for intent classification.
type IntentTrainingExample struct {
Query string `json:"query"`
ParentIntent string `json:"parent_intent"`
ChildIntent string `json:"child_intent"`
Description string `json:"description"`
Sentence string `json:"sentence"`
}
// IntentTrainingData represents the structure of the intent training data JSON.
type IntentTrainingData []IntentTrainingExample
// LoadIntentTrainingData loads the intent training data from a JSON file.
func LoadIntentTrainingData(filePath string) (*IntentTrainingData, error) {
file, err := os.Open(filePath)
if err != nil {
return nil, fmt.Errorf("failed to open training data file %s: %w", filePath, err)
}
defer file.Close()
bytes, err := io.ReadAll(file)
if err != nil {
return nil, fmt.Errorf("failed to read training data file %s: %w", filePath, err)
}
var data IntentTrainingData
err = json.Unmarshal(bytes, &data)
if err != nil {
return nil, fmt.Errorf("failed to unmarshal training data JSON from %s: %w", filePath, err)
}
return &data, nil
}
type Prediction struct {
TokenID int
Word string
Confidence float64
}
func getTopNPredictions(probabilities []float64, vocab []string, n int) []Prediction {
predictions := make([]Prediction, 0, len(probabilities))
for i, p := range probabilities {
if i < 2 { // Skip <pad> and UNK
continue
}
if i < len(vocab) {
word := vocab[i]
predictions = append(predictions, Prediction{
TokenID: i,
Word: word,
Confidence: p * 100.0,
})
}
}
// Sort predictions by confidence
sort.Slice(predictions, func(i, j int) bool {
return predictions[i].Confidence > predictions[j].Confidence
})
if len(predictions) < n {
return predictions
}
return predictions[:n]
}
var (
query = flag.String("query", "", "Query for MoE inference")
maxSeqLength = flag.Int("maxlen", 32, "Maximum sequence length")
)
func main() {
rand.Seed(1) // Seed the random number generator for deterministic behavior
flag.Parse()
if *query == "" {
log.Fatal("Please provide a query using the -query flag.")
}
// Define paths
const vocabPath = "gob_models/query_vocabulary.gob"
const moeModelPath = "gob_models/moe_classification_model.gob"
const parentIntentVocabPath = "gob_models/parent_intent_vocabulary.gob"
const childIntentVocabPath = "gob_models/child_intent_vocabulary.gob"
const intentTrainingDataPath = "trainingdata/intent_data.json"
// Load vocabularies
vocabulary, err := mainvocab.LoadVocabulary(vocabPath)
if err != nil {
log.Fatalf("Failed to set up input vocabulary: %v", err)
}
// Setup parent intent vocabulary
parentIntentVocabulary, err := mainvocab.LoadVocabulary(parentIntentVocabPath)
if err != nil {
log.Fatalf("Failed to set up parent intent vocabulary: %v", err)
}
// Setup child intent vocabulary
childIntentVocabulary, err := mainvocab.LoadVocabulary(childIntentVocabPath)
if err != nil {
log.Fatalf("Failed to set up child intent vocabulary: %v", err)
}
// Create tokenizer
tok, err := tokenizer.NewTokenizer(vocabulary)
if err != nil {
log.Fatalf("Failed to create tokenizer: %w", err)
}
// Load the trained MoEClassificationModel model
model, err := moe.LoadIntentMoEModelFromGOB(moeModelPath)
if err != nil {
log.Fatalf("Failed to load MoE model: %v", err)
}
// Load intent training data
intentTrainingData, err := LoadIntentTrainingData(intentTrainingDataPath)
if err != nil {
log.Fatalf("Failed to load intent training data: %v", err)
}
log.Printf("--- DEBUG: Parent Intent Vocabulary (TokenToWord): %v ---", parentIntentVocabulary.TokenToWord)
log.Printf("--- DEBUG: Child Intent Vocabulary (TokenToWord): %v ---", childIntentVocabulary.TokenToWord)
log.Printf("Running MoE inference for query: \"%s\"", *query)
// Encode the query
tokenIDs, err := tok.Encode(*query)
if err != nil {
log.Fatalf("Failed to encode query: %v", err)
}
// Pad or truncate the sequence to a fixed length
if len(tokenIDs) > *maxSeqLength {
tokenIDs = tokenIDs[:*maxSeqLength] // Truncate from the end
} else {
for len(tokenIDs) < *maxSeqLength {
tokenIDs = append(tokenIDs, vocabulary.PaddingTokenID) // Appends padding
}
}
inputData := make([]float64, len(tokenIDs))
for i, id := range tokenIDs {
inputData[i] = float64(id)
}
inputTensor := tensor.NewTensor([]int{1, len(inputData)}, inputData, false) // RequiresGrad=false for inference
// Create a dummy target tensor for inference, as the Forward method expects two inputs.
// The actual content of this tensor won't be used for parent/child intent classification.
dummyTargetTokenIDs := make([]float64, *maxSeqLength)
for i := 0; i < *maxSeqLength; i++ {
dummyTargetTokenIDs[i] = float64(vocabulary.PaddingTokenID)
}
dummyTargetTensor := tensor.NewTensor([]int{1, *maxSeqLength}, dummyTargetTokenIDs, false)
// Forward pass
parentLogits, childLogits, _, _, err := model.Forward(inputTensor, dummyTargetTensor)
if err != nil {
log.Fatalf("MoE model forward pass failed: %v", err)
}
// Interpret parent intent output
parentProbabilities := softmax(parentLogits.Data)
topParentPredictions := getTopNPredictions(parentProbabilities, parentIntentVocabulary.TokenToWord, 3)
fmt.Println("--- Top 3 Parent Intent Predictions ---")
for _, p := range topParentPredictions {
importance := ""
if p.Confidence > 50.0 {
importance = " (Important)"
}
fmt.Printf(" - Word: %-20s (Confidence: %.2f%%)%s\n", p.Word, p.Confidence, importance)
}
fmt.Println("------------------------------------")
// Interpret child intent output
childProbabilities := softmax(childLogits.Data)
topChildPredictions := getTopNPredictions(childProbabilities, childIntentVocabulary.TokenToWord, 3)
fmt.Println("--- Top 3 Child Intent Predictions ---")
for _, p := range topChildPredictions {
importance := ""
if p.Confidence > 50.0 {
importance = " (Important)"
}
fmt.Printf(" - Word: %-20s (Confidence: %.2f%%)%s\n", p.Word, p.Confidence, importance)
}
fmt.Println("-----------------------------------")
if len(topParentPredictions) > 0 && len(topChildPredictions) > 0 {
predictedParentWord := topParentPredictions[0].Word
predictedChildWord := topChildPredictions[0].Word
fmt.Printf("\nDescription: The model's top prediction is an action related to %s, specifically to %s.\n", predictedParentWord, predictedChildWord)
// Find and print the intent sentence
foundSentence := ""
for _, example := range *intentTrainingData {
if example.ParentIntent == predictedParentWord && example.ChildIntent == predictedChildWord {
foundSentence = example.Sentence
break
}
}
if foundSentence != "" {
fmt.Printf("Intent Sentence: %s\n", foundSentence)
} else {
fmt.Println("Intent Sentence: Not found in training data.")
}
}
// Perform POS tagging
posResult := postagger.Postagger(*query)
fmt.Println("\n--- POS Tagging Results ---")
fmt.Printf("Tokens: %v\n", posResult.Tokens)
fmt.Printf("POS Tags: %v\n", posResult.PosTag)
// Perform NER tagging
nerResult := nertagger.Nertagger(posResult)
fmt.Println("\n--- NER Tagging Results ---")
fmt.Printf("Tokens: %v\n", nerResult.Tokens)
fmt.Printf("NER Tags: %v\n", nerResult.NerTag)
// Generate and execute command based on NER/POS tags and intent predictions
fmt.Println("\n--- Generating Command ---")
command := generateCommand("file_system", topChildPredictions[0].Word, nerResult)
if command != "" {
fmt.Printf("Generated Command: %s\n", command)
// Execute the command
cmd := exec.Command("bash", "-c", command)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
err := cmd.Run()
if err != nil {
log.Printf("Error executing command: %v", err)
}
} else {
fmt.Println("Could not generate a command.")
}
}
// softmax applies the softmax function to a slice of float64.
func softmax(logits []float64) []float64 {
if len(logits) == 0 {
return []float64{}
}
maxLogit := logits[0]
for _, logit := range logits {
if logit > maxLogit {
maxLogit = logit
}
}
expSum := 0.0
for _, logit := range logits {
expSum += math.Exp(logit - maxLogit)
}
probabilities := make([]float64, len(logits))
for i, logit := range logits {
probabilities[i] = math.Exp(logit-maxLogit) / expSum
}
return probabilities
}
func generateCommand(parentIntent, childIntent string, nerResult tag.Tag) string {
switch parentIntent {
case "file_system":
switch childIntent {
case "create":
var fileName, folderName string
for i, tag := range nerResult.NerTag {
if tag == "OBJECT_TYPE" && nerResult.Tokens[i] == "file" {
if i+2 < len(nerResult.Tokens) && nerResult.NerTag[i+1] == "NAME_PREFIX" && nerResult.NerTag[i+2] == "NAME" {
fileName = nerResult.Tokens[i+2]
}
} else if tag == "OBJECT_TYPE" && nerResult.Tokens[i] == "folder" {
if i+2 < len(nerResult.Tokens) && nerResult.NerTag[i+1] == "NAME_PREFIX" && nerResult.NerTag[i+2] == "NAME" {
folderName = nerResult.Tokens[i+2]
}
}
}
if fileName != "" && folderName != "" {
return fmt.Sprintf("mkdir -p %s && touch %s/%s", folderName, folderName, fileName)
} else if fileName != "" {
return fmt.Sprintf("touch %s", fileName)
}
}
// Add other file_system child intents here (e.g., "delete", "read")
}
// Add other parent intents here
return ""
}
The output is usable and structured.
The neural/
and tagger/
directories contain the reusable components. Import them as needed.
The project is a collection of tools. Its structure reflects this.
nlptagger/
├── main.go # Dispatches to common tools.
├── go.mod # Go module definition.
├── cmd/ # Each subdirectory is a command-line tool.
│ ├── train_word2vec/ # Example: Word2Vec training.
│ └── moe_inference/ # Example: MoE inference.
├── neural/ # Core neural network code.
├── tagger/ # NLP tagging components.
├── trainingdata/ # Sample data for training.
└── gob_models/ # Saved models.
- Data Structure: Training modules look for data files in the
trainingdata/
directory. For example,intent_data.json
is used for intent classification training. - Configuration: Model hyperparameters (learning rate, epochs, vector size, etc.) are currently hardcoded within their respective training modules in the
cmd/
directory. This is an area for future improvement. - Model Output: Trained models are saved as
.gob
files to thegob_models/
directory by default.
This project is under active development. Here are some of the planned features and improvements:
- Implement comprehensive unit and integration tests.
- Add more NLP tasks (e.g., Named Entity Recognition, Part-of-Speech tagging).
- Externalize model configurations from code into files (e.g., YAML, JSON).
- Improve model accuracy and performance.
- Enhance documentation with more examples and API references.
- Create a more user-friendly command-line interface.
We welcome contributions! Please feel free to open issues for bug reports or feature requests, or submit pull requests for any enhancements.
- Fork the repository.
- Create a new branch (
git checkout -b feature/AmazingFeature
). - Commit your changes (
git commit -m '''Add AmazingFeature'''
). - Push to the branch (
git push origin feature/AmazingFeature
). - Open a Pull Request.
Note on Tests: There is currently a lack of automated tests. Contributions in this area are highly encouraged and appreciated!
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
- The Go Team and contributors for creating and maintaining Go.
Go is a great choice for this project for several reasons:
- Stability: The language has a strong compatibility promise. What you learn now will be useful for a long time. (Go 1 Compatibility Promise)
- Simplicity and Readability: Go's simple syntax makes it easy to read and maintain code.
- Performance: Go is a compiled language with excellent performance, which is crucial for NLP tasks.
- Concurrency: Go's built-in concurrency features make it easy to write concurrent code for data processing and model training.
- Strong Community and Ecosystem: Go has a growing community and a rich ecosystem of libraries and tools. (Go User Community)