Share: Notebook Utilities for Jupyter Notebooks

Share is a Python library designed to streamline workflows for machine learning engineers, data analysts, and data scientists working with Jupyter Notebooks. It provides a comprehensive set of utility functions for data preparation, validation, analysis, visualization, and file operations, all organized into modular classes for better maintainability and usability.

The core functionality is implemented in the NotebookUtilities class, which acts as a facade to delegate method calls to smaller, focused modules. This ensures backward compatibility while improving internal organization and scalability.

📂 Project Structure

The project is structured into modular components. Each component focuses on a specific aspect of notebook utilities.

To discover a natural structure, I performed a market basket analysis of all the instantiated functions in the notebooks containing them. I then used this to make a a directed graph, the edges representing the maximum confidence in each direction between pairs of functions (so that each node has one "in" edge and one "out" edge). I applied three generations of the Girvan-Newman algorithm to this graph. This analysis helped detect communities, which I used to organize the modules. (You can see some of this work in the Breaking Up notebook_utils notebook.)

Below is an overview of the structure:

notebook_utils.py: The original NotebookUtilities class, now a facade that delegates calls to the smaller, focused classes.
base_config.py: Contains the BaseConfig class, which provides common attributes and functions inherited by other modules.

Modules

1. Data Preparation

Module Name: data_preparation.py
Key Functions:
- get_first_year_element, get_evaluations, get_library_names, get_dir_tree, get_filename_from_url, get_style_column, get_td_parent, download_file, get_page_soup, get_page_tables, get_wiki_tables.
Description: Provides the DataPreparation class for functions related to data extraction and preparation workflows.

2. Data Validation

Module Name: data_validation.py
Key Functions:
- compute_similarity, conjunctify_nouns, check_for_typos, list_dfs_in_folder, get_relative_position, get_random_subdictionary, plot_inauguration_age.
Description: Contains the DataValidation class for functions related to validation and utility workflows.

3. Data Analysis and Visualization

Module Name: data_analysis.py
Key Functions:
- open_path_in_notepad, get_wiki_infobox_data_frame, get_inf_nan_mask, get_column_descriptions, modalize_columns, get_regexed_columns, get_regexed_dataframe, one_hot_encode, get_flattened_dictionary, get_numeric_columns, get_euclidean_distance, get_nearest_neighbor, get_minority_combinations, plot_line_with_error_bars, plot_histogram, get_r_squared_value_latex, get_spearman_rho_value_latex, first_order_linear_scatterplot.
Description: Provides the DataAnalysis class for functions related to data analysis and visualization.

4. File Operations

Module Name: file_operations.py
Key Functions:
- get_utility_file_functions, get_notebook_functions_dictionary, get_notebook_functions_set, show_duplicated_util_fns_search_string, show_dupl_fn_defs_search_string, delete_ipynb_checkpoint_folders, get_random_py_file, attempt_to_pickle, csv_exists, load_csv, pickle_exists, load_object, load_data_frames, save_data_frames, store_objects, get_random_function.
Description: Contains the FileOperations class for functions related to file cleanup and basic file operations.

5. Sequence Analysis

Module Name: sequence_analysis.py
Key Functions:
- split_list_by_gap, count_ngrams, get_sequences_by_count, split_list_by_exclusion, convert_strings_to_integers, get_ndistinct_subsequences, get_turbulence, replace_consecutive_elements, count_swaps_to_perfect_order, get_color_cycled_list, plot_sequence, plot_sequences.
Description: Contains the SequenceAnalysis class for analyzing, manipulating, and visualizing sequences, including numeric, string, and plotting utilities.

🚀 Features

Backward Compatibility: The NotebookUtilities class retains its original interface, ensuring existing notebooks can continue using it without modification.
Modular Design: Functionality is split into smaller, focused classes for better organization and maintainability.
Comprehensive Utility Functions: Covers a wide range of tasks, including data preparation, validation, analysis, visualization, and file operations.
Scalable and Extensible: The modular structure makes it easy to add new functionality or extend existing features.

📦 Installation

To use the share library, clone the repository and install the required dependencies:

Clone the repository and add it as a submodule to your notebooks:

repo="path/to/your/notebooks"
cd "$repo"
cd ../
git clone https://github.com/dbabbitt/share.git
if [ -d "$repo/.git" ]; then
    cd "$repo"
    repo_name=$(basename "$repo")
    # Update the submodule if it already exists
    if [ -d "share" ]; then
        echo "Updating share submodule in repository: $repo_name..."
        git submodule update --remote --merge
    # Add it if the submodule doesn't exist
    else
        echo "Adding share submodule in repository: $repo_name..."
        git submodule add -f https://github.com/dbabbitt/share.git share
        git commit -m "Added share submodule"
    fi
fi

Navigate to the project directory:
```
cd share
```
Install dependencies:
```
pip install -r requirements.txt
```

To remove the share submodule from your repository:

repo="path/to/your/repository"
if [ -d "$repo/.git" ]; then
    cd "$repo"

    # Check if the submodule exists
    if [ -d "share" ]; then
        repo_name=$(basename "$repo")
        echo "Removing submodule in repository: $repo_name..."
        git config -f .gitmodules --remove-section submodule.share
        git rm --cached share

        # Fully remove the share folder
        rm -rf share

        git commit -m "Removed submodule share"
    fi

fi

🛠️ Usage

Here’s how you can use the NotebookUtilities class in your Jupyter notebook:

Import the NotebookUtilities class:

import os.path as osp
import os
import sys

# Assuming here you're running this in a Jupyter notebook cell in a subfolder
shared_folder = osp.abspath(osp.join(os.pardir, 'share'))
assert osp.exists(shared_folder), f"The share submodule is not at {shared_folder}"
if shared_folder not in sys.path: sys.path.insert(1, shared_folder)

from notebook_utils import NotebookUtilities

Initialize the class:

nu = NotebookUtilities(

    # This will create a data folder if there isn't one already
    data_folder_path=osp.abspath(osp.join(os.pardir, 'data')),

    # This will create a saves folder if there isn't one already
    saves_folder_path=osp.abspath(osp.join(os.pardir, 'saves'))

)

Use the utility functions as needed:

import random
from tqdm import tqdm
from colormath.color_objects import LabColor

# Generate a random fixed point in the CIELAB color space
ranges = [(0, 100), (-128, 127), (-128, 127)]
fixed_point = tuple([random.uniform(a, b) for a, b in ranges])

# Generate a random number of additional points to spread, between 1 and 15
point_count = random.randint(1, 15)

# Initialize a list to store trial results
trials = []
total_trials = 5  # You can adjust this based on your patience

# Perform trials to find the best spread of points
with tqdm(total=total_trials) as pbar:
    while len(trials) < total_trials:

        # Attempt to spread the points evenly within a unit cube
        try:
            spread_points = nu.spread_points_in_cube(
                point_count, fixed_point, *ranges, verbose=False
            )

            # Ensure spread points have all unique XKCD names
            xkcd_set = set()
            for lab_color in spread_points:
                rgb_color = nu.lab_to_rgb(LabColor(*lab_color))
                nearest_neighbor = nu.get_nearest_neighbor(rgb_color, nu.xkcd_colors)
                xkcd_set.add(nu.nearest_xkcd_name_dict[nearest_neighbor])
            if len(xkcd_set) == len(spread_points):

                # Measure how far the points are from the fixed point
                spread_value = nu.calculate_spread(spread_points[1:], fixed_point, verbose=False)

                # Store the result as a tuple of (spread_points, spread_value)
                trial_tuple = (spread_points, spread_value)
                trials.append(trial_tuple)  # Add a new trial

                # Update the progress bar
                pbar.update(1)  # Increment the progress bar by 1

        # If an error occurs (e.g., a spread point too close to black or white), skip this trial
        except Exception:
            continue

# Select the trial with points as far away from the fixed point as possible
trial_tuple = max(trials, key=lambda x: x[1])

# Extract the spread points from the best trial
spread_points = [nu.lab_to_rgb(LabColor(*lab_color)) for lab_color in trial_tuple[0]]

# Notice the colors are well-spaced in the pie chart but not in the 3D scatter plot
nu.inspect_spread_points(spread_points, verbose=False)

# Count the occurrences of a sequence of elements (n-grams) in a list
actions = ['jump', 'run', 'jump', 'run', 'jump']
ngrams = ['jump', 'run']
nu.count_ngrams(actions, ngrams)  # 2

# Convert a sequence of strings into a sequence of integers and a mapping dictionary
sequence = ['apple', 'banana', 'apple', 'cherry']
new_sequence, mapping = nu.convert_strings_to_integers(sequence)
print(new_sequence)  # array([0, 1, 0, 2])
print(mapping)  # {'apple': 0, 'banana': 1, 'cherry': 2}

# Take a list of strings with indentation and prefix them based on a
# multi-level list style
level_count = 8
text_list = [
    ' ' * (i*4) + f'This is level {i}'
    for i in range(level_count+1)
]
text_list += [
    ' ' * (i*4) + f'This is level {i} again'
    for i in range(level_count, -1, -1)
]
numbered_list = nu.apply_multilevel_numbering(
    text_list,
    level_map={
        0: "", 4: "I. ", 8: "A. ", 12: "1. ", 16: "a. ",
        20: "I) ", 24: "A) ", 28: "1) ", 32: "a) "
    }, add_indent_back_in=True
)
for line in numbered_list:
    print(line)

# Get the absolute file path where a function object is stored
import os.path as osp
my_function = lambda: None
file_path = nu.get_function_file_path(my_function)
print(osp.abspath(file_path))

# Calculate the Euclidean distance between two colors in RGB space
print(nu.color_distance_from('white', (255, 0, 0)))  # 360.62445840513925
print(nu.color_distance_from(
    '#0000FF', (255, 0, 0)
))  # 360.62445840513925

# Generate a step-by-step description of how to perform a function by hand from its comments
nu.describe_procedure(nu.describe_procedure)

# Check the closest names for typos by comparing items from two different lists
commonly_misspelled_words = ["absence", "consensus", "definitely", "broccoli", "necessary"]
common_misspellings = ["absense", "concensus", "definately", "brocolli", "neccessary"]
typos_df = nu.check_for_typos(
   commonly_misspelled_words,
   common_misspellings,
   rename_dict={'left_item': 'commonly_misspelled', 'right_item': 'common_misspelling'}
).sort_values(
   ['max_similarity', 'commonly_misspelled', 'common_misspelling'],
   ascending=[False, True, True]
)
mask_series = (typos_df.max_similarity < 1.0)
typos_df[mask_series]

# Open a file in Notepad
nu.open_path_in_notepad(r'C:\this_example.txt')

# Retrieve <table>s from a given URL as a list of DataFrames
tables_url = 'https://en.wikipedia.org/wiki/'
tables_url += 'Provinces_of_Afghanistan'
page_tables_list = nu.get_page_tables(tables_url, verbose=True)
page_tables_list[2]

import numpy as np
import pandas as pd

# Create a new column in a DataFrame representing
# the modal value of the D and E columns
df = pd.DataFrame({
'A': [1, 2, 3], 'B': [1.1, 2.2, 3.3], 'C': ['a', 'b', 'c']
})
df['D'] = pd.Series([np.nan, 2, np.nan])
df['E'] = pd.Series([1, np.nan, 3])
df = nu.modalize_columns(df, ['D', 'E'], 'F')
display(df)
assert all(df['A'] == df['F'])

# Identify numeric columns in a DataFrame
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3], 'B': [1.1, 2.2, 3.3], 'C': ['a', 'b', 'c']
})
nu.get_numeric_columns(df)  # ['A', 'B']

# Introspect a Python module to discover available functions and classes programmatically
module_name = 'nu'
import_call = '''
from notebook_utils import NotebookUtilities
nu = NotebookUtilities(
    data_folder_path=osp.abspath(osp.join(os.pardir, 'data')),
    saves_folder_path=osp.abspath(osp.join(os.pardir, 'saves'))
)'''
nu_functions = nu.get_dir_tree(
    module_name, function_calls=[], contains_str='_regex',
    import_call=import_call, recurse_modules=True, level=3,
    verbose=False
)
sorted(nu_functions, key=lambda x: x[::-1])[:6]

# Create a standard sequence plot where each element corresponds to a position on the y-axis
import matplotlib.pyplot as plt
import random

# Define the sequence of user actions
sequence = ["SESSION_START", "LOGIN", "VIEW_PRODUCT"]

# Generate more shopping elements
for _ in range(19):
    if sequence[-1] != 'ADD_TO_CART':
        sequence.append(random.choice(['VIEW_PRODUCT', 'ADD_TO_CART']))
    else:
        sequence.append('VIEW_PRODUCT')

# Finish up the shopping
sequence += ["LOGOUT", "SESSION_END"]

# Define n-grams to highlight
highlighted_ngrams = [["VIEW_PRODUCT", "ADD_TO_CART"]]

# Define a custom color dictionary for the actions
color_dict = {
    "SESSION_START": "green",
    "LOGIN": "blue",
    "VIEW_PRODUCT": "orange",
    "ADD_TO_CART": "purple",
    "LOGOUT": "red",
    "SESSION_END": "black"
}

# Plot the sequence
fig, ax = nu.plot_sequence(
    sequence=sequence,
    highlighted_ngrams=highlighted_ngrams,
    color_dict=color_dict,
    suptitle="User Session Sequence",
    verbose=False
)

# Show the plot
plt.show()

For detailed examples and use cases, refer to the documentation (TODO: write docs).

🧑‍💻 Contributing

We welcome contributions to improve the share library! To contribute:

Fork the repository.
Create a new branch for your feature or bug fix:
```
git checkout -b feature-name
```
Commit your changes:
```
git commit -m "Add feature-name"
```
Push to your branch:
```
git push origin feature-name
```
Open a pull request on GitHub.

🐛 Reporting Issues

If you encounter any bugs or have feature requests, please open an issue in the Issues section of the repository. Be sure to include:

A clear description of the issue.
Steps to reproduce the problem.
Any relevant logs or screenshots.

🛡️ License

This project is licensed under the MIT License. Feel free to use, modify, and distribute it as per the terms of the license.

🌟 Support

If you find this project helpful, please consider giving it a ⭐ on GitHub to show your support!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Share: Notebook Utilities for Jupyter Notebooks

📂 Project Structure

Modules

1. Data Preparation

2. Data Validation

3. Data Analysis and Visualization

4. File Operations

5. Sequence Analysis

🚀 Features

📦 Installation

🛠️ Usage

🧑‍💻 Contributing

🐛 Reporting Issues

🛡️ License

🌟 Support

Files

README.md

Latest commit

History

README.md

File metadata and controls

Share: Notebook Utilities for Jupyter Notebooks

📂 Project Structure

Modules

1. Data Preparation

2. Data Validation

3. Data Analysis and Visualization

4. File Operations

5. Sequence Analysis

🚀 Features

📦 Installation

🛠️ Usage

🧑‍💻 Contributing

🐛 Reporting Issues

🛡️ License

🌟 Support