Share is a Python library designed to streamline workflows for machine learning engineers, data analysts, and data scientists working with Jupyter Notebooks. It provides a comprehensive set of utility functions for data preparation, validation, analysis, visualization, and file operations, all organized into modular classes for better maintainability and usability.
The core functionality is implemented in the NotebookUtilities
class, which acts as a facade to delegate method calls to smaller, focused modules. This ensures backward compatibility while improving internal organization and scalability.
The project is structured into modular components. Each component focuses on a specific aspect of notebook utilities.
To discover a natural structure, I performed a market basket analysis of all the instantiated functions in the notebooks containing them. I then used this to make a a directed graph, the edges representing the maximum confidence in each direction between pairs of functions (so that each node has one "in" edge and one "out" edge). I applied three generations of the Girvan-Newman algorithm to this graph. This analysis helped detect communities, which I used to organize the modules. (You can see some of this work in the Breaking Up notebook_utils notebook.)
Below is an overview of the structure:
notebook_utils.py
: The originalNotebookUtilities
class, now a facade that delegates calls to the smaller, focused classes.base_config.py
: Contains theBaseConfig
class, which provides common attributes and functions inherited by other modules.
- Module Name:
data_preparation.py
- Key Functions:
get_first_year_element
,get_evaluations
,get_library_names
,get_dir_tree
,get_filename_from_url
,get_style_column
,get_td_parent
,download_file
,get_page_soup
,get_page_tables
,get_wiki_tables
.
- Description: Provides the
DataPreparation
class for functions related to data extraction and preparation workflows.
- Module Name:
data_validation.py
- Key Functions:
compute_similarity
,conjunctify_nouns
,check_for_typos
,list_dfs_in_folder
,get_relative_position
,get_random_subdictionary
,plot_inauguration_age
.
- Description: Contains the
DataValidation
class for functions related to validation and utility workflows.
- Module Name:
data_analysis.py
- Key Functions:
open_path_in_notepad
,get_wiki_infobox_data_frame
,get_inf_nan_mask
,get_column_descriptions
,modalize_columns
,get_regexed_columns
,get_regexed_dataframe
,one_hot_encode
,get_flattened_dictionary
,get_numeric_columns
,get_euclidean_distance
,get_nearest_neighbor
,get_minority_combinations
,plot_line_with_error_bars
,plot_histogram
,get_r_squared_value_latex
,get_spearman_rho_value_latex
,first_order_linear_scatterplot
.
- Description: Provides the
DataAnalysis
class for functions related to data analysis and visualization.
- Module Name:
file_operations.py
- Key Functions:
get_utility_file_functions
,get_notebook_functions_dictionary
,get_notebook_functions_set
,show_duplicated_util_fns_search_string
,show_dupl_fn_defs_search_string
,delete_ipynb_checkpoint_folders
,get_random_py_file
,attempt_to_pickle
,csv_exists
,load_csv
,pickle_exists
,load_object
,load_data_frames
,save_data_frames
,store_objects
,get_random_function
.
- Description: Contains the
FileOperations
class for functions related to file cleanup and basic file operations.
- Module Name:
sequence_analysis.py
- Key Functions:
split_list_by_gap
,count_ngrams
,get_sequences_by_count
,split_list_by_exclusion
,convert_strings_to_integers
,get_ndistinct_subsequences
,get_turbulence
,replace_consecutive_elements
,count_swaps_to_perfect_order
,get_color_cycled_list
,plot_sequence
,plot_sequences
.
- Description: Contains the
SequenceAnalysis
class for analyzing, manipulating, and visualizing sequences, including numeric, string, and plotting utilities.
- Backward Compatibility: The
NotebookUtilities
class retains its original interface, ensuring existing notebooks can continue using it without modification. - Modular Design: Functionality is split into smaller, focused classes for better organization and maintainability.
- Comprehensive Utility Functions: Covers a wide range of tasks, including data preparation, validation, analysis, visualization, and file operations.
- Scalable and Extensible: The modular structure makes it easy to add new functionality or extend existing features.
To use the share
library, clone the repository and install the required dependencies:
- Clone the repository and add it as a submodule to your notebooks:
repo="path/to/your/notebooks" cd "$repo" cd ../ git clone https://github.com/dbabbitt/share.git if [ -d "$repo/.git" ]; then cd "$repo" repo_name=$(basename "$repo") # Update the submodule if it already exists if [ -d "share" ]; then echo "Updating share submodule in repository: $repo_name..." git submodule update --remote --merge # Add it if the submodule doesn't exist else echo "Adding share submodule in repository: $repo_name..." git submodule add -f https://github.com/dbabbitt/share.git share git commit -m "Added share submodule" fi fi
- Navigate to the project directory:
cd share
- Install dependencies:
pip install -r requirements.txt
To remove the share
submodule from your repository:
repo="path/to/your/repository"
if [ -d "$repo/.git" ]; then
cd "$repo"
# Check if the submodule exists
if [ -d "share" ]; then
repo_name=$(basename "$repo")
echo "Removing submodule in repository: $repo_name..."
git config -f .gitmodules --remove-section submodule.share
git rm --cached share
# Fully remove the share folder
rm -rf share
git commit -m "Removed submodule share"
fi
fi
Hereβs how you can use the NotebookUtilities
class in your Jupyter notebook:
- Import the
NotebookUtilities
class:import os.path as osp import os import sys # Assuming here you're running this in a Jupyter notebook cell in a subfolder shared_folder = osp.abspath(osp.join(os.pardir, 'share')) assert osp.exists(shared_folder), f"The share submodule is not at {shared_folder}" if shared_folder not in sys.path: sys.path.insert(1, shared_folder) from notebook_utils import NotebookUtilities
- Initialize the class:
nu = NotebookUtilities( # This will create a data folder if there isn't one already data_folder_path=osp.abspath(osp.join(os.pardir, 'data')), # This will create a saves folder if there isn't one already saves_folder_path=osp.abspath(osp.join(os.pardir, 'saves')) )
- Use the utility functions as needed:
import random from tqdm import tqdm from colormath.color_objects import LabColor # Generate a random fixed point in the CIELAB color space ranges = [(0, 100), (-128, 127), (-128, 127)] fixed_point = tuple([random.uniform(a, b) for a, b in ranges]) # Generate a random number of additional points to spread, between 1 and 15 point_count = random.randint(1, 15) # Initialize a list to store trial results trials = [] total_trials = 5 # You can adjust this based on your patience # Perform trials to find the best spread of points with tqdm(total=total_trials) as pbar: while len(trials) < total_trials: # Attempt to spread the points evenly within a unit cube try: spread_points = nu.spread_points_in_cube( point_count, fixed_point, *ranges, verbose=False ) # Ensure spread points have all unique XKCD names xkcd_set = set() for lab_color in spread_points: rgb_color = nu.lab_to_rgb(LabColor(*lab_color)) nearest_neighbor = nu.get_nearest_neighbor(rgb_color, nu.xkcd_colors) xkcd_set.add(nu.nearest_xkcd_name_dict[nearest_neighbor]) if len(xkcd_set) == len(spread_points): # Measure how far the points are from the fixed point spread_value = nu.calculate_spread(spread_points[1:], fixed_point, verbose=False) # Store the result as a tuple of (spread_points, spread_value) trial_tuple = (spread_points, spread_value) trials.append(trial_tuple) # Add a new trial # Update the progress bar pbar.update(1) # Increment the progress bar by 1 # If an error occurs (e.g., a spread point too close to black or white), skip this trial except Exception: continue # Select the trial with points as far away from the fixed point as possible trial_tuple = max(trials, key=lambda x: x[1]) # Extract the spread points from the best trial spread_points = [nu.lab_to_rgb(LabColor(*lab_color)) for lab_color in trial_tuple[0]] # Notice the colors are well-spaced in the pie chart but not in the 3D scatter plot nu.inspect_spread_points(spread_points, verbose=False)
# Count the occurrences of a sequence of elements (n-grams) in a list actions = ['jump', 'run', 'jump', 'run', 'jump'] ngrams = ['jump', 'run'] nu.count_ngrams(actions, ngrams) # 2
# Convert a sequence of strings into a sequence of integers and a mapping dictionary sequence = ['apple', 'banana', 'apple', 'cherry'] new_sequence, mapping = nu.convert_strings_to_integers(sequence) print(new_sequence) # array([0, 1, 0, 2]) print(mapping) # {'apple': 0, 'banana': 1, 'cherry': 2}
# Take a list of strings with indentation and prefix them based on a # multi-level list style level_count = 8 text_list = [ ' ' * (i*4) + f'This is level {i}' for i in range(level_count+1) ] text_list += [ ' ' * (i*4) + f'This is level {i} again' for i in range(level_count, -1, -1) ] numbered_list = nu.apply_multilevel_numbering( text_list, level_map={ 0: "", 4: "I. ", 8: "A. ", 12: "1. ", 16: "a. ", 20: "I) ", 24: "A) ", 28: "1) ", 32: "a) " }, add_indent_back_in=True ) for line in numbered_list: print(line)
# Get the absolute file path where a function object is stored import os.path as osp my_function = lambda: None file_path = nu.get_function_file_path(my_function) print(osp.abspath(file_path))
# Calculate the Euclidean distance between two colors in RGB space print(nu.color_distance_from('white', (255, 0, 0))) # 360.62445840513925 print(nu.color_distance_from( '#0000FF', (255, 0, 0) )) # 360.62445840513925
# Generate a step-by-step description of how to perform a function by hand from its comments nu.describe_procedure(nu.describe_procedure)
# Check the closest names for typos by comparing items from two different lists commonly_misspelled_words = ["absence", "consensus", "definitely", "broccoli", "necessary"] common_misspellings = ["absense", "concensus", "definately", "brocolli", "neccessary"] typos_df = nu.check_for_typos( commonly_misspelled_words, common_misspellings, rename_dict={'left_item': 'commonly_misspelled', 'right_item': 'common_misspelling'} ).sort_values( ['max_similarity', 'commonly_misspelled', 'common_misspelling'], ascending=[False, True, True] ) mask_series = (typos_df.max_similarity < 1.0) typos_df[mask_series]
# Open a file in Notepad nu.open_path_in_notepad(r'C:\this_example.txt')
# Retrieve <table>s from a given URL as a list of DataFrames tables_url = 'https://en.wikipedia.org/wiki/' tables_url += 'Provinces_of_Afghanistan' page_tables_list = nu.get_page_tables(tables_url, verbose=True) page_tables_list[2]
import numpy as np import pandas as pd # Create a new column in a DataFrame representing # the modal value of the D and E columns df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [1.1, 2.2, 3.3], 'C': ['a', 'b', 'c'] }) df['D'] = pd.Series([np.nan, 2, np.nan]) df['E'] = pd.Series([1, np.nan, 3]) df = nu.modalize_columns(df, ['D', 'E'], 'F') display(df) assert all(df['A'] == df['F'])
# Identify numeric columns in a DataFrame import pandas as pd df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [1.1, 2.2, 3.3], 'C': ['a', 'b', 'c'] }) nu.get_numeric_columns(df) # ['A', 'B']
# Introspect a Python module to discover available functions and classes programmatically module_name = 'nu' import_call = ''' from notebook_utils import NotebookUtilities nu = NotebookUtilities( data_folder_path=osp.abspath(osp.join(os.pardir, 'data')), saves_folder_path=osp.abspath(osp.join(os.pardir, 'saves')) )''' nu_functions = nu.get_dir_tree( module_name, function_calls=[], contains_str='_regex', import_call=import_call, recurse_modules=True, level=3, verbose=False ) sorted(nu_functions, key=lambda x: x[::-1])[:6]
# Create a standard sequence plot where each element corresponds to a position on the y-axis import matplotlib.pyplot as plt import random # Define the sequence of user actions sequence = ["SESSION_START", "LOGIN", "VIEW_PRODUCT"] # Generate more shopping elements for _ in range(19): if sequence[-1] != 'ADD_TO_CART': sequence.append(random.choice(['VIEW_PRODUCT', 'ADD_TO_CART'])) else: sequence.append('VIEW_PRODUCT') # Finish up the shopping sequence += ["LOGOUT", "SESSION_END"] # Define n-grams to highlight highlighted_ngrams = [["VIEW_PRODUCT", "ADD_TO_CART"]] # Define a custom color dictionary for the actions color_dict = { "SESSION_START": "green", "LOGIN": "blue", "VIEW_PRODUCT": "orange", "ADD_TO_CART": "purple", "LOGOUT": "red", "SESSION_END": "black" } # Plot the sequence fig, ax = nu.plot_sequence( sequence=sequence, highlighted_ngrams=highlighted_ngrams, color_dict=color_dict, suptitle="User Session Sequence", verbose=False ) # Show the plot plt.show()
For detailed examples and use cases, refer to the documentation (TODO: write docs).
We welcome contributions to improve the share
library! To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add feature-name"
- Push to your branch:
git push origin feature-name
- Open a pull request on GitHub.
If you encounter any bugs or have feature requests, please open an issue in the Issues section of the repository. Be sure to include:
- A clear description of the issue.
- Steps to reproduce the problem.
- Any relevant logs or screenshots.
This project is licensed under the MIT License. Feel free to use, modify, and distribute it as per the terms of the license.
If you find this project helpful, please consider giving it a β on GitHub to show your support!