Merge latest master #29
Merged
We are excited to announce texera.io as the official website for the Texera project.
- Redundant sections such as "Motivation," "Education," and "Videos" have been removed.
- The "Goals" and "Publications" sections (covering both computer science and interdisciplinary publications) are retained for now, with some redundancy; we may remove them later.
- The blog has been relocated to a new dedicated home under texera.io for better accessibility and organization. The old blog site is now deprecated.
This operator's design requires more discussion. The current issues are: 1. It accesses the workflow context (i.e., workflowId and executionId), which may not be permitted in the future. 2. It interacts directly with the local file system. For the short term, we will drop support for this downloader; we can discuss how to support it in the future when needed.
This PR removes the obsolete `core/util` package, which was used to generate jooq code. The generation logic has been moved to `dao` as the `JooqCodeGenerator` class.
## Purpose
Address #3042. There was an issue where visualization result panels easily became "sticky": a panel would keep being dragged even after the user released the mouse. This PR fixes that issue.
## Changes
Add logic to modify the z-index of visualization results directly. While the panel is being dragged, the visualization result's z-index becomes -1; once the panel is no longer being dragged, the z-index changes back to 0 so that the user can interact with the visualization.
## Demo
Before  After

---------

Co-authored-by: Xinyuan Lin <[email protected]>
As titled, this PR removes the redundant jooq code in the package `edu.uci.ics.texera.web.model.jooq.generated` in `core/amber`. All imports related to it now point to `edu.uci.ics.texera.dao.jooq.generated` in `core/dao`.
This PR updates the links in the README to point to texera.io.
1. Add a link to the texera.io landing page on the Texera logo.
2. Change URLs to be more meaningful instead of purely numerical.
3. Add links to publications, videos, etc.
This PR removes the cache source descriptor. We now explicitly create a physical operator to read the cache during scheduling (cutting off physical links for materialization).
### Purpose:
Since "Hub" has been chosen as the more formal name for the community, the previous name needs to be replaced with "Hub."
### Change:
The 'Community' entry in the left sidebar has been replaced with 'Hub'.
### Demo:
Before:  After: 
…bles. (#3168)
### Purpose:
A column indicating publication status already exists in both the dataset and workflow tables, but the columns have different names and types.
### Change:
Modified the dataset and workflow tables to unify the format of `is_public` across tables. **To complete the database update, please execute `18.sql` located in the update folder.**
This PR unifies the design of the output mode and makes it a property of an output port. Previously we had two modes: 1. SET_SNAPSHOT (return a snapshot of a table) 2. SET_DELTA (return a delta of tuples), plus different chart types (e.g., HTML, bar chart, line chart). However, we only use the HTML type after switching to pyplot. Additionally, the output mode and chart type were associated with a logical operator and passed along to the downstream sink operator, which does not support operators with multiple output ports.
### New design
1. Move OutputMode onto an output port's property.
2. Unify to three modes:
   a. SET_SNAPSHOT (return a snapshot of a table)
   b. SET_DELTA (return a delta of tuples)
   c. SINGLE_SNAPSHOT (only used for visualizations to return an HTML)

SINGLE_SNAPSHOT is needed for now because we need a way to differentiate an HTML output from a normal data table output: storage with MongoDB is limited to 16 MB per document, and HTML outputs are usually larger than 16 MB. After we remove this limitation on the storage, we will remove SINGLE_SNAPSHOT and fall back to SET_SNAPSHOT.
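As a rough illustration of the unified design (not the actual Texera definitions, which are protobuf-generated), the three modes and their attachment to an output port could be sketched in Scala as follows:

```scala
// Minimal sketch with illustrative names: the three unified output modes as port properties.
sealed trait OutputMode
case object SetSnapshot extends OutputMode    // return a snapshot of a table
case object SetDelta extends OutputMode       // return a delta of tuples
case object SingleSnapshot extends OutputMode // visualization-only: a single HTML payload

// An output port carries its own mode, so a multi-output operator can mix modes per port.
final case class OutputPort(portId: Int, mode: OutputMode = SetSnapshot)

object OutputModeExample extends App {
  val dataPort = OutputPort(portId = 0)                        // regular table output
  val vizPort  = OutputPort(portId = 1, mode = SingleSnapshot) // HTML visualization output
  println(Seq(dataPort, vizPort))
}
```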
This PR refactors the handling of sink operators in Texera by removing the sink descriptor and introducing a streamlined approach to creating physical sink operators during compilation and scheduling. Additionally, it shifts the storage assignment logic from the logical layer to the physical layer.
1. **Sink Descriptor Removal:** Removed the sink descriptor; physical sink operators are no longer created through descriptors. In the future, we will remove physical sink operators entirely.
2. **Sink Operator Creation:**
   - Introduced a temporary factory for creating physical sink operators without relying on a descriptor.
   - Physical sink operators are now considered part of the sub-plan of their upstream logical operator. For example, if the HashJoin logical operator requires a sink, its physical sub-plan includes the building physicalOp, the probing physicalOp, and the sink physicalOp.
3. **Storage Assignment Refactor:**
   - Merged the storage assignment logic into the physical layer, removing it from the logical layer.
   - When a physical sink operator is created (either during compilation or scheduling), its associated storage is created at the same time.
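A minimal sketch of the sub-plan idea, using made-up types rather than Texera's actual PhysicalOp and factory: the sink physicalOp is produced without a descriptor and receives its storage assignment at creation time.

```scala
// Stand-in types; not Texera's actual PhysicalOp or sink factory.
final case class PhysicalOp(name: String, storageKey: Option[String] = None)
final case class SubPlan(ops: List[PhysicalOp])

object SinkFactory {
  // Creates a sink physicalOp without a descriptor and assigns its storage immediately.
  def createSink(upstreamLogicalOp: String): PhysicalOp =
    PhysicalOp(s"$upstreamLogicalOp-sink", storageKey = Some(s"storage-for-$upstreamLogicalOp"))
}

object SinkExample extends App {
  // A HashJoin expands into build + probe physicalOps, plus a sink when its result is stored.
  val hashJoinSubPlan = SubPlan(
    List(
      PhysicalOp("HashJoin-build"),
      PhysicalOp("HashJoin-probe"),
      SinkFactory.createSink("HashJoin")
    )
  )
  hashJoinSubPlan.ops.foreach(println)
}
```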
This PR fixes the issue that, when MongoDB is used as the result storage, the execution keeps failing. The root cause: when MongoDB is the result storage, the schema has to be extracted from the physical op, and the implementation used the physical ops in `subPlan` to extract the schema, which is incorrect because `subPlan`'s physical ops do not have propagated output schemas. The fix uses `physicalPlan.getOperator` instead of `subPlan.getOperator`, so the operator flowing downstream comes from `physicalPlan`, not `subPlan`.
This PR fixes an issue where operators marked for viewing results did not display the expected results due to a mismatch caused by string-based comparison instead of using operator identities. By updating the matching logic to rely on operator identities, this change ensures accurate identification of marked operators and correct display of their results.
Schema propagation is now handled in the physical plan to ensure all ports, both external and internal, have an associated schema. As a result, schema propagation in the logical plan is no longer necessary. In addition, the workflow context was previously used to provide user information so that schema propagation could resolve file names; this is also no longer needed, as files are resolved through an explicit call as a step of compilation. The WorkflowContext no longer needs to be set on the logical operator.
The cache logic will be rewritten at the physical plan layer, using ports as the caching unit. This version is being removed temporarily.
Previously, all results were tied to logical operators. This PR modifies the engine to associate results with output ports, enabling better granularity and support for operators with multiple outputs.
### Key Change: StorageKey with PortIdentity
The most significant update in this PR is the adjustment of the storage key format to include both the logical operator ID and the port ID. This ensures that logical operators with multiple output ports (e.g., Split) can have distinct storages created for each output port. For now, the frontend retrieves results from the default output port (port 0). In future updates, the frontend will be enhanced to support retrieving results from additional output ports, providing more flexibility in how results are accessed and displayed.
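A hypothetical sketch of the new key format, with illustrative stand-ins for the identity types, showing how a multi-output operator such as Split gets one storage per output port:

```scala
// Illustrative stand-ins for the protobuf-generated identity types.
final case class OperatorIdentity(id: String)
final case class PortIdentity(id: Int = 0)

// The storage key now combines the logical operator id with the output port id.
final case class ResultStorageKey(operatorId: OperatorIdentity, portId: PortIdentity) {
  def asKey: String = s"${operatorId.id}_port${portId.id}"
}

object StorageKeyExample extends App {
  // A Split operator with two output ports gets two distinct storages.
  val left  = ResultStorageKey(OperatorIdentity("Split-1"), PortIdentity(0))
  val right = ResultStorageKey(OperatorIdentity("Split-1"), PortIdentity(1))
  println(left.asKey)  // Split-1_port0 (the default port the frontend currently reads)
  println(right.asKey) // Split-1_port1
}
```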
There are a few operator descriptors written in Java, which makes them difficult to use and maintain. This PR converts all such descriptors to Scala to streamline the migration process to new APIs and facilitate future work, such as operator offloading. Changed operators:
- PythonUDFSourceOpDescV2
- RUDFSourceOpDesc
- SentimentAnalysisOpDesc
- SpecializedFilterOpDesc
- TypeCastingOpDesc
…de-based IDEs (#3180) Add Scala ([Metals](https://github.com/scalameta/metals))-generated folders to `.gitignore` to support VSCode and the Cursor IDE. Metals is a popular plugin for VSCode-based IDEs for developing Scala applications.
This PR moves all protobuf definitions under workflow-core to be within the core package name.
The scalapb proto definition was present in both workflow-core and amber. This PR removes the second copy.
The Python protobuf-generated code was outdated. This PR updates the generation script to include all protobuf definitions from the workflow-core and amber sub-projects, ensuring that the latest Python code is generated and aligned with the current protobuf definitions.
Previously, creating a physical operator during compilation also required the creation of its corresponding executor instances. To delay this process so that executor instances are created within workers, we used a lambda function (in `OpExecInitInfo`). However, the lambda approach had a critical limitation: it was not serializable and was language-dependent. This PR addresses this issue by replacing the lambda functions in `OpExecInitInfo` with fully serializable Protobuf entities. The serialized information ensures compatibility with distributed environments and is language-independent. Two primary types of `OpExecInitInfo` are introduced:
1. **`OpExecWithClassName`**:
   - **Fields**: `className: String`, `descString: String`.
   - **Behavior**: The language compiler dynamically loads the class specified by `className` and uses `descString` as its initialization argument.
2. **`OpExecWithCode`**:
   - **Fields**: `code: String`, `language: String`.
   - **Behavior**: The language compiler compiles the provided `code` based on the specified `language`. The arguments are already pre-populated into the code string.
### Special Cases
The `ProgressiveSink` and `CacheSource` executors are treated as special cases. These executors require additional unique information (e.g., `storageKey`, `workflowIdentity`, `outputMode`) to initialize their executor instances. While this PR preserves the handling of these special cases, these executors will eventually be refactored or removed as part of the plan to move storage management to the port layer.
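A simplified Scala sketch of the two variants, with plain case classes standing in for the actual protobuf-generated messages, and of how a worker might dispatch on them:

```scala
// Plain case classes standing in for the protobuf-generated OpExecInitInfo messages.
sealed trait OpExecInitInfo
final case class OpExecWithClassName(className: String, descString: String) extends OpExecInitInfo
final case class OpExecWithCode(code: String, language: String) extends OpExecInitInfo

object OpExecInitInfoExample extends App {
  // Worker-side dispatch: load a class by name, or compile the shipped code.
  def initialize(info: OpExecInitInfo): Unit = info match {
    case OpExecWithClassName(className, descString) =>
      println(s"loading $className with a ${descString.length}-character descriptor string")
    case OpExecWithCode(code, language) =>
      println(s"compiling ${code.length} characters of $language code")
  }

  initialize(OpExecWithClassName("edu.uci.ics.SomeOpExec", """{"attribute":"price"}"""))
  initialize(OpExecWithCode("print('hello')", "python"))
}
```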
This PR improves exception handling in AsyncRPCServer to unwrap the actual exception from InvocationTargetException. Old: <img width="889" alt="截屏2024-12-31 上午2 38 34" src="https://github.com/user-attachments/assets/8ec40cce-1b7a-4ecc-8518-a67b7e79888b" /> New: <img width="1055" alt="截屏2024-12-31 上午2 33 18" src="https://github.com/user-attachments/assets/de1c3e42-0dbb-4dbe-8b76-dee486d5bbb9" />
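A minimal sketch of the unwrapping, assuming the handler is invoked reflectively so the real failure arrives wrapped in an `InvocationTargetException`:

```scala
import java.lang.reflect.InvocationTargetException

object UnwrapExample extends App {
  // Return the underlying cause instead of the opaque reflection wrapper.
  def unwrap(t: Throwable): Throwable = t match {
    case e: InvocationTargetException if e.getCause != null => e.getCause
    case other                                              => other
  }

  val wrapped = new InvocationTargetException(new IllegalArgumentException("bad argument"))
  println(unwrap(wrapped).getMessage) // prints "bad argument"
}
```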
This PR removes all schema propagation functions from the logical plan. Developers are now required to implement `SchemaPropagationFunc` directly within the PhysicalPlan. This ensures that each PhysicalOp has its own distinct schema propagation logic, aligning schema handling more closely with the execution layer. To accommodate the need for schema propagation in the logical plan (primarily for testing purposes), a new method, `getExternalOutputSchemas`, has been introduced. This method facilitates the propagation of schemas across all PhysicalOps within a logical operator, ensuring compatibility with existing testing workflows.
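A simplified sketch, with stand-in types rather than Texera's actual `SchemaPropagationFunc` or `PhysicalOp`, of what per-PhysicalOp schema propagation looks like:

```scala
// Simplified stand-ins; not Texera's actual Schema, SchemaPropagationFunc, or PhysicalOp.
final case class Schema(attributes: List[String])
final case class SchemaPropagationFunc(func: Map[Int, Schema] => Map[Int, Schema])

final case class PhysicalOp(id: String, propagateSchema: SchemaPropagationFunc) {
  def outputSchemas(inputSchemas: Map[Int, Schema]): Map[Int, Schema] =
    propagateSchema.func(inputSchemas)
}

object SchemaPropagationExample extends App {
  // A projection-like physicalOp that keeps only the "id" attribute on output port 0.
  val projection = PhysicalOp(
    "Projection",
    SchemaPropagationFunc(inputs => Map(0 -> Schema(inputs(0).attributes.filter(_ == "id"))))
  )
  println(projection.outputSchemas(Map(0 -> Schema(List("id", "name"))))) // Map(0 -> Schema(List(id)))
}
```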
Each port must have exactly one schema. If multiple links are connected to the same port, they are required to share the same schema. This PR introduces a validation step during schema propagation to ensure this constraint is enforced as part of the compilation process. For example, consider a Union operator with a single input port that supports multiple links. If upstream operators produce differing output schemas, the validation will fail with an appropriate error message: 
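A minimal sketch of the validation idea under simplified stand-in types: group incoming links by input port and require a single distinct schema per port.

```scala
// Simplified stand-in types for the validation idea.
final case class Schema(attributes: List[String])
final case class Link(toPortId: Int, schema: Schema)

object PortSchemaValidation extends App {
  // Every link arriving at the same input port must carry the same schema.
  def validate(links: Seq[Link]): Unit =
    links.groupBy(_.toPortId).foreach { case (portId, incoming) =>
      val schemas = incoming.map(_.schema).distinct
      require(
        schemas.size <= 1,
        s"Port $portId receives ${schemas.size} different schemas: $schemas"
      )
    }

  // A Union-like input port 0 with matching link schemas passes validation.
  validate(Seq(Link(0, Schema(List("a", "b"))), Link(0, Schema(List("a", "b")))))
  // Mismatched schemas on the same port would fail:
  // validate(Seq(Link(0, Schema(List("a"))), Link(0, Schema(List("a", "b")))))
}
```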
#### This PR introduces the `CostEstimator` trait, which estimates the cost of a region given some resource units.
- The cost estimator is used by `CostBasedScheduleGenerator` to calculate the cost of a schedule during search.
- Currently we only consider one type of schedule for each region plan, which is a total order of the regions. The cost of the schedule (and also the cost of the region plan) is thus the sum of the costs of the regions.
- The resource units are currently passed as placeholders because we assume a region will have all the resources when doing the estimation. The units may be used in the future if we consider different methods of schedule generation. For example, if we allow two regions to run concurrently, the units will be split in half for each region.
#### A `DefaultCostEstimator` implementation is also added, which uses past execution statistics to estimate the wall-clock runtime of a region:
- The runtime of each region is represented by the runtime of its longest-running operator.
- The runtime of each operator is estimated using the statistics from the **latest successful execution** of the workflow.
- If such statistics do not exist (e.g., if it is the first execution, or if all past executions failed), we fall back to using the number of materialized edges as the cost.
- Added test cases using mock MySQL data.
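A rough Scala sketch of this cost model, using illustrative types rather than the actual `CostEstimator` trait: each region's cost is the runtime of its longest-running operator from past statistics, falling back to the materialized-edge count, and a schedule's cost is the sum over its regions.

```scala
// Illustrative types; not the actual CostEstimator trait or CostBasedScheduleGenerator.
final case class Region(operators: Seq[String], materializedEdges: Int)

trait CostEstimator {
  def estimate(region: Region, resourceUnits: Int): Double
}

// Uses past operator runtimes when available; otherwise falls back to the edge count.
class StatisticsBasedEstimator(pastRuntimes: Map[String, Double]) extends CostEstimator {
  def estimate(region: Region, resourceUnits: Int): Double = {
    val runtimes = region.operators.flatMap(pastRuntimes.get)
    if (runtimes.size == region.operators.size) runtimes.max // longest-running operator
    else region.materializedEdges.toDouble                   // fallback cost
  }
}

object CostExample extends App {
  val estimator = new StatisticsBasedEstimator(Map("scan" -> 4.0, "join" -> 9.5))
  val schedule = Seq(
    Region(Seq("scan", "join"), materializedEdges = 1),
    Region(Seq("udf"), materializedEdges = 2)
  )
  // Schedule cost = sum over regions: 9.5 (from statistics) + 2.0 (fallback) = 11.5
  println(schedule.map(r => estimator.estimate(r, resourceUnits = 1)).sum)
}
```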
To simplify schema creation, this PR removes the Schema.builder() pattern and makes Schema immutable. All modifications now result in the creation of a new Schema instance.
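For illustration only (not the actual Texera `Schema` API), an immutable schema where every modification returns a new instance might look like this:

```scala
// Illustrative types, not the actual Texera Schema/Attribute classes.
final case class Attribute(name: String, attributeType: String)
final case class Schema(attributes: List[Attribute]) {
  // Every "modification" returns a new Schema; the original is never mutated.
  def add(attribute: Attribute): Schema = copy(attributes = attributes :+ attribute)
  def remove(name: String): Schema      = copy(attributes = attributes.filterNot(_.name == name))
}

object ImmutableSchemaExample extends App {
  val base     = Schema(List(Attribute("id", "integer")))
  val extended = base.add(Attribute("name", "string")) // new instance; `base` is unchanged
  println(base.attributes.map(_.name))     // List(id)
  println(extended.attributes.map(_.name)) // List(id, name)
}
```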
PhysicalOp relies on the number of input ports to determine whether an operator is a source operator. For Python UDFs, after the changes in #3183, the input ports were not correctly associated with the PhysicalOp, causing all Python UDFs to be recognized as source operators. This PR fixes the issue.
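A simplified sketch of the check with stand-in types: an operator counts as a source only when it has no input ports, so the ports must be attached to the PhysicalOp before the check runs.

```scala
// Simplified stand-in types for the source-operator check.
final case class InputPort(id: Int)
final case class PhysicalOp(id: String, inputPorts: List[InputPort]) {
  // An operator is a source only when it has no input ports.
  def isSourceOperator: Boolean = inputPorts.isEmpty
}

object SourceCheckExample extends App {
  val scan = PhysicalOp("CSVScan", inputPorts = Nil)
  val udf  = PhysicalOp("PythonUDF", inputPorts = List(InputPort(0)))
  println(scan.isSourceOperator) // true
  println(udf.isSourceOperator)  // false: with its input port attached, the UDF is not a source
}
```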
The ubuntu-latest image has recently been updated from 22.04 to 24.04. However, the new image is incompatible with libncurses5, requiring an upgrade to libncurses6. Unfortunately, after upgrading, sbt no longer functions as expected, an issue also documented in [actions/setup-java#712](actions/setup-java#712). It appears that the 24.04 image does not include sbt by default. This PR addresses the issue by pinning the image to ubuntu-22.04. We can revisit and update the version when the 24.04 image becomes more stable and these compatibility problems are resolved.
In this PR, we add the user avatar to the execution history panel. <img width="462" alt="Screenshot 2025-01-06 at 3 53 23 PM" src="https://github.com/user-attachments/assets/e4e662af-c1a9-4686-90f4-b6bef155a36b" />
# Implement Apache Iceberg for Result Storage
<img width="556" alt="Screenshot 2025-01-06 at 3 18 19 PM" src="https://github.com/user-attachments/assets/4edadb64-ee28-48ee-8d3c-1d1891d69d6a" />
## How to Enable Iceberg Result Storage
1. Update `storage-config.yaml`:
   - Set `result-storage-mode` to `iceberg`.
## Major Changes
- **Introduced `IcebergDocument`**: a thread-safe `VirtualDocument` implementation for storing and reading results in Iceberg tables.
- **Introduced `IcebergTableWriter`**: an append-only writer for Iceberg tables with a configurable buffer size.
- **Catalog and data storage for Iceberg**: uses a local file system (`file:/`) via `HadoopCatalog` and `HadoopFileIO`. This ensures Iceberg operates without relying on external storage services.
- Added a new `workerId` parameter to `ProgressiveSinkOpExec`; each result storage writer takes this `workerId` as a new parameter.
## Dependencies
- Added Apache Iceberg-related libraries.
- Introduced Hadoop-related libraries to support Iceberg's `HadoopCatalog` and `HadoopFileIO`. These libraries are used for placeholder configuration but do not introduce a runtime dependency on HDFS.
## Overview of Iceberg Components
### `IcebergDocument`
- Manages reading and organizing data in Iceberg tables.
- Supports iterator-based incremental reads with thread-safe operations for reading and clearing data.
- Initializes or overrides the Iceberg table during construction.
### `IcebergTableWriter`
- Writes data as immutable Parquet files in an append-only manner.
- Each writer uniquely prefixes its files to avoid conflicts (`workerIndex_fileIndex` format).
- Not thread-safe; single-threaded access is recommended.
## Data Storage via Iceberg Tables
- **Write**:
  - Tables are created per `storage key`.
  - Writers append Parquet files to the table, ensuring immutability.
- **Read**:
  - Readers use `IcebergDocument.get` to fetch data via an iterator.
  - The iterator reads data incrementally while ensuring the data order matches the commit sequence of the data files.
## Data Reading Using File Metadata
- Data files are read using `getUsingFileSequenceOrder`, which:
  - Retrieves and sorts metadata files (`FileScanTask`) by sequence numbers.
  - Reads records sequentially, skipping files or records as needed.
  - Supports range-based reading (`from`, `until`) and incremental reads.
  - Sorting ensures data consistency and order preservation.
## Hadoop Usage Without HDFS
- The `HadoopCatalog` uses an empty Hadoop configuration, defaulting to the local file system (`file:/`).
- This enables efficient management of Iceberg tables in local or network file systems without requiring HDFS infrastructure.

---------

Co-authored-by: Shengquan Ni <[email protected]>
This PR removes the `MemoryDocument` class and its usage. Additionally, it updates the fallback mechanism for `MongoDocument`, changing it from `Memory` to `Iceberg`.
…t functioning properly (#3200)
### Purpose:
Currently, a 400 error occurs when calling `persistWorkflow`. The reason is that the parameter sent from the frontend to the backend is `isPublished` instead of `isPublic`.
### Change:
Change the name of the parameter sent to the backend.
### Demos:
Before:  After: 
Remove the redundant Flarum user registration service from the Texera backend, as it is unnecessary. User registration for Flarum can be handled directly by calling the Flarum user registration API from the Texera frontend. This PR does not affect the functionality or lifecycle of either Texera or Flarum.
This PR addresses schema normalization and logic improvements for tracking operator runtime statistics in a workflow execution system. It introduces changes to the database schema, migration scripts, and Scala code responsible for inserting and managing runtime statistics. The goal is to reduce redundancy, improve maintainability, and ensure data consistency between `operator_executions` and `operator_runtime_statistics`.
### Schema Design
1. New table design:
   - `operator_executions`: Tracks execution metadata for each operator in a workflow execution. Each row contains `operator_execution_id`, `workflow_execution_id`, and `operator_id`. This table ensures that operator executions are uniquely identifiable.
   - `operator_runtime_statistics`: Tracks runtime statistics for each operator execution at specific timestamps. It includes `operator_execution_id` as a foreign key, ensuring a direct reference to `operator_executions`.
2. Normalization improvements:
   - Replaced the repeated `execution_id` and `operator_id` in `workflow_runtime_statistics` with a single foreign key, `operator_execution_id`, pointing to `operator_executions`.
   - Split the previous large `workflow_runtime_statistics` table into smaller, more manageable tables, eliminating redundancy and improving data integrity.
3. Indexes and keys:
   - Added a composite index on `operator_execution_id` and `time` in `operator_runtime_statistics` to speed up joins and queries ordered by time.
### Testing
The `core/scripts/sql/update/19.sql` script creates the two new tables, `operator_executions` and `operator_runtime_statistics`, and migrates the data from `workflow_runtime_statistics` into them.

---------

Co-authored-by: Kunwoo Park <[email protected]>
### Purpose
To restricted/inactive users, it looks as if they can clone a workflow, when they actually cannot. To reduce the confusion, the clone button is disabled in the GUI. Fix #3066.
### Changes
Added an isAdminOrRegularUser variable in hub-workflow-detail-component.ts to check the user role. If the user is not an admin or regular user, the clone button is disabled.
https://github.com/user-attachments/assets/c426c50d-fb31-4377-a2b7-3de2a24da512

---------

Co-authored-by: Texera <[email protected]>
Same reason as #3132
This PR refactors the code style of `DatasetResource` and `DatasetAccessResource`. Specifically, the refactoring principles are:
1. Move all function definitions in the `object` that are only used within the `class` into the `class`.
2. Make the interaction with jooq uniform: use either raw jooq queries or the DAO, not both.
3. Flatten single-kv-pair case classes.
4. Flatten single-line functions.
This PR addresses an issue where the system interacts with the MySQL database even when `user-sys.enabled` is set to `false`. In such cases, the system should not perform any interactions with the MySQL database. To resolve this, an if statement has been added to skip this logic, ensuring the system behaves as expected. Co-authored-by: Kunwoo Park <[email protected]>
The `IF` operator evaluates a condition against a specified state variable and routes the input data to either the `True` or `False` branch accordingly. The condition port accepts a state variable, and users can define the name of the state variable to be evaluated by the `IF` operator. Note: The Date to State operator will be introduced in a separate PR. 
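A simplified sketch of the routing behavior with hypothetical types: the operator looks up the named state variable and sends the tuple to the True or False branch.

```scala
// Hypothetical types for illustration; not the actual IF operator implementation.
final case class Tuple(fields: Map[String, Any])

sealed trait Branch
case object TrueBranch  extends Branch
case object FalseBranch extends Branch

// Routes each input tuple based on the value of a named boolean state variable.
class IfOperator(conditionVariable: String, state: Map[String, Boolean]) {
  def route(tuple: Tuple): (Branch, Tuple) =
    if (state.getOrElse(conditionVariable, false)) (TrueBranch, tuple) else (FalseBranch, tuple)
}

object IfExample extends App {
  val op = new IfOperator("passedThreshold", state = Map("passedThreshold" -> true))
  println(op.route(Tuple(Map("id" -> 1)))) // routed to TrueBranch
}
```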
…functions (#3211) This PR refactors the storage key by representing it with a storage URI. It also adds `openDocument` and `createDocument` functions that take the given URI.
## Major Changes
### 1. VFS URI resource type definition, resolve and decode functions
Two types of VFS resources are defined:
```scala
object VFSResourceType extends Enumeration {
  val RESULT: Value = Value("result")
  val MATERIALIZED_RESULT: Value = Value("materializedResult")
}
```
Two defs are added to the `FileResolver`:
- `resolve`: create the URI pointing to the storage resource on the VFS
```scala
/**
  * Resolve a VFS resource to its URI. The URI can be used by the DocumentFactory to create or open the resource.
  *
  * @param resourceType The type of the VFS resource.
  * @param workflowId   Workflow identifier.
  * @param executionId  Execution identifier.
  * @param operatorId   Operator identifier.
  * @param portIdentity Optional port identifier. **Required** if `resourceType` is `RESULT` or `MATERIALIZED_RESULT`.
  * @return A VFS URI
  * @throws IllegalArgumentException if `resourceType` is `RESULT` but `portIdentity` is missing.
  */
def resolve(
    resourceType: VFSResourceType.Value,
    workflowId: WorkflowIdentity,
    executionId: ExecutionIdentity,
    operatorId: OperatorIdentity,
    portIdentity: Option[PortIdentity] = None
): URI
```
- `decodeVFSUri`: decode a VFS URI into its components
```scala
/**
  * Parses a VFS URI and extracts its components.
  *
  * @param uri The VFS URI to parse.
  * @return A `VFSUriComponents` object with the extracted data.
  * @throws IllegalArgumentException if the URI is malformed.
  */
def decodeVFSUri(uri: URI): (
    WorkflowIdentity,
    ExecutionIdentity,
    OperatorIdentity,
    Option[PortIdentity],
    VFSResourceType.Value
)
```
### 2. `createDocument` and `openDocument` functions in the `DocumentFactory`
`createDocument` and `openDocument` defs create/open a storage resource pointed to by the URI.
- `DocumentFactory.createDocument`
```scala
/**
  * Create a document for storage specified by the uri.
  * This document is suitable for storing structural data, i.e. the schema is required to create such a document.
  * @param uri the location of the document
  * @param schema the schema of the data stored in the document
  * @return the created document
  */
def createDocument(uri: URI, schema: Schema): VirtualDocument[_]
```
- `DocumentFactory.openDocument`
```scala
/**
  * Open a document specified by the uri.
  * The document should store structural data; the VirtualDocument and the Schema of its data are returned.
  * @param uri the uri of the document
  * @return the VirtualDocument, which is the handler of the data, and the Schema of the data stored in the document
  */
def openDocument(uri: URI): (VirtualDocument[_], Schema)
```
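A usage sketch of the two entry points, assuming the Texera classes above are on the classpath; the identity values and the way they are constructed are illustrative assumptions, so actual field names may differ.

```scala
// Usage sketch only: imports of FileResolver, DocumentFactory, VFSResourceType, and the
// identity/Schema types from the Texera codebase are omitted here.
import java.net.URI

object VfsUsageSketch {
  def writeThenReadBack(
      workflowId: WorkflowIdentity,
      executionId: ExecutionIdentity,
      operatorId: OperatorIdentity,
      schema: Schema
  ): Unit = {
    // RESULT resources must carry a port identity.
    val uri: URI = FileResolver.resolve(
      VFSResourceType.RESULT,
      workflowId,
      executionId,
      operatorId,
      Some(PortIdentity())
    )
    val created                  = DocumentFactory.createDocument(uri, schema) // create with a schema
    val (reopened, storedSchema) = DocumentFactory.openDocument(uri)           // reopen later from the same URI
  }
}
```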
This PR removes the configuration related to the Python language server from application.conf, as it is no longer needed; the Python language server has already been separated out.
…urceMapping during the garbage collection (#3216) This PR completes the garbage collection logic under the newly introduced URIs and ExecutionResourceMapping. During garbage collection, after all the documents associated with the last execution id are released, the entry is removed from the mapping. This PR also runs scalafixAll to remove unused imports.
This PR fixes the slow workflow execution issue by calling desc.sourceSchema only once in the scan operator executor's constructor, instead of for every output tuple. This PR also includes some reformatting to remove unused imports.
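A minimal sketch of the pattern (stand-in types, not the actual scan executor): the schema thunk is evaluated once when the executor is constructed rather than once per output tuple.

```scala
// Stand-in types; not the actual scan executor. The point is that the schema thunk
// runs once in the constructor instead of once per output tuple.
final case class Schema(attributes: List[String])

class ScanSourceOpExec(sourceSchema: () => Schema) {
  // Evaluated once when the executor is constructed.
  private val schema: Schema = sourceSchema()

  def produce(rows: Iterator[List[String]]): Iterator[Map[String, String]] =
    rows.map(row => schema.attributes.zip(row).toMap)
}

object ScanExample extends App {
  var calls = 0
  val exec = new ScanSourceOpExec(() => { calls += 1; Schema(List("id", "name")) })
  exec.produce(Iterator(List("1", "alice"), List("2", "bob"))).foreach(println)
  println(s"sourceSchema evaluated $calls time(s)") // 1
}
```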
This PR addresses issue #3016, where the execution dashboard takes a long time to load. Changes made:
- Add a loading icon in the frontend while the execution info is loading.
- Move pagination to the backend, including the sorting and filtering functions.
Loading Icon:  Page Size Change: 

---------

Co-authored-by: Chris <[email protected]>
### Purpose:
The current Google login button occasionally disappears. The reason is that the Google script is loaded statically, meaning it is automatically loaded via the `<script>` tag when the page loads. This approach has a potential issue: if we want to use the `onGoogleLibraryLoad` callback function to confirm that the script has successfully loaded, we must ensure that the callback is defined before the script finishes loading. Otherwise, when the script completes loading, Google will check whether `onGoogleLibraryLoad` is defined; if it is not, the script load notification will be missed, and the Google login initialization logic will not execute. Fix #3155.
### Changes:
1. To reduce the maintenance cost caused by frequent updates to the official Google API, we implement Google login using the third-party library angularx-social-login instead of relying directly on the official Google API.
2. Added the dependency @abacritt/angularx-social-login (version 2.1.0). Please use `yarn install` to install it.
3. Additionally, a test file for the Google Login component has been added to verify that the component initializes correctly after the script is successfully loaded.
### Demos:
Before: https://github.com/user-attachments/assets/2717e6cb-d250-49e0-a09d-d3d11ffc7be3
After: https://github.com/user-attachments/assets/81d9e48b-02ee-48ad-8481-d7a50ce43ba0

---------

Co-authored-by: Chris <[email protected]>
Co-authored-by: Xinyuan Lin <[email protected]>
Fix issue #3055 where scanning an empty CSV file would result in an error.

---------

Co-authored-by: Jiadong Bai <[email protected]>