Skip to content

DataHub 업데이트 후 Lang2SQL 벡터화 미반영 #22

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks
ehddnr301 opened this issue Mar 17, 2025 · 2 comments
Open
3 tasks

DataHub 업데이트 후 Lang2SQL 벡터화 미반영 #22

ehddnr301 opened this issue Mar 17, 2025 · 2 comments
Assignees
Labels
bug Something isn't working datahub The integration point that retrieves metadata from DataHub to provide better context for query gener retrieval The mechanism responsible for fetching relevant table and column information to enhance SQL generati

Comments

@ehddnr301
Copy link
Collaborator

Why❓
DataHub 상에서는 테이블 및 컬럼 설명이 업데이트되었지만, lang2sql에서는 이전에 벡터화를 진행한 경우 이를 다시 수행하지 않아 최신 내용이 반영되지 않는 문제가 발생.
사용자가 최신 테이블 설명을 기반으로 SQL을 생성하려 해도, 반영되지 않아 부정확한 쿼리가 생성될 가능성이 있음.

How❓
벡터화 과정에서 DataHub의 최신 정보를 반영하도록 개선 필요.
테이블 및 컬럼 설명이 변경될 경우, lang2sql에서 이를 감지하여 벡터화를 다시 수행하는 방식으로 해결 가능.
벡터화 재실행 조건을 명확하게 정의하여 불필요한 리소스 소모 방지.

What❗️

  • 1안. DataHub에서 테이블 및 컬럼 설명이 변경되었는지 감지하는 로직 추가. (가능하다면)
  • 2안. 벡터화 재실행 여부를 결정하는 기준 정의 및 설정 가능하도록 구성. (1안이 불가능하다면)
  • 변경 사항이 있을 경우 lang2sql에서 벡터화를 다시 수행하는 기능 구현. (필요기능)
@ehddnr301 ehddnr301 added bug Something isn't working datahub The integration point that retrieves metadata from DataHub to provide better context for query gener retrieval The mechanism responsible for fetching relevant table and column information to enhance SQL generati labels Mar 17, 2025
@ehddnr301 ehddnr301 self-assigned this Apr 7, 2025
@ParkGyeongTae
Copy link
Contributor

자세히는 모르겠지만 DataHub의 Description이 업데이트 하더라도 "벡터화" 라는게 필요하군요..!
특정 주기를 통해 벡터화를 하는 것도 방법일 것 같습니다..!
DataHub의 설정이 변경된 것을 감지할 수는 있을 것 같아요! 제 기억이 맞다면 카프카를 데이터허브 내부에서 쓰는걸로 알고있습니당!
특정 토픽을 컨슘하면 되지 않을까 싶습니당

@helloimmelie
Copy link
Collaborator

table_info_db 폴더를 삭제 후, streamlit을 실행하여 최신 버전으로 datahub를 벡터화 시켜도 쿼리와 무관한 테이블의 목록이 '참고한 테이블 목록들' 에 출력됩니다.
문제 해결하실때 참고 부탁드리겠습니다..

작성한 Input: "고객 매출 데이터를 기반으로 유니크한 유저 수를 카운트하는 쿼리"
생성된 쿼리: SELECT "entity_id", COUNT(DISTINCT "activity_ts") AS "unique_payment_count", "feature_json" FROM "client_stream_churned_on_product" WHERE "activity_ts" >= date_trunc('month', today()) - interval '3 months' GROUP BY "entity_id", "feature_json" LIMIT 10

참고한 테이블 목록들:
{"ga_cube_churned_revenue":{"table_description":"Flattened OLAP cube model for ChurnedMRR","metric_date":"The primary key for this table"},"client_stream_responded_to_ces_survey":{"table_description":"Activity data triggered when a client responds to a CES survey","id":"The primary key for this table","entity_id":"The entity id of the client","activity_ts":"The timestamp of the activity","activity":"The name of the activity","revenue_impact":"The revenue impact of the activity","feature_json":"JSON string containing feature data related to the activity, including customer segments such as 'active_users',"},"client_stream_churned_on_product":{"table_description":"Activity data triggered when a client churns on a product","id":"The primary key for this table","entity_id":"The entity id of the client","activity_ts":"The timestamp of the activity","activity":"The name of the activity","revenue_impact":"The revenue impact of the activity","feature_json":"JSON string containing feature data related to the activity, including customer segments such as 'active_users',"},"client_stream_responded_to_csat_survey":{"table_description":"Activity data triggered when a client responds to a CSAT survey","id":"The primary key for this table","entity_id":"The entity id of the client","activity_ts":"The timestamp of the activity","activity":"The name of the activity","revenue_impact":"The revenue impact of the activity","feature_json":"JSON string containing feature data related to the activity, including customer segments such as 'active_users',"}}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datahub The integration point that retrieves metadata from DataHub to provide better context for query gener retrieval The mechanism responsible for fetching relevant table and column information to enhance SQL generati
Projects
None yet
Development

No branches or pull requests

3 participants