
Commit 0c10791

Initial Version
1 parent a3e2ca7 commit 0c10791

29 files changed: +1829816 −0 lines changed

.env.example

+7

AZURE_OPENAI_ENDPOINT="<enter your Azure OpenAI endpoint>"
AZURE_OPENAI_API_KEY="<enter your Azure OpenAI key>"
AZURE_OPENAI_API_VERSION="2023-03-15-preview"
SPEECH_KEY="<enter your Azure AI Speech key>"
SPEECH_REGION="<enter your Azure AI Speech region>"
FORMS_KEY="<enter your Azure AI Document Intelligence key>"
FORMS_ENDPOINT="<enter your Azure AI Document Intelligence endpoint>"

.gitignore

+1
.env

README.md

+101

# OpenAI Document Analyzer

![Preview Screenshot](preview.png)

## Introduction
This demo application was built to show how Azure AI Document Intelligence and the Azure OpenAI Service can be used to increase the efficiency of document analysis.

You can create a new project and upload your PDF documents to it. The documents are analyzed with Azure AI Document Intelligence and the results are stored in the project folder.

A FAISS vector search index is created for the documents. You can use similarity search to find relevant content in the documents and create a reduced version of the content, reducing the human effort needed for reading.
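
How the index is built is not shown in this README; as a rough illustration, here is a minimal sketch using LangChain's FAISS wrapper with Azure OpenAI embeddings. The use of LangChain, the deployment name, and all paths and texts are assumptions, not the app's actual code:

```python
# Illustrative sketch only: build a FAISS index over document chunks using
# Azure OpenAI embeddings via LangChain. Deployment name and paths are
# placeholders; the app's real indexing code may differ.
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings(
    deployment="text-embedding-ada-002",  # assumed Azure embedding deployment
    openai_api_type="azure",
    openai_api_base=os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

# chunks are the per-section texts; the metadata keeps the page number so
# search hits can be mapped back to pages (used for the ground-truth check)
chunks = ["First section text ...", "Second section text ..."]
metadatas = [{"page": 1}, {"page": 2}]

index = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)
index.save_local("projects/demo/faiss_index")  # illustrative path

# similarity search with a configurable k (see the note on k values below)
for doc in index.similarity_search("liquidity risk", k=4):
    print(doc.metadata["page"], doc.page_content[:80])
```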

Tables and key-value pairs found in the documents are extracted and can be viewed with the click of a button.

The reduced version of the content (a Markdown version of all pages where the similarity search found relevant content) can be used as context to ask questions about the documents with Azure OpenAI ("gpt-35-turbo" or "text-davinci-003").

The application also provides speech-to-text and text-to-speech functionality to increase its accessibility (currently English and German).
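
The Speech integration itself is not shown here; below is a minimal text-to-speech sketch with the Azure Speech SDK using the SPEECH_KEY and SPEECH_REGION values from .env. The voice name and sample text are placeholders:

```python
# Illustrative sketch: text-to-speech with the Azure Speech SDK, reading the
# key and region from the environment as in .env.example.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "de-DE-KatjaNeural"  # example German voice

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hallo, das ist ein Test.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
```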

Questions and queries are stored in the topics subfolder of the project, so that you can easily reuse them, e.g. for quantitative prompt testing with prompt flow.

For a good demonstration, I suggest importing a set of documents and defining the information (topics) that you try to find in them, together with the pages that humans would normally look at to gather that information.

These pages can be used to set the ground truth for the topic. If the ground truth is set, the application notifies you whether the pages found by the similarity search are part of the ground truth or not.

With a higher k value you can increase the likelihood of finding the relevant pages, but you also increase the number of pages that are not relevant, so the efficiency gain is reduced.

However, this is a great way to demonstrate the impact of proper prompts and k values on the results.
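
A minimal sketch of such a ground-truth check (the page numbers are made up, and the app's actual comparison logic may differ):

```python
# Illustrative sketch: compare the pages returned by the similarity search
# against the ground-truth pages a human would read.
def check_ground_truth(found_pages: set[int], ground_truth: set[int]) -> dict:
    hits = found_pages & ground_truth    # relevant pages that were found
    misses = ground_truth - found_pages  # relevant pages that were missed
    extra = found_pages - ground_truth   # retrieved but not in the ground truth
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return {"hits": hits, "misses": misses, "extra": extra, "recall": recall}

print(check_ground_truth({3, 7, 12}, {3, 12, 20}))
# {'hits': {3, 12}, 'misses': {20}, 'extra': {7}, 'recall': 0.666...}
```

Raising k grows `hits` at the cost of growing `extra`, which is exactly the trade-off described above.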

## Prerequisites:
- Azure subscription with the following resources:
  - Azure AI Document Intelligence
  - Azure OpenAI
  - Azure AI Speech

Create a .env file and add the corresponding keys and endpoint information to it (see .env.example for an example).

## Installation:
I suggest using conda to create a virtual Python environment.

### Miniconda (example)
Install Miniconda according to the official documentation (select "add to PATH variable" during installation):
https://conda.io/en/latest/miniconda.html

In your file explorer, navigate to the analyzer directory, right-click, and select "open with conda prompt". In the prompt, enter:
`conda create --name document_analyzer python=3.11`

### Install the required packages:
In your file explorer, navigate to the analyzer directory, right-click, and open a command prompt terminal:
`conda activate document_analyzer`
Install the required packages with the following command:
`pip install -r requirements.txt`

## Running the application:
In your file explorer, navigate to the analyzer directory, right-click, and open a command prompt terminal:
`conda activate document_analyzer`
Run the application with the following command:
`streamlit run document_analyzer.py`
The application will open in a web browser window at http://localhost:8501/

## Setup:
### Project
You start by creating a new project, which creates a new folder in the projects folder.
If you put a logo.png file inside the project folder, it is used as the logo in the application; otherwise the default logo (img.png) is used.

### Documents
Now you can upload your documents to the project.
After that, you can start the analysis with Azure AI Document Intelligence.
Depending on the length of the document, the analysis can take several minutes.
When the analysis is finished, multiple JSON and Markdown files are created inside the files subfolder of your project (a sketch of deriving one of them follows the list):
- *.json: contains the raw output of the analysis
- *.md: contains the full text of the document in Markdown format
- *.pagecontent.json: contains the content of the pages in Markdown format, with the page number as the JSON key
- *.tables.md: contains the tables in Markdown format
- *.keyvalues.json: contains the key-value pairs in JSON format, with the page number as the JSON key
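
As a sketch of how the *.pagecontent.json file might be derived from the raw output: the field names below follow the snake_case shape produced by the SDK's `AnalyzeResult.to_dict()`, but the paths are illustrative and this is not necessarily the app's actual code.

```python
# Illustrative sketch: derive the per-page content file from the raw analysis
# JSON written by analyzer.py.
import json

with open("projects/demo/files/report.pdf.json", encoding="utf-8") as f:
    result = json.load(f)

pagecontent = {}
for page in result["pages"]:
    page_number = str(page["page_number"])
    # join the recognized lines of the page into one text block per page
    pagecontent[page_number] = "\n".join(line["content"] for line in page["lines"])

with open("projects/demo/files/report.pdf.pagecontent.json", "w", encoding="utf-8") as f:
    json.dump(pagecontent, f, ensure_ascii=False, indent=4)
```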

#### Chunks
Based on the analysis results from Document Intelligence, the paragraphs are used to chunk the document into smaller pieces:
the paragraphs with the sectionHeading role are used to decide how to split the text. If a section is larger than the token limit (default 512), it is split into smaller chunks, as sketched below.
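
A minimal sketch of that splitting logic (token counting is approximated with a whitespace split here; the app may use a real tokenizer):

```python
# Illustrative sketch: start a new section at every paragraph with the
# "sectionHeading" role, then split any section exceeding the token limit.
TOKEN_LIMIT = 512

def chunk_paragraphs(paragraphs: list[dict]) -> list[str]:
    # group paragraphs into sections at each section heading
    sections, current = [], []
    for para in paragraphs:
        if para.get("role") == "sectionHeading" and current:
            sections.append(" ".join(current))
            current = []
        current.append(para["content"])
    if current:
        sections.append(" ".join(current))

    # split oversized sections into smaller chunks
    chunks = []
    for section in sections:
        words = section.split()
        for i in range(0, len(words), TOKEN_LIMIT):
            chunks.append(" ".join(words[i:i + TOKEN_LIMIT]))
    return chunks
```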

### Topics
Now you can create topics, which creates a new folder in the topics subfolder of your project.
Inside the topic folder there are different text files:
- queries.txt: contains the queries that you want to use for the vector search
- questions.txt: contains the questions that you want to ask about the topic
- ground_truth.txt: contains the pages that humans would look at to answer the questions (for ground-truth checking)

## Usage
### Sidebar
On the sidebar you can select the project and the document that you want to analyze.
Then you can create or select a topic.

### Document Viewer
On the Document Viewer tab you can view the document (page range or full), the extracted tables, and the key-value pairs.

### Context Query
On the Context Query tab you can enter a query; the application searches for the most relevant pages and displays their content. This is also the context used on the Question Answering tab.

### Question Answering
The Question Answering tab allows you to ask questions about the context using text-davinci-003 or gpt-35-turbo from Azure OpenAI, as sketched below.
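
A minimal sketch of such a call with the pre-1.0 `openai` package against Azure OpenAI (the deployment name, context, and question are placeholders):

```python
# Illustrative sketch: answer a question from the retrieved context via an
# Azure OpenAI chat deployment. "gpt-35-turbo" must match your deployment name.
import os
import openai

openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]
openai.api_version = os.environ["AZURE_OPENAI_API_VERSION"]

context = "Page 12: Revenue grew by 8% in 2022 ..."
question = "How much did revenue grow in 2022?"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # Azure deployment name
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```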

## Support
The application is provided as is, without any support.
Feel free to use it as a starting point for your own application.

analyzer.py

+142

# Description: This script analyzes a document with the Form Recognizer Document Analysis API
# utilizing the General Document Model. The results are written to a JSON file next to the
# document in the project's files subfolder.
import os
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from azure.core.serialization import AzureJSONEncoder
from dotenv import load_dotenv, find_dotenv
import json


# loads the environment variables from the .env file
load_dotenv(find_dotenv(), override=True)

endpoint = os.environ["FORMS_ENDPOINT"]
key = os.environ["FORMS_KEY"]

def format_bounding_region(bounding_regions):
    if not bounding_regions:
        return "N/A"
    return ", ".join("Page #{}: {}".format(region.page_number, format_polygon(region.polygon)) for region in bounding_regions)

def format_polygon(polygon):
    if not polygon:
        return "N/A"
    return ", ".join(["[{}, {}]".format(p.x, p.y) for p in polygon])


def analyze_general_documents(projectname, documentname):
    # sample document
    #docUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

    # create your `DocumentAnalysisClient` instance and `AzureKeyCredential` variable
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    #analyze document from url
    #poller = document_analysis_client.begin_analyze_document_from_url(
    #    "prebuilt-document", docUrl)
    #result = poller.result()

    #analyze document from local file
    with open("projects/" + projectname + "/files/" + documentname, "rb") as f:
        print("Analyzing document...")
        poller = document_analysis_client.begin_analyze_document("prebuilt-document", f)
        result = poller.result()

    analyze_result_dict = result.to_dict()
    #write results to json file
    jsonfile = "projects/" + projectname + "/files/" + documentname + ".json"
    with open(jsonfile, "w", encoding="utf-8") as f:
        print("Writing results to json file...")
        json.dump(analyze_result_dict, f, cls=AzureJSONEncoder, ensure_ascii=False, indent=4)
    return jsonfile

""" for style in result.styles:
    if style.is_handwritten:
        print("Document contains handwritten content: ")
        print(",".join([result.content[span.offset:span.offset + span.length] for span in style.spans]))

print("----Key-value pairs found in document----")
for kv_pair in result.key_value_pairs:
    if kv_pair.key:
        print(
            "Key '{}' found within '{}' bounding regions".format(
                kv_pair.key.content,
                format_bounding_region(kv_pair.key.bounding_regions),
            )
        )
    if kv_pair.value:
        print(
            "Value '{}' found within '{}' bounding regions\n".format(
                kv_pair.value.content,
                format_bounding_region(kv_pair.value.bounding_regions),
            )
        )

for page in result.pages:
    print("----Analyzing document from page #{}----".format(page.page_number))
    print(
        "Page has width: {} and height: {}, measured with unit: {}".format(
            page.width, page.height, page.unit
        )
    )

    for line_idx, line in enumerate(page.lines):
        print(
            "...Line # {} has text content '{}' within bounding box '{}'".format(
                line_idx,
                line.content,
                format_polygon(line.polygon),
            )
        )

    for word in page.words:
        print(
            "...Word '{}' has a confidence of {}".format(
                word.content, word.confidence
            )
        )

    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                format_polygon(selection_mark.polygon),
                selection_mark.confidence,
            )
        )

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for region in table.bounding_regions:
        print(
            "Table # {} location on page: {} is {}".format(
                table_idx,
                region.page_number,
                format_polygon(region.polygon),
            )
        )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content,
            )
        )
        for region in cell.bounding_regions:
            print(
                "...content on page {} is within bounding box '{}'\n".format(
                    region.page_number,
                    format_polygon(region.polygon),
                )
            )
print("----------------------------------------") """


if __name__ == "__main__":
    print("Running general document analysis...")
    # example values: analyze_general_documents expects a project name and a
    # document name (the original call passed only a single file path)
    analyze_general_documents("demo", "DWS Annual Report 2022_EN.pdf")
