
Commit 0c10791

Initial Version
1 parent a3e2ca7 commit 0c10791

29 files changed: +1829816 −0 lines changed

.env.example

+7

AZURE_OPENAI_ENDPOINT="<enter your Azure OpenAI endpoint>"
AZURE_OPENAI_API_KEY="<enter your Azure OpenAI key>"
AZURE_OPENAI_API_VERSION="2023-03-15-preview"
SPEECH_KEY="<enter your Azure AI Speech key>"
SPEECH_REGION="<enter your Azure AI Speech region>"
FORMS_KEY="<enter your Azure AI Document Intelligence key>"
FORMS_ENDPOINT="<enter your Azure AI Document Intelligence endpoint>"

.gitignore

+1
.env

README.md

+101

# OpenAI Document Analyzer

![Preview Screenshot](preview.png)

## Introduction
This demo application was built to show how Azure AI Document Intelligence and the Azure OpenAI Service can be used to increase the efficiency of document analysis.

You can create a new project and upload your PDF documents to it. The documents are analyzed with Azure AI Document Intelligence and the results are stored in the project folder.

A FAISS vector search index is created for the documents. You can use similarity search to find relevant content in the documents and create a reduced version of the content, reducing the human effort needed for reading.
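
How the index is built is not shown in this README; as a rough illustration, here is a minimal sketch using LangChain's FAISS wrapper with Azure OpenAI embeddings. The use of LangChain, the deployment name, and all paths and texts are assumptions, not the app's actual code:

```python
# Illustrative sketch only: build a FAISS index over document chunks using
# Azure OpenAI embeddings via LangChain. Deployment name and paths are
# placeholders; the app's real indexing code may differ.
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings(
    deployment="text-embedding-ada-002",  # assumed Azure embedding deployment
    openai_api_type="azure",
    openai_api_base=os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

# chunks are the per-section texts; the metadata keeps the page number so
# search hits can be mapped back to pages (used for the ground-truth check)
chunks = ["First section text ...", "Second section text ..."]
metadatas = [{"page": 1}, {"page": 2}]

index = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)
index.save_local("projects/demo/faiss_index")  # illustrative path

# similarity search with a configurable k (see the note on k values below)
for doc in index.similarity_search("liquidity risk", k=4):
    print(doc.metadata["page"], doc.page_content[:80])
```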

Tables and key-value pairs found in the documents are extracted and can be viewed with the click of a button.

The reduced version of the content (a Markdown version of all pages where the similarity search found relevant content) can be used as context to ask questions about the documents with Azure OpenAI ("gpt-35-turbo" or "text-davinci-003").

The application also provides speech-to-text and text-to-speech functionality to increase its accessibility (currently English and German).
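
The Speech integration itself is not shown here; below is a minimal text-to-speech sketch with the Azure Speech SDK using the SPEECH_KEY and SPEECH_REGION values from .env. The voice name and sample text are placeholders:

```python
# Illustrative sketch: text-to-speech with the Azure Speech SDK, reading the
# key and region from the environment as in .env.example.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "de-DE-KatjaNeural"  # example German voice

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hallo, das ist ein Test.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
```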

Questions and queries are stored in the topics subfolder of the project, so that you can easily reuse them, e.g. for quantitative prompt testing with prompt flow.

For a good demonstration, I suggest importing a set of documents and defining the information (topics) that you try to find in them, together with the pages that humans would normally look at to gather that information.

These pages can be used to set the ground truth for the topic. If the ground truth is set, the application notifies you whether the pages found by the similarity search are part of the ground truth or not.

With a higher k value you can increase the likelihood of finding the relevant pages, but you also increase the number of pages that are not relevant, so the efficiency gain is reduced.

However, this is a great way to demonstrate the impact of proper prompts and k values on the results.
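
A minimal sketch of such a ground-truth check (the page numbers are made up, and the app's actual comparison logic may differ):

```python
# Illustrative sketch: compare the pages returned by the similarity search
# against the ground-truth pages a human would read.
def check_ground_truth(found_pages: set[int], ground_truth: set[int]) -> dict:
    hits = found_pages & ground_truth    # relevant pages that were found
    misses = ground_truth - found_pages  # relevant pages that were missed
    extra = found_pages - ground_truth   # retrieved but not in the ground truth
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return {"hits": hits, "misses": misses, "extra": extra, "recall": recall}

print(check_ground_truth({3, 7, 12}, {3, 12, 20}))
# {'hits': {3, 12}, 'misses': {20}, 'extra': {7}, 'recall': 0.666...}
```

Raising k grows `hits` at the cost of growing `extra`, which is exactly the trade-off described above.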

## Prerequisites:
- Azure subscription with the following resources:
  - Azure AI Document Intelligence
  - Azure OpenAI
  - Azure AI Speech

Create a .env file and add the corresponding keys and endpoint information to it (see .env.example for an example).

## Installation:
I suggest using conda to create a virtual Python environment.

### Miniconda (example)
Install Miniconda according to the official documentation (select "add to PATH variable" during installation):
https://conda.io/en/latest/miniconda.html

In your file explorer, navigate to the analyzer directory, right-click, and select "open with conda prompt". In the prompt, enter:
`conda create --name document_analyzer python=3.11`

### Install the required packages:
In your file explorer, navigate to the analyzer directory, right-click, and open a command prompt terminal:
`conda activate document_analyzer`
Install the required packages with the following command:
`pip install -r requirements.txt`

## Running the application:
In your file explorer, navigate to the analyzer directory, right-click, and open a command prompt terminal:
`conda activate document_analyzer`
Run the application with the following command:
`streamlit run document_analyzer.py`
The application will open in a web browser window at http://localhost:8501/

## Setup:
### Project
You start by creating a new project, which creates a new folder in the projects folder.
If you put a logo.png file inside the project folder, it is used as the logo in the application; otherwise the default logo (img.png) is used.

### Documents
Now you can upload your documents to the project.
After that, you can start the analysis with Azure AI Document Intelligence.
Depending on the length of the document, the analysis can take several minutes.
When the analysis is finished, multiple JSON and Markdown files are created inside the files subfolder of your project (a sketch of deriving one of them follows the list):
- *.json: contains the raw output of the analysis
- *.md: contains the full text of the document in Markdown format
- *.pagecontent.json: contains the content of the pages in Markdown format, with the page number as the JSON key
- *.tables.md: contains the tables in Markdown format
- *.keyvalues.json: contains the key-value pairs in JSON format, with the page number as the JSON key
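
As a sketch of how the *.pagecontent.json file might be derived from the raw output: the field names below follow the snake_case shape produced by the SDK's `AnalyzeResult.to_dict()`, but the paths are illustrative and this is not necessarily the app's actual code.

```python
# Illustrative sketch: derive the per-page content file from the raw analysis
# JSON written by analyzer.py.
import json

with open("projects/demo/files/report.pdf.json", encoding="utf-8") as f:
    result = json.load(f)

pagecontent = {}
for page in result["pages"]:
    page_number = str(page["page_number"])
    # join the recognized lines of the page into one text block per page
    pagecontent[page_number] = "\n".join(line["content"] for line in page["lines"])

with open("projects/demo/files/report.pdf.pagecontent.json", "w", encoding="utf-8") as f:
    json.dump(pagecontent, f, ensure_ascii=False, indent=4)
```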

#### Chunks
Based on the analysis results from Document Intelligence, the paragraphs are used to chunk the document into smaller pieces:
the paragraphs with the sectionHeading role are used to decide how to split the text. If a section is larger than the token limit (default 512), it is split into smaller chunks, as sketched below.
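
A minimal sketch of that splitting logic (token counting is approximated with a whitespace split here; the app may use a real tokenizer):

```python
# Illustrative sketch: start a new section at every paragraph with the
# "sectionHeading" role, then split any section exceeding the token limit.
TOKEN_LIMIT = 512

def chunk_paragraphs(paragraphs: list[dict]) -> list[str]:
    # group paragraphs into sections at each section heading
    sections, current = [], []
    for para in paragraphs:
        if para.get("role") == "sectionHeading" and current:
            sections.append(" ".join(current))
            current = []
        current.append(para["content"])
    if current:
        sections.append(" ".join(current))

    # split oversized sections into smaller chunks
    chunks = []
    for section in sections:
        words = section.split()
        for i in range(0, len(words), TOKEN_LIMIT):
            chunks.append(" ".join(words[i:i + TOKEN_LIMIT]))
    return chunks
```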

### Topics
Now you can create topics, which creates a new folder in the topics subfolder of your project.
Inside the topic folder there are different text files:
- queries.txt: contains the queries that you want to use for the vector search
- questions.txt: contains the questions that you want to ask about the topic
- ground_truth.txt: contains the pages that humans would look at to answer the questions (for ground-truth checking)

## Usage
### Sidebar
On the sidebar you can select the project and the document that you want to analyze.
Then you can create or select a topic.

### Document Viewer
On the Document Viewer tab you can view the document (page range or full), the extracted tables, and the key-value pairs.

### Context Query
On the Context Query tab you can enter a query; the application searches for the most relevant pages and displays their content. This is also the context used on the Question Answering tab.

### Question Answering
The Question Answering tab allows you to ask questions about the context using text-davinci-003 or gpt-35-turbo from Azure OpenAI, as sketched below.
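
A minimal sketch of such a call with the pre-1.0 `openai` package against Azure OpenAI (the deployment name, context, and question are placeholders):

```python
# Illustrative sketch: answer a question from the retrieved context via an
# Azure OpenAI chat deployment. "gpt-35-turbo" must match your deployment name.
import os
import openai

openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]
openai.api_version = os.environ["AZURE_OPENAI_API_VERSION"]

context = "Page 12: Revenue grew by 8% in 2022 ..."
question = "How much did revenue grow in 2022?"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # Azure deployment name
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```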

## Support
The application is provided as is, without any support.
Feel free to use it as a starting point for your own application.

analyzer.py

+142

# Description: This script analyzes a document with the Form Recognizer Document Analysis API
# utilizing the General Document Model. The results are written to a JSON file next to the
# document in the project's files subfolder.
import os
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from azure.core.serialization import AzureJSONEncoder
from dotenv import load_dotenv, find_dotenv
import json


# loads the environment variables from the .env file
load_dotenv(find_dotenv(), override=True)

endpoint = os.environ["FORMS_ENDPOINT"]
key = os.environ["FORMS_KEY"]

def format_bounding_region(bounding_regions):
    if not bounding_regions:
        return "N/A"
    return ", ".join("Page #{}: {}".format(region.page_number, format_polygon(region.polygon)) for region in bounding_regions)

def format_polygon(polygon):
    if not polygon:
        return "N/A"
    return ", ".join(["[{}, {}]".format(p.x, p.y) for p in polygon])


def analyze_general_documents(projectname, documentname):
    # sample document
    #docUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

    # create your `DocumentAnalysisClient` instance and `AzureKeyCredential` variable
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    #analyze document from url
    #poller = document_analysis_client.begin_analyze_document_from_url(
    #    "prebuilt-document", docUrl)
    #result = poller.result()

    #analyze document from local file
    with open("projects/" + projectname + "/files/" + documentname, "rb") as f:
        print("Analyzing document...")
        poller = document_analysis_client.begin_analyze_document("prebuilt-document", f)
        result = poller.result()

    analyze_result_dict = result.to_dict()
    #write results to json file
    jsonfile = "projects/" + projectname + "/files/" + documentname + ".json"
    with open(jsonfile, "w", encoding="utf-8") as f:
        print("Writing results to json file...")
        json.dump(analyze_result_dict, f, cls=AzureJSONEncoder, ensure_ascii=False, indent=4)
    return jsonfile

""" for style in result.styles:
    if style.is_handwritten:
        print("Document contains handwritten content: ")
        print(",".join([result.content[span.offset:span.offset + span.length] for span in style.spans]))

print("----Key-value pairs found in document----")
for kv_pair in result.key_value_pairs:
    if kv_pair.key:
        print(
            "Key '{}' found within '{}' bounding regions".format(
                kv_pair.key.content,
                format_bounding_region(kv_pair.key.bounding_regions),
            )
        )
    if kv_pair.value:
        print(
            "Value '{}' found within '{}' bounding regions\n".format(
                kv_pair.value.content,
                format_bounding_region(kv_pair.value.bounding_regions),
            )
        )

for page in result.pages:
    print("----Analyzing document from page #{}----".format(page.page_number))
    print(
        "Page has width: {} and height: {}, measured with unit: {}".format(
            page.width, page.height, page.unit
        )
    )

    for line_idx, line in enumerate(page.lines):
        print(
            "...Line # {} has text content '{}' within bounding box '{}'".format(
                line_idx,
                line.content,
                format_polygon(line.polygon),
            )
        )

    for word in page.words:
        print(
            "...Word '{}' has a confidence of {}".format(
                word.content, word.confidence
            )
        )

    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                format_polygon(selection_mark.polygon),
                selection_mark.confidence,
            )
        )

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for region in table.bounding_regions:
        print(
            "Table # {} location on page: {} is {}".format(
                table_idx,
                region.page_number,
                format_polygon(region.polygon),
            )
        )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content,
            )
        )
        for region in cell.bounding_regions:
            print(
                "...content on page {} is within bounding box '{}'\n".format(
                    region.page_number,
                    format_polygon(region.polygon),
                )
            )
print("----------------------------------------") """


if __name__ == "__main__":
    print("Running general document analysis...")
    # example values: analyze_general_documents expects a project name and a
    # document name (the original call passed only a single file path)
    analyze_general_documents("demo", "DWS Annual Report 2022_EN.pdf")
