Microsoft GraphRAG analysis, plus know-how and usage tips for an optimized GraphRAG
An analysis of the Microsoft GraphRAG architecture and how to use it, along with my thoughts on what an optimized GraphRAG would look like.
Nov 30, 2024
Contents
- GraphRAG auto-tuning provides rapid adaptation to new domains
- Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency
  - DRIFT Search: A step-by-step process
- GraphRAG: Improving global search via dynamic community selection
- LazyGraphRAG: Setting a new standard for quality and cost
- GraphRAG
  - Indexing
  - Prompt tuning
  - Query engine
    - Local Search
    - Global Search
  - Architecture
  - Dataflow
- Using the CLI
  - Index
  - Prompt-tune
  - Query
- Auto tuning
  - Document Selection Methods
- Manual tuning
- To close: points worth noting and improving
GraphRAG auto-tuning provides rapid adaptation to new domains
Published September 9, 2024
Manually creating and tuning a set of domain-specific prompts is time-consuming. We know, as all the prompts used for news articles were generated manually. To streamline this process, we developed an automated tool that generates domain-specific prompts, which are tuned and ready to use. This tool follows a human-like approach; we provided an LLM with a sample of text data (e.g., 1% of 10,000 chemistry papers) and instructed it to produce the prompts it deemed most applicable to the content. Now, with these automatically generated and tuned prompts, we can immediately apply GraphRAG to a new domain of our choosing, confident that we’ll get high-quality results.
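As a rough illustration of that human-like approach, here is a minimal sketch, assuming the OpenAI Python client and a hypothetical in-memory corpus; the instruction wording is my own, not the actual auto-tuner's:

```python
import random
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()

def draft_domain_prompt(documents: list[str], sample_ratio: float = 0.01) -> str:
    """Sample a small slice of the corpus and ask the LLM to propose a
    domain-specific extraction prompt (conceptual sketch, not the auto-tuner)."""
    n = max(1, int(len(documents) * sample_ratio))
    context = "\n\n".join(random.sample(documents, n))[:8000]  # crude token budget
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Read the following sample documents and write an "
                       "entity/relationship extraction prompt tailored to their "
                       "domain, including the entity types worth extracting:\n\n"
                       + context,
        }],
    )
    return response.choices[0].message.content
```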
During the indexing process, GraphRAG uses a set of prompts to instruct the LLM as it reads through the source content, extracting and organizing relevant information to construct the knowledge graph. Three of GraphRAG’s main indexing prompts include:
- Entity and relationship extraction: Identifies all the entities present and establishes relationships among them.
- Entity and relationship summarization: Consolidates instances of entities and their relationships into a single, concise description.
- Community report generation: Generates a summary report for each community within the constructed knowledge graph.
Each prompt is composed of four parts:
- Extraction instructions: Provide the LLM with guidance on how to perform extraction.
- Few-shot examples: Supply the LLM with real examples of the types of entities and relationships worth extracting.
- Real data: Serves as a placeholder that is replaced by chunks of source content.
- Gleanings: Encourage the LLM, over multiple turns, to extract additional information.
**Goal**
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

**Steps**
1. **Identify all entities.** For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity", <entity_name>, <entity_type>, <entity_description>)
2. **From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.** For each pair, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship", <source_entity>, <target_entity>, <relationship_description>, <relationship_strength>)
3. **Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.** Use {record_delimiter} as the list delimiter.
4. **When finished, output {completion_delimiter}.**

######################
**Examples**
######################

**Example 1**
Entity_types: ORGANIZATION, PERSON
Text: The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
Output:
("entity", CENTRAL INSTITUTION, ORGANIZATION, The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
("entity", MARTIN SMITH, PERSON, Martin Smith is the chair of the Central Institution)
("entity", MARKET STRATEGY COMMITTEE, ORGANIZATION, The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
("relationship", MARTIN SMITH, CENTRAL INSTITUTION, Martin Smith is the Chair of the Central Institution and will answer questions at a press conference, 9)

**Example 2**
Entity_types: ORGANIZATION
Text: TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform. TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
Output:
("entity", TECHGLOBAL, ORGANIZATION, TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
("entity", VISION HOLDINGS, ORGANIZATION, Vision Holdings is a firm that previously owned TechGlobal)
("relationship", TECHGLOBAL, VISION HOLDINGS, Vision Holdings formerly owned TechGlobal from 2014 until present, 5)

**Example 3**
Entity_types: ORGANIZATION, GEO, PERSON
Text: Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia. The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara. The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara. They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion. The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
Output:
("entity", FIRUZABAD, GEO, Firuzabad held Aurelians as hostages)
("entity", AURELIA, GEO, Country seeking to release hostages)
("entity", QUINTARA, GEO, Country that negotiated a swap of money in exchange for hostages)
("entity", TIRUZIA, GEO, Capital of Firuzabad where the Aurelians were being held)
("entity", KROHAARA, GEO, Capital city in Quintara)
("entity", CASHION, GEO, Capital city in Aurelia)
("entity", SAMUEL NAMARA, PERSON, Aurelian who spent time in Tiruzia's Alhamia Prison)
("entity", ALHAMIA PRISON, GEO, Prison in Tiruzia)
("entity", DURKE BATAGLANI, PERSON, Aurelian journalist who was held hostage)
("entity", MEGGIE TAZBAH, PERSON, Bratinas national and environmentalist who was held hostage)
("relationship", FIRUZABAD, AURELIA, Firuzabad negotiated a hostage exchange with Aurelia, 2)
("relationship", QUINTARA, AURELIA, Quintara brokered the hostage exchange between Firuzabad and Aurelia, 2)
("relationship", QUINTARA, FIRUZABAD, Quintara brokered the hostage exchange between Firuzabad and Aurelia, 2)
("relationship", SAMUEL NAMARA, ALHAMIA PRISON, Samuel Namara was a prisoner at Alhamia prison, 8)
("relationship", SAMUEL NAMARA, MEGGIE TAZBAH, Samuel Namara and Meggie Tazbah were exchanged in the same hostage release, 2)
("relationship", SAMUEL NAMARA, DURKE BATAGLANI, Samuel Namara and Durke Bataglani were exchanged in the same hostage release, 2)
("relationship", MEGGIE TAZBAH, DURKE BATAGLANI, Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release, 2)
("relationship", SAMUEL NAMARA, FIRUZABAD, Samuel Namara was a hostage in Firuzabad, 2)
("relationship", MEGGIE TAZBAH, FIRUZABAD, Meggie Tazbah was a hostage in Firuzabad, 2)
("relationship", DURKE BATAGLANI, FIRUZABAD, Durke Bataglani was a hostage in Firuzabad, 2)

######################
**Real Data**
######################
Entity_types: {entity_types}
Text: {input_text}
Output:
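For reference, here is a minimal sketch of parsing records in this format, assuming GraphRAG's commonly used default delimiters (`<|>` for tuples, `##` for records, `<|COMPLETE|>` for completion); these are configurable, so treat the values as assumptions:

```python
def parse_extraction_output(raw: str,
                            record_delim: str = "##",
                            tuple_delim: str = "<|>",
                            completion: str = "<|COMPLETE|>"):
    """Split the LLM output into entity and relationship tuples (sketch)."""
    entities, relationships = [], []
    for record in raw.replace(completion, "").split(record_delim):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(tuple_delim)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append({"name": fields[1], "type": fields[2],
                             "description": fields[3]})
        elif fields[0] == "relationship" and len(fields) == 5:
            relationships.append({"source": fields[1], "target": fields[2],
                                  "description": fields[3],
                                  "strength": float(fields[4])})
    return entities, relationships
```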
Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency
Published October 31, 2024
A method that adds community information to the retrieval process to improve the global reach of local search, generating detailed responses in a way that balances computational cost with quality outcomes.
DRIFT Search: A step-by-step process
![[Pasted image 20241103185025.png]]
Step 1. Primer: When a user submits a query, DRIFT compares it to the top K most semantically relevant community reports. This generates an initial answer along with several follow-up questions, which act as a lighter version of global search. To do this, we expand the query using Hypothetical Document Embeddings (HyDE) to increase sensitivity (recall), embed the query, look up the query against all community reports, select the top K and then use the top K to try to answer the query. The aim is to leverage high-level abstractions to guide further exploration.
Step 2. Follow-Up: With the primer in place, DRIFT executes each follow-up using a local search variant. This yields additional intermediate answers and follow-up questions, creating a loop of refinement that continues until the search engine meets its termination criteria, which is currently configured for two iterations (further research will investigate reward functions to guide terminations). This phase represents a globally informed query refinement. Using global data structures, DRIFT navigates toward specific, relevant information within the knowledge graph even when the initial query diverges from the indexing persona. This follow-up process enables DRIFT to adjust its approach based on emerging information.
Step 3. Output Hierarchy: The final output is a hierarchy of questions and answers ranked on their relevance to the original query. This hierarchical structure can be customized to fit specific user needs. During benchmark testing, a naive map-reduce approach aggregated all intermediate answers, with each answer weighted equally.
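A conceptual sketch of these three steps under stated assumptions: `embed` and `llm_answer` are hypothetical helpers (an embedding function and an LLM call returning an answer plus follow-up questions), not the graphrag API:

```python
import numpy as np

# Conceptual sketch only. Assumed helpers:
#   embed(text) -> vector
#   llm_answer(question, context_reports) -> (answer, follow_up_questions)
def drift_search(query, community_reports, embed, llm_answer, k=5, max_iters=2):
    # Step 1. Primer: HyDE-style expansion, then match against community reports.
    hyde_doc, _ = llm_answer(f"Write a passage that would answer: {query}", [])
    q_vec = embed(hyde_doc)
    ranked = sorted(community_reports,
                    key=lambda r: float(np.dot(q_vec, embed(r))),
                    reverse=True)
    answer, follow_ups = llm_answer(query, ranked[:k])
    qa_pairs = [(query, answer)]

    # Step 2. Follow-up: local-search-style refinement, capped at two iterations.
    for _ in range(max_iters):
        next_follow_ups = []
        for fq in follow_ups:
            a, more = llm_answer(fq, ranked[:k])  # stand-in for a local search call
            qa_pairs.append((fq, a))
            next_follow_ups.extend(more)
        follow_ups = next_follow_ups

    # Step 3. Output hierarchy: naive equal-weight aggregation, as in benchmarks.
    return qa_pairs
```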
Using HyDE.
![[Pasted image 20241103191520.png]]
Real-world application
![[Pasted image 20241103104026.png]]
If you're not familiar with Azure Cloud, or the cost estimation looks difficult and MS GraphRAG feels like too much overhead, I also recommend following the Neo4j GraphRAG approach.
Load the parquet output through the neo4j Bolt driver (roughly 10x faster over the network) and try this more efficient GraphRAG variant in your PoC!
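A minimal sketch of that pipeline, assuming pre-1.0 GraphRAG output file and column names (both vary by version) and a local Neo4j instance:

```python
import pandas as pd
from neo4j import GraphDatabase  # pip install neo4j pandas pyarrow

# File and column names follow pre-1.0 GraphRAG conventions; adjust to your version.
entities = pd.read_parquet("output/create_final_entities.parquet")
rels = pd.read_parquet("output/create_final_relationships.parquet")

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Batch-load with UNWIND instead of issuing one query per row.
    session.run(
        "UNWIND $rows AS row "
        "MERGE (e:Entity {name: row.title}) "
        "SET e.description = row.description",
        rows=entities[["title", "description"]].to_dict("records"),
    )
    session.run(
        "UNWIND $rows AS row "
        "MATCH (s:Entity {name: row.source}), (t:Entity {name: row.target}) "
        "MERGE (s)-[r:RELATED]->(t) SET r.description = row.description",
        rows=rels[["source", "target", "description"]].to_dict("records"),
    )
driver.close()
```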
GraphRAG: Improving global search via dynamic community selection
Published November 15, 2024
LazyGraphRAG: Setting a new standard for quality and cost
Published November 25, 2024
1. Microsoft GraphRAG overview
Indexing, Prompt Tuning, Query Engine, Architecture, and Dataflow
GraphRAG
Indexing
DataShaper
GraphRAG's standard indexing pipeline is built on top of the relational verbs that the DataShaper library provides. These verbs give us the ability to augment text documents with rich, structured data using the power of LLMs such as GPT-4. We utilize these verbs in our standard workflow to extract entities, relationships, claims, community structures, and community reports and summaries. This behavior is customizable and can be extended to support many kinds of AI-based data enrichment and extraction tasks.
Why it's used: the verbs decompose the pipeline into small steps that make knowledge-graph reasoning easier for the LLM to handle.
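To make the "verb" idea concrete, here is an illustrative pipeline in plain pandas; this is not the DataShaper API, just a sketch of composable table-in/table-out transformations:

```python
from functools import reduce
import pandas as pd

# Illustrative "verbs": each takes and returns a DataFrame, so they compose.
def chunk_text(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["chunks"] = df["text"].str.findall(r".{1,300}")  # naive fixed-size chunking
    return df.explode("chunks").reset_index(drop=True)

def extract_entities(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # In GraphRAG this step calls an LLM; a capitalized-word stub stands in here.
    df["entities"] = df["chunks"].str.findall(r"\b[A-Z][a-zA-Z]+\b")
    return df

def run_workflow(df: pd.DataFrame, verbs) -> pd.DataFrame:
    # Apply each verb in order, threading the table through the pipeline.
    return reduce(lambda acc, verb: verb(acc), verbs, df)

docs = pd.DataFrame({"text": ["Microsoft Research released GraphRAG in 2024."]})
print(run_workflow(docs, [chunk_text, extract_entities])[["chunks", "entities"]])
```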
Data types
1. graphml
2. parquet
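A quick way to inspect both formats; the file names below follow pre-1.0 GraphRAG defaults and vary by version, and the graphml snapshot may need to be enabled in settings.yaml:

```python
import pandas as pd
import networkx as nx  # pip install networkx pandas pyarrow

# Parquet tables: the tabular view of the index (names vary by version).
entities = pd.read_parquet("output/create_final_entities.parquet")
print(entities.head())

# GraphML: the same knowledge graph in a standard graph format, loadable by
# networkx, Gephi, or Neo4j tooling (path assumes graphml snapshots are on).
graph = nx.read_graphml("output/graph.graphml")
print(graph.number_of_nodes(), graph.number_of_edges())
```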
Prompt tuning
** The analysis was done with https://platform.openai.com/tokenizer.
Query engine
Local Search
Extracts entities and the relationships between them, and uses those for retrieval.
Global Search
Uses the relationships among Local Search's entities to build meta-level groupings (communities), organizes them into hierarchical levels, and uses the summaries produced at each abstraction level for retrieval.
It uses a map-reduce approach, as sketched below.
** Various approaches besides map-reduce exist. Each has different trade-offs, so I recommend choosing one that fits your purpose. Here is one well-written reference: TeddyNote's document summarization guide, https://teddylee777.github.io/langchain/summarize-chain/
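A minimal map-reduce sketch over community reports, assuming a hypothetical `llm(prompt) -> str` helper; the real global search also scores and filters intermediate answers:

```python
def global_search(query: str, community_reports: list[str], llm) -> str:
    """Map-reduce sketch of global search (llm() is a hypothetical helper)."""
    # Map: ask for a partial answer from each community report independently.
    partials = [
        llm(f"Using only this community report, answer '{query}'. "
            f"Reply 'NO ANSWER' if it is not relevant.\n\n{report}")
        for report in community_reports
    ]
    useful = [p for p in partials if "NO ANSWER" not in p]

    # Reduce: merge the partial answers into one final response.
    joined = "\n---\n".join(useful)
    return llm(f"Combine these partial answers into a final answer to "
               f"'{query}':\n\n{joined}")
```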
Architecture
Dataflow
Phase 1: Compose TextUnits
Phase 2: Graph Extraction
Phase 3: Graph Augmentation
Phase 4: Community Summarization
Phase 5: Document Processing
Phase 6: Network Visualization
Using the CLI
Index
Prompt-tune
Query
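A sketch of driving the three subcommands above from Python; the flags follow the graphrag CLI docs around the late-2024 releases (~0.4/0.5) and may differ in your installed version, so check `graphrag --help` first:

```python
import subprocess

# Assumes `graphrag init --root ./ragtest` has already created settings.yaml
# and the prompts/ directory; flags may vary by graphrag version.
ROOT = "./ragtest"

# Index: build the knowledge graph and the parquet artifacts.
subprocess.run(["graphrag", "index", "--root", ROOT], check=True)

# Prompt-tune: auto-generate domain-adapted prompts from a sample of the input.
subprocess.run(["graphrag", "prompt-tune", "--root", ROOT], check=True)

# Query: run a global (or local/drift) search against the built index.
subprocess.run(
    ["graphrag", "query", "--root", ROOT, "--method", "global",
     "--query", "What are the top themes in this corpus?"],
    check=True,
)
```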
Auto tuning
Document Selection Methods
The auto tuning feature ingests the input data and then divides it into text units the size of the chunk size parameter. After that, it uses one of the following selection methods to pick a sample to work with for prompt generation:
- random: Select text units randomly. This is the default and recommended option.
- top: Select the head n text units.
- all: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.
- auto: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.
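A sketch of the `auto` idea, assuming a hypothetical `embed_texts(texts) -> list[vector]` function; graphrag's actual implementation differs in its details:

```python
import numpy as np

def auto_select(text_units: list[str], embed_texts, k: int = 15) -> list[str]:
    """Pick the k text units nearest the embedding centroid (sketch of 'auto')."""
    vectors = np.asarray(embed_texts(text_units))   # shape: (n_units, dim)
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    nearest = np.argsort(distances)[:k]             # k nearest neighbors
    return [text_units[i] for i in nearest]
```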
Manual tuning
You can tune manually by changing the {} tokens mentioned above: edit the prompt .txt files in the init directory to suit your needs.
Note that, since this is not auto mode, everything auto filled in for you, such as the entity types and descriptions, must now be entered by hand. If you have a domain or metadata dictionary for your domain-specific use case, I recommend using it; otherwise, I don't recommend manual tuning. A better approach seems to be running auto tuning first, then iteratively improving the unsatisfactory results together with the domain experts.
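For example, a minimal sketch of pinning one of those tokens by hand; the file path follows `graphrag init` defaults and the entity types are placeholders standing in for your own domain dictionary:

```python
from pathlib import Path

# Path follows `graphrag init` defaults; adjust to your project layout.
prompt_path = Path("ragtest/prompts/entity_extraction.txt")
template = prompt_path.read_text(encoding="utf-8")

# Use str.replace rather than str.format so other literal braces in the
# template are left untouched. Entity types here are placeholder examples.
tuned = template.replace("{entity_types}", "drug, protein, disease, pathway")
prompt_path.write_text(tuned, encoding="utf-8")
```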
To close: points worth noting and improving
1. Compare the entity-related operations.
2. table <-> graph; Cosmos DB + k8s ... <-> neo4j
If one of these points suggests an improvement to you, contributing to the open source project is also a good idea.
For example, browse the further directions around community detection improvements and prompt engineering, and look for your own contribution point.
Below I've collected the further directions listed in each post; they're worth a look.
Further direction list
1. Our goal is to ensure that, whatever the constraints of the deployment context, there is a GraphRAG configuration that can accommodate these constraints while still delivering exceptional response quality.
https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/
2. We're exploring other methods to build on this auto-tuning work. We're excited to investigate new approaches for creating the core GraphRAG knowledge graph and are also studying ways to measure and evaluate the quality of these graph structures. Additionally, we're researching methods to better assess performance so that we can identify the types of queries where GraphRAG provides unique value. This includes evaluating human-generated versus auto-tuned prompts, as well as exploring potential improvements to the auto-tuner.
https://www.microsoft.com/en-us/research/blog/graphrag-auto-tuning-provides-rapid-adaptation-to-new-domains/
3. A future version of DRIFT will incorporate an improved version of Global Search that will allow it to more directly address questions currently serviced best by global search. The hope is to then move towards a single query interface that can service questions of both local and global varieties. This work will further evolve DRIFT's termination logic, potentially through a reward model that balances novel information with redundancy. Additionally, executing follow-up queries using either global or local search modes could improve efficiency. Some queries require broader data access, which can be achieved by leveraging a query router and a lite-global search variant that uses fewer community reports, tokens, and overall resources.
https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/
4. File format
table -> Iceberg ... table design
graph -> ...