The Software 2.0 Era: Evaluation Is the Biggest Challenge for AI Adoption

Feb 11, 2024


I recently reread Andrej Karpathy's Software 2.0 blog post. Coming back to it with a slightly better understanding of the AI market, it felt new all over again.

Software 2.0 TL;DR

Excerpts

Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software. They are Software 2.0.


The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer.


In contrast, Software 2.0 is written in much more abstract, human unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard.


Instead, our approach is to specify some goal on the behavior of a desirable program (e.g., “satisfy a dataset of input output pairs of examples”, or “win a game of Go”), write a rough skeleton of the code (i.e. a neural net architecture) that identifies a subset of program space to search, and use the computational resources at our disposal to search this space for a program that works.


It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program. Because of this and many other benefits of Software 2.0 programs that I will go into below, we are witnessing a massive transition across the industry where of a lot of 1.0 code is being ported into 2.0 code. Software (1.0) is eating the world, and now AI (Software 2.0) is eating software.

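The recipe in the excerpts above (specify a goal, write a rough skeleton, then search for a program that works) can be sketched in a few lines. This is a toy illustration, not code from Karpathy's post: the Celsius-to-Fahrenheit task, the two-parameter linear "architecture", and names such as `train_linear` are assumptions chosen to keep the example small.

```python
def fahrenheit_1_0(celsius: float) -> float:
    """Software 1.0: the programmer writes the rule explicitly."""
    return celsius * 9 / 5 + 32


def train_linear(pairs, steps=3000, lr=0.005):
    """Software 2.0 (toy version): search for weights that fit the dataset.

    Skeleton: y = w * x + b. Goal: minimize mean squared error on the
    input-output pairs. Search: plain gradient descent.
    """
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# The "goal" is only a dataset of examples; nobody types 9/5 or 32 below.
dataset = [(c, fahrenheit_1_0(c)) for c in (-10, -5, 0, 5, 10)]
w, b = train_linear(dataset)
print(f"learned program: f(c) = {w:.3f} * c + {b:.3f}")
```

The learned weights should land very close to 1.8 and 32, which is the same program the 1.0 function spells out by hand. Real networks differ mainly in scale: millions of weights instead of two, which is why nobody writes them directly.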

Thoughts

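The post was published in November 2017, well before GPT-3 was released in 2020 and shortly after the “Attention Is All You Need” paper came out in 2017.

When I read it a year ago it honestly did not click, but now Andrej's Software 2.0 analogy resonates strongly and gives me an excellent framework for understanding and following AI innovation.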

In Software 2.0, software (the neural network) is produced not by humans writing code by hand but from data and the goals we set.

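The post also lists the strengths and limitations of Software 2.0. For example, because the computation is simple, it can run on GPUs or custom chips, but the resulting networks behave like a “black box”. These are topics the industry is still debating today.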

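Hindsight is 20/20. My biggest regret is missing the Gen AI gold mine, which could have been spotted five years before the world had its ChatGPT moment in November 2022. I do not want to miss it again.

Was Andrej the only one who knew this at the time? No. It gives me renewed respect for the people who read where the industry was heading before the public did and took bold risks with conviction.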

Hurdles to AI Adoption

I have no doubt that AI, unlike other hype cycles, is an enormous opportunity for economic value creation. Yet outside of Big Tech, AI adoption in the field still feels slower than expected.

In the blog post “What the History of Software Development Tells Us about the Hurdles to Enterprise Adoption of LLMs”, Jooho Yeo argues that the cause is the lack of proper testing and evaluation tools for LLM applications.

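💡 The full post contains excellent analysis, so I strongly recommend reading the original. If you are interested in AI, I also suggest following the author on LinkedIn, where insights are posted frequently.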

TL;DR (Key Excerpts)

Role of DevOps in Software 1.0

The rise of DevOps tools contributed to software eating the world (as Marc Andreessen phrased it) in the 2010s. By the end of the 2010s, “every enterprise became a software company.” DevOps tools fueled the developer productivity needed to support the cloud and mobile platform shifts. With client expectations for reliability rising, tools that addressed testing and monitoring needs were especially critical for growing enterprise adoption.


As the software development life cycle matured, each step gave way to successful startups. Here are just a few:

While all steps of the software development process are important, ultimately the objective of developing software is to deliver value to users reliably. As a reflection of its significance, testing has been a large portion (20–30%) of enterprise software development budgets.


Beyond testing during the software development process, it is essential for businesses to monitor their software post-deployment to make sure it is operating as expected. According to Honeycomb, companies spend up to 30% of their infrastructure costs on observability.


Role of MLOps in AI Apps

As Machine Learning came into mainstream software development, it had different requirements from traditional software development, leading to rise of “MLOps”


Akin to Machine Learning, LLMs have changed how applications are developed. The pre-trained LLMs are sufficient for demos and even some MVPs. However, data is still necessary for evaluating not only the outputs of the customized models, but also those of the overall LLM application.


Text generation tasks are also much harder to evaluate than classification tasks. Classification tasks are either correct or incorrect, whereas generated text has multiple dimensions to be evaluated on like factual consistency, relevance, and coherence. It is a difficult task to not only comprehensively assess output quality, but also weigh them systematically to be helpful in the comparison.

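To see why weighing the dimensions systematically is the hard part, here is a minimal sketch of a weighted score over the three dimensions the excerpt names. The per-dimension scores, the weights, and names such as `DimensionScores` and `weighted_score` are all hypothetical; in practice the scores would come from human raters or an automated grader, and the post does not prescribe any particular weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScores:
    """Scores in [0.0, 1.0] for one generated text (hypothetical scale)."""
    factual_consistency: float
    relevance: float
    coherence: float


# Hypothetical weights: this team cares most about factual consistency.
DEFAULT_WEIGHTS = {"factual_consistency": 0.5, "relevance": 0.3, "coherence": 0.2}
EQUAL_WEIGHTS = {"factual_consistency": 1.0, "relevance": 1.0, "coherence": 1.0}


def weighted_score(scores: DimensionScores, weights=DEFAULT_WEIGHTS) -> float:
    """Collapse several dimensions into one comparable number."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(getattr(scores, name) * w for name, w in weights.items()) / total


def pick_better(a: DimensionScores, b: DimensionScores, weights=DEFAULT_WEIGHTS) -> str:
    """Return 'a' or 'b', whichever output scores higher under the weights."""
    return "a" if weighted_score(a, weights) >= weighted_score(b, weights) else "b"


output_a = DimensionScores(factual_consistency=0.9, relevance=0.6, coherence=0.8)
output_b = DimensionScores(factual_consistency=0.6, relevance=0.9, coherence=0.9)

print(pick_better(output_a, output_b))                 # factual-heavy weights
print(pick_better(output_a, output_b, EQUAL_WEIGHTS))  # equal weights
```

The same two outputs rank differently depending on the weights. A classification task has a single right answer, but here the "winner" depends on a judgment call about what matters for the use case.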

Problems with Evaluation of LLMs Today

LLM evaluation is currently a time-consuming process. When asked what percentage of their time they spend on testing and evaluation, respondents said that they spend 35%. In addition, the top issues with their current LLM evaluation process were that it takes “too much engineering resources” (76%) and “too much time” (68%).


The evaluation process is manual. At 64%, offline manual evaluation (done by ML engineers or team members) was by far the most common form of evaluating LLMs noted by practitioners.


Stitching together tools ad-hoc is time consuming and unscalable. 40% of respondents said their organization uses an internally developed solution for testing and evaluation, while only 4% said they are evaluating external solutions. The lack of available external options have driven companies to build their own solutions.


From selecting LLMs to monitoring the application in production, evaluation affects each stage of the LLM App Development Life Cycle. In LLM app development, evaluation and testing is no longer a single step, but continuously executed.


What we heard consistently was that engineers are searching for an all-in-one solution that covers everything from testing in development to monitoring in production.


The second highest desired quality was customizability (72%). While default test sets and metrics are helpful, practitioners are looking for solutions where they can also add testing and evaluation methods specific to their business needs.

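As a rough picture of what customizability could mean in practice, the sketch below runs a small set of default checks and lets a team register a business-specific check alongside them. Everything here is an assumption made for illustration: `fake_app` stands in for a real LLM application, and the refund rule is an invented example of a business requirement, not something from the survey.

```python
from typing import Callable, Dict, List

# A check receives (prompt, output) and returns True if the output passes.
Check = Callable[[str, str], bool]


def not_empty(prompt: str, output: str) -> bool:
    return bool(output.strip())


def under_500_chars(prompt: str, output: str) -> bool:
    return len(output) <= 500


DEFAULT_CHECKS: Dict[str, Check] = {
    "not_empty": not_empty,
    "under_500_chars": under_500_chars,
}


def mentions_refund_window(prompt: str, output: str) -> bool:
    """Hypothetical business rule: refund answers must state the 30-day window."""
    if "refund" not in prompt.lower():
        return True  # the rule only applies to refund questions
    return "30 days" in output


def run_suite(app: Callable[[str], str], prompts: List[str],
              checks: Dict[str, Check]) -> Dict[str, float]:
    """Return the pass rate of every check across all prompts."""
    if not prompts:
        raise ValueError("need at least one prompt")
    passed = {name: 0 for name in checks}
    for prompt in prompts:
        output = app(prompt)
        for name, check in checks.items():
            if check(prompt, output):
                passed[name] += 1
    return {name: count / len(prompts) for name, count in passed.items()}


def fake_app(prompt: str) -> str:
    """Stand-in for an LLM application (an assumption, not a real model call)."""
    if "refund" in prompt.lower():
        return "You can request a refund from your account page."
    return "Thanks for reaching out!"


prompts = ["How do I get a refund?", "Hello"]
checks = {**DEFAULT_CHECKS, "mentions_refund_window": mentions_refund_window}
report = run_suite(fake_app, prompts, checks)
print(report)
```

The stand-in app passes both generic checks but fails the business rule on the refund question, which is the kind of gap default test sets and metrics would not surface.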

Rise of Evaluation-Focused Startups

Evaluation is the bottleneck throughout the development life cycle.


Generic LLM benchmarks are insufficient to evaluate for specific business use cases. Even disregarding customizations, a LLM may perform very differently across general tasks vs specific business use cases. To borrow David Hershey’s analogy, benchmarks are like SATs: they are helpful in assessing one’s skills in a range of subjects. But what matters to enterprises are the job interviews: how the candidate performs on tasks pertinent to the specific business use case in that environment.


Thought leaders are predicting that 2024 will be the year when we will see companies move from prototype to production for genAI adoption. Evaluation will be the first step required to assess ROI for past projects and forecast investments needed for future projects.


Conclusion

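Wow. Drawing on the history of software development and on real-world data and cases, Jooho lays out the main bottleneck in AI adoption today and where things are heading.

There is enormous demand to adopt AI, yet the QA tools needed to build successful AI applications are still far too scarce. As Jooho says, this looks like a very good problem for startups to tackle.

I fully agree that testing, evaluation, and monitoring tools (broadly speaking, AIOps tools) have to come first for the AI application market to truly bloom.

Given how important this is, I plan to publish an analysis of Datadog (NASDAQ: DDOG) soon. It trades at one of the top 10 multiples among publicly listed enterprise SaaS companies.

Reading the post also sparked two further thoughts: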

Market Sizing

“Menlo Ventures estimated that enterprises spent approximately $2.5B on genAI in 2023, while all AI spend was about $70B. Since companies have allocated 20–30% of software development budgets for quality assurance and testing in the past, we can expect the LLM evaluation budget to be in the $500M-750M range.”

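Drawing on the 20 to 30% of budget that traditional software development has spent on QA, Jooho conservatively sizes the 2023 market opportunity at $500M to $750M. Naturally, as the Gen AI market grows, so will the opportunity for LLM QA.

Andrej, on the other hand, wrote this in the Software 2.0 post: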
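The arithmetic behind that range is simple enough to check directly. The sketch below uses only the figures from the quote (about $2.5B of genAI spend in 2023 and a 20 to 30% testing share); the function name and the choice to work in millions of dollars are assumptions made for the example.

```python
def budget_range_musd(total_spend_musd: int, low_pct: int, high_pct: int):
    """Return (low, high) in millions of USD for a percentage share of spend."""
    low = total_spend_musd * low_pct // 100
    high = total_spend_musd * high_pct // 100
    return low, high


GENAI_SPEND_2023_MUSD = 2_500  # ~$2.5B, the Menlo Ventures estimate in the quote

low, high = budget_range_musd(GENAI_SPEND_2023_MUSD, 20, 30)
print(f"Estimated LLM evaluation budget: ${low}M to ${high}M")
```

Changing the percentages makes it easy to see how the estimate moves if the QA share turns out to differ from the historical 20 to 30%.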

This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labelers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces.

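In the Software 2.0 era, the need for 1.0 engineers shrinks. Instead of writing code directly, I expect more resources to go toward training neural nets and making sure they run correctly and accurately (QA).

If so, isn't it likely that a much larger share than the 20 to 30% of budget spent in the 1.0 era will go to QA?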

Regulation

Regulation is a major wild card for the AI market. Opinions on whether regulation is needed are divided.

For various geopolitical and social reasons, governments' stance on AI is currently “more bark than bite”, but I believe regulation is coming one way or another.

Each country applying its own regulation also means an opening for localization. The LLM QA market will likely need separate services tailored to each country's regulation, which looks like an opportunity for Korean startups to capture part of the QA market.
