Cybersecurity has become a significant area of interest in artificial intelligence, driven by the growing dependence on large software systems and the expanding capabilities of AI tools. As threats grow in complexity, securing software systems has become more than a matter of traditional defense; it now intersects with automated reasoning, vulnerability investigation, and code-level understanding. Modern cybersecurity requires tools and methods that can mimic real-world attack scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers are developing benchmarks and methods to systematically evaluate AI agents' ability to understand, discover, and exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and the complexity of real-world cybersecurity remains a major challenge.
The Problem with Existing Benchmarks
A pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Current testing methods are dominated by simplified benchmark tasks, which rarely mirror the messy, layered reality of large software repositories. That reality includes complex input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without strong evaluation methods, it is difficult to determine whether AI agents can be trusted to find vulnerabilities or perform tasks such as exploit development. More importantly, current benchmarks do not reflect the scale and nuance of vulnerabilities found in widely used, actively maintained software systems, leaving a critical evaluation gap.
Limitations of Current Tools
Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and NYU CTF Bench. These focus on capture-the-flag-style tasks of limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to include real-world vulnerabilities, but they often do so on a limited scale. Moreover, many tools depend on synthetic test cases or challenge problems, which fail to represent the variety of software inputs, execution paths, and bug types found in real systems. Specialized agents built for security analysis have likewise been tested on benchmarks with only tens or a few hundred tasks, which falls short of the complexity of real-world threat landscapes.
Introducing CyberGym
Researchers at the University of California, Berkeley introduced CyberGym, a large-scale, comprehensive benchmark designed to evaluate AI agents in real-world cybersecurity contexts. CyberGym consists of 1,507 distinct benchmark tasks derived from actual vulnerabilities in 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must produce a proof-of-concept (PoC) test that reproduces the vulnerability in the unpatched version, and CyberGym evaluates success based on whether the PoC triggers the vulnerability pre-patch and fails to do so post-patch. The benchmark emphasizes PoC generation, a task that requires agents to traverse complex code paths and synthesize inputs that satisfy specific security conditions. CyberGym is modular and containerized, enabling easy extension and reproducibility.
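To make this success criterion concrete, here is a minimal sketch of how a harness in the spirit of CyberGym might verify an agent-generated PoC. The function names, file paths, and sanitizer-output check are illustrative assumptions, not the benchmark's actual implementation.

```python
import subprocess

def crashes(binary_path: str, poc_path: str, timeout: int = 30) -> bool:
    """Run a target binary on a PoC input and report whether it crashes.

    A nonzero exit code or a sanitizer report on stderr is treated as a
    crash. This is a simplified heuristic, not CyberGym's exact logic.
    """
    try:
        result = subprocess.run(
            [binary_path, poc_path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as reproductions here
    sanitizer_hit = b"ERROR: AddressSanitizer" in result.stderr
    return result.returncode != 0 or sanitizer_hit

def poc_reproduces_vulnerability(pre_patch_bin: str, post_patch_bin: str, poc: str) -> bool:
    """Success: the PoC crashes the pre-patch build but not the post-patch build."""
    return crashes(pre_patch_bin, poc) and not crashes(post_patch_bin, poc)

# Hypothetical usage:
# ok = poc_reproduces_vulnerability("./target_prepatch", "./target_postpatch", "poc.bin")
```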
CyberGym's Evaluation Levels
The evaluation pipeline in CyberGym is built around four difficulty levels, which differ in the amount of information provided to the agent. At level 0, the agent is given only the codebase, with no hint of the vulnerability. Level 1 adds a natural-language description of the vulnerability. Level 2 provides the ground-truth proof of concept (PoC) and the crash stack trace, while level 3 also includes the patch itself and the post-patch codebase. Each level demands a different kind of reasoning; at level 1, for example, agents must infer the location and context of the vulnerability from its textual description and the codebase alone. To ensure benchmark quality, CyberGym applies filters such as checking patch commit messages, validating PoC reproducibility, and removing redundancy by comparing stack traces. Codebases in the final dataset contain an average of 1,117 files and 387,491 lines of code, with some exceeding 40,000 files and 7 million lines. Patch size also varies, with a median of 1 file and 7 lines changed, but sometimes spanning 40 files and 3,000 lines. The vulnerabilities cover a range of crash types, with 30.4% related to heap-buffer-overflow reads and 19.0% to use of uninitialized values.
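To illustrate how the difficulty levels change what the agent sees, below is a hypothetical sketch of a task record and a helper that assembles the inputs per level. The field names are assumptions for illustration and do not come from CyberGym's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CyberGymTask:
    # Hypothetical field names; CyberGym's real schema may differ.
    codebase_dir: str                  # pre-patch source tree (always available)
    description: Optional[str]         # natural-language vulnerability description
    ground_truth_poc: Optional[bytes]  # known crashing input
    crash_stack_trace: Optional[str]   # sanitizer stack trace for the crash
    patch_diff: Optional[str]          # the fixing commit's diff
    post_patch_dir: Optional[str]      # patched source tree

def inputs_for_level(task: CyberGymTask, level: int) -> dict:
    """Assemble the information an agent receives at each difficulty level."""
    inputs = {"codebase_dir": task.codebase_dir}             # level 0: code only
    if level >= 1:
        inputs["description"] = task.description             # level 1: + description
    if level >= 2:
        inputs["ground_truth_poc"] = task.ground_truth_poc   # level 2: + PoC and stack trace
        inputs["crash_stack_trace"] = task.crash_stack_trace
    if level >= 3:
        inputs["patch_diff"] = task.patch_diff                # level 3: + patch and patched code
        inputs["post_patch_dir"] = task.post_patch_dir
    return inputs
```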
Experimental Results
When tested against this benchmark, existing agents showed limited success. Across four agent frameworks, OpenHands, Codex, EnIGMA, and Cybench, the top performer was OpenHands paired with Claude 3.7 Sonnet, which reproduced only 11.9% of the target vulnerabilities. Performance dropped sharply for longer PoC inputs: success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models such as DeepSeek-V3 lagged behind, with only a 3.6% success rate. Even models specialized for coding, such as SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Richer input information at higher levels increased performance: level 3 achieved 17.1% success, while level 0 reached only 3.5%. The analysis also revealed that most successful PoC reproductions occurred within 20 to 40 execution steps, with many runs exceeding 90 steps and ultimately failing. Despite these challenges, the agents uncovered previously unknown zero-day vulnerabilities and two disclosed but unpatched ones in real-world projects, demonstrating their latent ability to discover novel flaws.
Key Takeaways
- Benchmark Scale and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 open-source software projects, making it the largest and most realistic benchmark of its kind.
- Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of the vulnerabilities, with many combinations scoring under 5%.
- Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.
- Length Sensitivity: Agents struggle with tasks requiring long PoCs. PoCs over 100 bytes, which make up 65.7% of the dataset, had the lowest success rates.
- Discovery Potential: 15 new zero-day vulnerabilities were discovered via agent-generated PoCs, underscoring their potential use in real-world security analysis.
- Model Behavior: Most successful reproductions occurred early in a run, with diminishing returns after roughly 80 steps.
- Tool Interactions: Agents performed better when allowed to adapt their PoCs based on tools (e.g., awk, grep, xxd) and runtime feedback, as illustrated in the sketch following this list.
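As a hedged illustration of this feedback-driven workflow, the sketch below shows one way an agent loop might refine a PoC using crash output. The toy mutation strategy and the function names are assumptions for illustration, not CyberGym's or any agent framework's actual code; a real agent would reason over the crash output rather than mutate at random.

```python
import random
import subprocess
from typing import Optional

def run_target(binary: str, poc: bytes, timeout: int = 10) -> str:
    """Execute the target on a candidate PoC (via stdin) and return its combined output."""
    try:
        result = subprocess.run([binary], input=poc, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return (result.stdout + result.stderr).decode(errors="replace")

def mutate(poc: bytes) -> bytes:
    """A toy mutation: flip one random byte or append a random byte."""
    data = bytearray(poc) or bytearray(b"\x00")
    if random.random() < 0.5:
        data[random.randrange(len(data))] ^= 0xFF
    else:
        data += bytes([random.randrange(256)])
    return bytes(data)

def refine_poc(binary: str, seed: bytes, max_steps: int = 40) -> Optional[bytes]:
    """Iteratively adjust a PoC until the target reports a sanitizer crash."""
    poc = seed
    for _ in range(max_steps):
        output = run_target(binary, poc)
        if "AddressSanitizer" in output:   # crash reproduced
            return poc
        poc = mutate(poc)                  # otherwise adjust and retry
    return None
```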
Conclusion
In conclusion, this study highlights an important point: evaluating AI in cybersecurity is not only challenging but necessary for understanding its limitations and capabilities, and CyberGym stands out for doing so with a large-scale, real-world framework. The researchers address the issue with a practical and detailed benchmark that compels agents to work with full codebases, reason about vulnerabilities, generate valid PoCs, and iterate. The results make it clear that while current agents show promise, especially in finding new bugs, there is still a long way to go before they can reliably contribute to cybersecurity at scale.
Check out the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project.

