1. What is a benchmark for AI?
A benchmark is essentially a target that an AI system must achieve: a way of specifying what you want your tool to do and then measuring progress toward that aim. One example is ImageNet, a collection of almost 14 million photos that researchers use to test their image-classification methods. The objective is to correctly label as many images as possible.
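As a minimal illustration of how such a benchmark is scored (a sketch, not ImageNet's official tooling; the labels and predictions below are made up), image classification is typically reported as top-1 accuracy, the fraction of images whose predicted label matches the ground truth:

```python
def top1_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of images where the model's top prediction
    matches the ground-truth label."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 3 of 4 predictions match the ground truth.
preds = ["tabby cat", "golden retriever", "pizza", "sports car"]
truth = ["tabby cat", "golden retriever", "bagel", "sports car"]
print(f"top-1 accuracy: {top1_accuracy(preds, truth):.2%}")  # 75.00%
```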
2. What did the research results show?
While some models still fall short of 90% accuracy on these benchmarks, they nonetheless outperform the human baseline. The Visual Question Answering Challenge, for example, tests AI systems with open-ended textual questions about images. This year, the top-performing model reached an accuracy of 84.3%, while the human baseline is approximately 80%.
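For context, VQA scores open-ended answers by consensus among human annotators rather than by a single gold label. A simplified sketch of that idea (the official metric also averages over subsets of annotators, which is omitted here) credits an answer in proportion to how many annotators gave it, capped at three matches:

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA-style consensus score: an answer counts as
    fully correct if at least 3 human annotators gave it."""
    normalized = predicted.strip().lower()
    matches = sum(1 for ans in human_answers
                  if ans.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0.
answers = ["red"] * 4 + ["maroon"] * 3 + ["dark red"] * 3
print(vqa_accuracy("red", answers))  # 1.0
```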
3. Why do they think this is an opportunity?
Perhaps researchers need more comprehensive benchmarks; current benchmarks mostly test against a single, narrow goal. But as the field moves toward AI tools that combine vision, language, and other capabilities, do we need benchmarks that help researchers understand the tradeoffs, for example between accuracy and bias or toxicity? Can they take more social factors into account? A great deal cannot be quantified. The researchers see this as an opportunity to reconsider what we want from these technologies.
They point to HELM because they work closely with the Center for Research on Foundation Models (CRFM). HELM, developed by CRFM researchers, evaluates models across many scenarios and tasks and is more thorough than previous benchmarks: it considers not just accuracy but also fairness, toxicity, efficiency, robustness, and other factors. That is only one example, and more such approaches are needed. Because benchmarks steer AI research, they must better reflect how we, as humans and as a society, want to interact with these technologies.
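To make the idea concrete (a generic sketch, not HELM's actual code; the scenario names, dimensions, and numbers below are hypothetical), a multi-dimensional evaluation records several scores per scenario instead of collapsing everything into one leaderboard number:

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    """Scores for one benchmark scenario, on a 0-1 scale where
    higher is better (toxicity is reported as 1 - toxicity rate)."""
    accuracy: float
    fairness: float
    non_toxicity: float
    robustness: float
    efficiency: float

def report(results: dict[str, ScenarioResult]) -> None:
    """Print a per-scenario, per-dimension table so tradeoffs
    between dimensions stay visible."""
    dims = ["accuracy", "fairness", "non_toxicity",
            "robustness", "efficiency"]
    print(f"{'scenario':<20}" + "".join(f"{d:>14}" for d in dims))
    for name, r in results.items():
        print(f"{name:<20}"
              + "".join(f"{getattr(r, d):>14.2f}" for d in dims))

# Hypothetical numbers for two scenarios, for illustration only.
report({
    "question_answering": ScenarioResult(0.81, 0.74, 0.96, 0.69, 0.88),
    "summarization":      ScenarioResult(0.77, 0.79, 0.93, 0.72, 0.85),
})
```

Reporting per-dimension scores like this keeps a model's weaknesses (say, low robustness) from being hidden behind a strong headline accuracy.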