Computer scientists have long toyed with the idea of creating computers that could write programs for other computers. Artificial intelligence is an obvious technology for the task, and it has previously been applied to programming on a small scale, but the results have been limited.
Artificial intelligence is one of the most powerful and versatile technologies in use today. It can understand and generate speech, analyze documents, recognize images and characters, drive cars, pilot warplanes, write papers, and perform thousands of other valuable operations.
Dr. Ruchir Puri is an IBM Fellow and Chief Scientist for IBM Research. For the past two and a half years, Dr. Puri and a team from IBM Research and the MIT-IBM Watson AI Lab have worked on a massive AI for Code project. At this year’s IBM Think 2021 conference, Arvind Krishna, Chairman and CEO of IBM, revealed the results of that effort and announced a brilliant piece of research called Project CodeNet – a large dataset aimed at teaching AI to code.
CodeNet’s name is a nod to ImageNet, the groundbreaking computer-vision dataset. ImageNet contains over 14 million images across 20,000 categories, and it is widely credited with driving much of the progress in computer-vision algorithms and the compute built to run them. The CodeNet dataset has similar potential to drive innovation and growth in future AI for Code products.
CodeNet is unprecedented
There are several datasets available for source code, but most are useful only for a few targeted tasks. Two popular datasets, GCJ and POJ, are compared to CodeNet in the above chart. GCJ was collected from submissions to Google’s Code Jam coding competition from 2008 to 2020. POJ-104 contains code samples collected from an educational website designed for programming competitions. While these datasets are narrow in scope, CodeNet is designed for broad and comprehensive use.
CodeNet is unprecedented. It is a large-scale open-source dataset with 14 million code samples and 500 million lines of code. CodeNet also has over 4,000 problem descriptions and 55 programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
In addition to that rich variety of languages and problems, roughly 80% of the problems have over a hundred solutions each. IBM has annotated the code samples with a rich collection of metadata, such as code size, memory footprint, CPU run time, and a status field indicating acceptance or error type.
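Annotations like these make it straightforward to curate subsets of the dataset, for example, keeping only accepted solutions in a given language. The sketch below illustrates the idea against a tiny in-memory CSV; the column names are illustrative assumptions, not CodeNet’s documented schema.

```python
import csv
import io

# Hypothetical metadata rows in the spirit of CodeNet's per-problem CSVs;
# the column names here are assumptions for illustration only.
metadata_csv = """submission_id,language,status,cpu_time_ms,memory_kb,code_size_bytes
s001,Python,Accepted,52,9120,410
s002,C++,Accepted,8,3200,780
s003,Python,Runtime Error,40,9000,395
s004,COBOL,Accepted,120,15000,2200
"""

def accepted_samples(csv_text, language=None):
    """Return rows whose status is Accepted, optionally filtered by language."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        r for r in rows
        if r["status"] == "Accepted"
        and (language is None or r["language"] == language)
    ]

accepted = accepted_samples(metadata_csv, language="Python")
print([r["submission_id"] for r in accepted])  # only the accepted Python run
```

The same filter-on-metadata pattern extends naturally to run-time or memory thresholds, which is what makes the annotations useful for training models on performance-related tasks.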
Modernizing our software infrastructure is essential
IBM’s efforts to advance AI for Code are part of a larger objective. There is an urgent need to modernize the legacy software infrastructure that runs throughout global enterprise and government. A large portion of today’s enterprise code originated from multi-year efforts by programming teams whose membership fluctuated over time. As a result, inconsistent thinking and planning went into the giant monoliths of code that now exist as legacy software. For those reasons, it is difficult for today’s developers to modernize legacy code or to make changes that improve performance, accommodate new features, or realign it with new regulations.
COBOL, a legacy giant
There are many programs written in a legacy programming language called COBOL (an acronym for “common business-oriented language”). COBOL was developed in 1959 under the sponsorship of the U.S. Department of Defense, which heavily promoted it in the years that followed. Over the decades, it grew into an industry standard, and it is still used in many business applications, banking transactions, and government administrative mainframes.
Because it is so widely used, manually upgrading COBOL programs would be a significant undertaking in terms of human effort, money, and scope. COBOL was once considered an advanced programming language, but it now makes up a large part of our outdated digital infrastructure, and it is time for COBOL programs to be modernized.
Simple problem statement, but a huge task
The problem statement is simple: a huge legacy base of COBOL must be migrated to new platforms and rewritten in more contemporary programming languages. The task is enormous because IBM estimates that about 200 to 220 billion lines of legacy COBOL code need to be rewritten. At a cost of 32 to 50 cents per line using human programmers, the modernization of COBOL represents roughly a $100 billion problem. How long a complete rewrite would take using human programmers has yet to be determined.
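The back-of-the-envelope arithmetic behind that figure is easy to check. Multiplying the low and high ends of the estimates above gives a range that brackets the $100 billion number:

```python
# Cost range for rewriting legacy COBOL, per the estimates cited above.
low_lines, high_lines = 200e9, 220e9   # 200-220 billion lines of code
low_rate, high_rate = 0.32, 0.50       # 32-50 cents per rewritten line

low_cost = low_lines * low_rate    # best case: $64 billion
high_cost = high_lines * high_rate # worst case: $110 billion
print(f"${low_cost/1e9:.0f}B to ${high_cost/1e9:.0f}B")  # $64B to $110B
```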
The problem is further complicated because COBOL is less popular than it once was. Many skilled COBOL coders have already retired or are about to retire. The extent of the pronounced shortage of experienced COBOL programmers first became evident at the beginning of the Covid-19 pandemic.
Dr. Puri said, “Nothing brought the shortage of COBOL programmers to the forefront more than the pandemic. We are still in the middle of a pandemic in many parts of the world. Luckily, in the United States we are starting to turn the corner, but you still see many headlines that COBOL coders are needed for the coronavirus fight.”
Dr. Puri also pointed out another subtle but significant factor that affects the availability of skilled COBOL programmers. He said people entering the software engineering field prefer not to work on COBOL projects; they would rather work with newer languages.
One step at a time
Translating legacy code into a new language is complicated. Complexity stems from the contextual nature of programming languages. Most languages use statements to direct operations to be performed by the computer. Each statement is performed within the nuanced context of a set of conditions needed to carry out the operation. Context in human language may extend to a few paragraphs, but for code translations, context may be scattered across multiple libraries. Consequently, determining context is difficult for artificial intelligence. Moreover, the larger the program, the more difficult the translation becomes.
CodeNet comes with a set of developer tools and AI models for code understanding, which will continue to be enhanced with improved algorithms to extract context and translate legacy code into modern languages. To support this, CodeNet includes sample input and output test sets for over 7 million code samples.
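Input/output test sets like these are typically used to verify that a candidate or machine-translated program behaves the same as the original. A minimal sketch, assuming the program under test is a command that reads stdin and writes stdout (the `adder` example here is an illustrative stand-in, not part of CodeNet):

```python
import subprocess
import sys

def passes_tests(command, io_pairs, timeout=5):
    """Run `command` on each test input; return True only if every run
    exits cleanly and its stdout matches the expected output
    (ignoring surrounding whitespace)."""
    for test_input, expected in io_pairs:
        result = subprocess.run(
            command, input=test_input, capture_output=True,
            text=True, timeout=timeout,
        )
        if result.returncode != 0:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

# Example: a tiny "sum the numbers on one line" program, written inline.
adder = [sys.executable, "-c",
         "print(sum(int(x) for x in input().split()))"]
print(passes_tests(adder, [("1 2", "3"), ("10 20", "30")]))  # True
```

A harness along these lines is what makes a translated program checkable: if the rewritten version passes the same input/output pairs as the original, that is strong evidence the translation preserved behavior.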
Early success proves CodeNet’s capabilities and value
The IBM website highlights an early example of how CodeNet was used to modernize legacy code:
“For example, a large automotive client approached IBM to help update a $200 million asset consisting of 3,500 multi-generation Java files. These files consisted of more than one million lines of code, developed over a decade with multiple generations of Java technology.
It was a complex monolithic application code, not conducive to cloud-based environments. By applying our AI for Code stack, we reduced the business’s year-long ongoing code migration process down to just four weeks, modernizing and generating over 25 new cloud-native microservices by refactoring the legacy monolithic application code.”
According to Dr. Puri, the CodeNet dataset can enable AI breakthroughs that will make developers more productive and drive the creation of innovative AI for Code algorithms. “These capabilities all relate to AI’s ability to understand code,” he said. “In a broad sense, over time, AI will make developers more productive by handling routine tasks they now do on a regular basis.”
In the future, he says AI for Code will evolve like the path taken by AI for human language understanding, although in this case it will build upon the rich body of knowledge in program analysis. Technologies will be developed that transcend the diversity of programming languages through reasoning about the underlying control and data flow representations of the code. Rather than merely executing operating statements, artificial intelligence will become more useful as it acquires a deeper understanding of the essence of a programming language. Moreover, as AI begins to take on more and more of the routine day-to-day tasks, developers will move up to the next level and devote more time to creating new techniques and algorithms.
Here is what Dr. Puri believes software development will look like using breakthroughs enabled by CodeNet and its AI tools:
- If code is not functional, AI will analyze it and provide recommendations the developer can use to make it functional, given the constraints of the test data.
- If the code does not run fast enough, AI will provide recommendations to improve its performance.
- AI will be able to improve a program’s memory consumption. If code consumes too much memory, AI will recommend ways to reduce its memory footprint.
- AI can search for similar code, and it will not be restricted to the language of the program being written; it can search for similar code across multiple languages.
- Breakthroughs enabled by CodeNet will augment existing program analysis methods to deliver one of the most powerful features — developers will be able to translate code into many other languages.
In the future, can CodeNet AI program itself?
When asked if CodeNet AI might be able to program itself, Dr. Puri gave an interesting answer:
“I want to answer that question in two ways,” he said. “I do expect CodeNet AI to self-learn in terms of improvements it needs to have. I would say that might happen in a couple of years or so. Having said that, CodeNet and AI for Code is far, far away from coming up with new algorithms on its own. I do expect AI to have self-learning where it realizes the weaker parts of its training and automatically corrects those gaps by even synthesizing some of the data. Through algorithmic breakthroughs enabled by CodeNet, progress in AI for Code will be significantly accelerated, enabling developers to focus on more creative programming endeavors.”
- It is hard to overstate CodeNet’s long-term potential value. It is a major milestone for AI for Code. Just as ImageNet changed the trajectory of AI for Computer Vision, CodeNet will do likewise for AI for Code.
- A 2019 IEEE study showed that developers only spend 48% of their time on actual code development. By handling routine tasks, and providing tools and AI recommendations to improve code, CodeNet will make that 48% significantly more productive.
- It is important that IBM developed CodeNet as open source. That assures CodeNet will have community support for continual improvements and new algorithms. It will also accelerate adoption which in turn accelerates improvements.
- Using CodeNet on scale projects will result in the creation of more innovative algorithms, which in turn will further improve the efficiency of downstream coding projects.
- Given CodeNet’s anticipated efficiency and speed of translation, bulk translation of legacy code is such a large task that, once underway, it is likely to have a small but measurable impact on national productivity.
- Converting legacy code to newer and more manageable code can also have a positive impact on national defense and security as legacy systems are rewritten and replaced with faster, more efficient code that has a wider pool of support resources.
- By streamlining the process of converting legacy code into modern language, more skilled programmers will be available to maintain and update the translated code as well as create new code.
- I expect many CodeNet research papers to be published in the next six months. I believe CodeNet will create a high level of academic interest.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.
Moor Insights & Strategy, like all research and analyst firms, provides or has provided paid research, analysis, advising, or consulting to many high-tech companies in the industry, including 8×8, Advanced Micro Devices, Amazon, Applied Micro, ARM, Aruba Networks, AT&T, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, Calix, Cisco Systems, Clear Software, Cloudera, Clumio, Cognitive Systems, CompuCom, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Digital Optics, Dreamchain, Echelon, Ericsson, Extreme Networks, Flex, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Google (Nest-Revolve), Google Cloud, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, IBM, Ion VR, Inseego, Infosys, Intel, Interdigital, Jabil Circuit, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, MapBox, Marvell, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Mesophere, Microsoft, Mojo Networks, National Instruments, NetApp, Nightwatch, NOKIA (Alcatel-Lucent), Nortek, Novumind, NVIDIA, Nuvia, ON Semiconductor, ONUG, OpenStack Foundation, Oracle, Poly, Panasas, Peraso, Pexip, Pixelworks, Plume Design, Poly, Portworx, Pure Storage, Qualcomm, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Residio, Samsung Electronics, SAP, SAS, Scale Computing, Schneider Electric, Silver Peak, SONY, Springpath, Spirent, Splunk, Sprint, Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, TE Connectivity, TensTorrent, Tobii Technology, T-Mobile, Twitter, Unity Technologies, UiPath, Verizon Communications, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zebra, Zededa, and Zoho which may be cited in blogs and research.