How To Use AWK to Manipulate Text in Linux

Source: DigitalOcean

By Justin Ellingwood, Brian Hogan and Manikandan Kurup

Linux utilities often follow the Unix philosophy of design. Tools are encouraged to be small, use plain text files for input and output, and operate in a modular manner. Because of this legacy, Linux provides powerful text processing tools like sed and awk. awk is both a programming language and a text processor, which makes it especially useful for working with structured, line-oriented data like logs, tables, and simple delimited files.

In this guide, you will start with the basic pattern-and-action syntax, learn how it compares to sed for common text-processing tasks, and then build up to field-based filtering, internal variables, and formatted output. From there, you will use associative arrays to count, group, and aggregate values in one pass, combine awk with other commands in pipelines, and embed awk in shell scripts for repeatable automation. You will also walk through real-world parsing examples and end with FAQs that address common questions about awk.

Both awk and sed are standard Unix text-processing tools that are often mentioned together because they operate on text streams. However, they are designed with different goals in mind, and choosing the right one depends on the type of task you are trying to accomplish. At a high level, sed is a stream editor that excels at making simple, line-oriented transformations, while awk is a pattern-scanning and data-processing language that is better suited for working with structured or column-based data.

You should use sed when your task involves straightforward transformations applied to lines of text. It is especially useful when you want to modify text quickly without needing to understand its internal structure. For example, if you want to replace all occurrences of the word "error" with "warning" in a file, sed provides a concise and efficient one-line substitution. In general, sed is a good choice for quick, pattern-based edits such as substitutions, deletions, and insertions. Because sed operates on one line at a time and does not inherently understand fields or columns, it is less suitable for tasks that involve structured data.

You should use awk when your task involves extracting, analyzing, or transforming structured data. Unlike sed, awk understands the concept of fields and can easily work with columns in a file. For example, if you want to print only the first and third columns from a file, awk makes this straightforward with a single print statement. awk is particularly useful when you need to extract specific columns, filter records by condition, or compute values across lines. Because awk includes programming constructs such as variables, conditions, and loops, it can handle more complex data-processing tasks than sed. If you are unsure which tool to use, a simple rule of thumb helps: reach for sed when you are editing lines, and reach for awk when you are working with fields or need logic beyond substitution.

Consider a file where each line contains multiple space-separated fields, and you want to extract only the first column. With awk, this task is simple and intuitive, whereas achieving the same result with sed requires a more complex regular expression; a sketch of both commands follows below. While both commands produce the same output, the awk version is easier to read and better reflects the structure of the data. This highlights a key distinction: awk is designed for working with fields, whereas sed operates purely on text patterns.

Although sed and awk are both powerful tools for processing text, they are optimized for different use cases. sed is ideal for quick and simple text transformations, while awk is better suited for structured data processing and more complex logic. Understanding this distinction will help you choose the most efficient tool for your specific task.
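The exact commands from the original comparison are not reproduced here, but a minimal sketch of the first-column example might look like this, assuming a whitespace-delimited file named data.txt:

```bash
# awk understands fields, so printing the first column is direct
awk '{print $1}' data.txt

# sed only sees text patterns, so you strip everything from the first
# whitespace character to the end of the line instead
sed 's/[[:space:]].*//' data.txt
```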
The awk command is included by default in all modern Linux systems, so you do not need to install it to begin using it.

awk is most useful when handling text files that are formatted in a predictable way. For instance, it is excellent at parsing and manipulating tabular data. It operates on a line-by-line basis and iterates through the entire file. By default, it uses whitespace (spaces, tabs, etc.) to separate fields. Luckily, many configuration files on your Linux system use this format.

The basic format of an awk command is a search pattern followed by an action block, applied to a file. You can omit either the search portion or the action portion from any awk command. By default, the action taken if the "action" portion is not given is "print". This simply prints all lines that match. If the search portion is not given, awk performs the action listed on each line. If both are given, awk uses the search portion to decide if the current line reflects the pattern, and then performs the actions on matches.

In its simplest form, you can use awk like cat to print all lines of a text file out to the screen. Create a favorite_food.txt file that lists the favorite foods of a group of friends, then run awk on it with print as the only action. You'll see the entire file printed to the screen. This isn't very useful on its own, so try out awk's search filtering capabilities by searching through the file for the text "sand". awk now only prints the lines that have the characters "sand" in them.

Using regular expressions, you can target specific parts of the text. To display only the line that starts with the letters "sand", use the regular expression ^sand; this time, only one line is displayed. Similarly, you can use the action section to specify which pieces of information you want to print, for instance only the first column. You can reference every column (as delimited by whitespace) by variables associated with their column number. For example, the first column is $1, the second is $2, and you can reference the entire line with $0.

The awk command uses some internal variables to assign certain pieces of information as it processes a file. The most important of these are FS and OFS (the input and output field separators), RS and ORS (the input and output record separators), NR (the number of the current record, usually the line number), NF (the number of fields in the current record), and FILENAME (the name of the input file). You can change the values of these variables at will to match the needs of your files. Usually you do this during the initialization phase of your processing. For example, to print each line along with its line number, print NR in front of $0; to display how many fields each line contains, print NF instead.

This brings us to another important concept. The awk syntax is slightly more complex than what you've used so far. There are also optional BEGIN and END blocks that can contain commands to execute before and after the file processing, respectively. This expands the syntax to an optional BEGIN block, the pattern-and-action rules, and an optional END block (see the sketch after this section). The BEGIN and END keywords are specific sets of conditions, just like the search parameters. They match before and after the document has been processed. This means that you can change some of the internal variables in the BEGIN section. For instance, the /etc/passwd file is delimited with colons (:) instead of whitespace. To print out the first column of this file, set FS to ":" in a BEGIN block (or pass -F ':' on the command line) and print $1.

You can use the BEGIN and END blocks to print information about the fields you are printing. For example, you can transform the data from the file into a table, nicely spaced with tabs using \t, by printing a header in the BEGIN block and a footer in the END block. As you can see, you can format things quite nicely by taking advantage of some of awk's features. Each of the expanded sections is optional.
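The sample file and the exact commands from this section are not reproduced above, so the following is only a sketch of the expanded syntax and the tab-separated table idea, assuming favorite_food.txt holds a name and a food on each whitespace-delimited line:

```bash
# Expanded shape of an awk program:
#   awk 'BEGIN { setup } /pattern/ { action } END { teardown }' file

# Print a header before processing, a tab-separated row for each line,
# and a closing marker once the whole file has been read
awk 'BEGIN { printf "Name\tFood\n" }
     { print $1 "\t" $2 }
     END { print "End of report" }' favorite_food.txt
```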
In fact, the main action section itself is optional if another section is defined. For example, you can run a program that contains only BEGIN and END blocks, with no main action at all, and it will still produce output.

Now let's look at how to search for text within fields of the output. In one of the previous examples, you printed the line in the favorite_food.txt file that began with "sand". This was easy because you were looking for the beginning of the entire line. What if you wanted to find out if a search pattern matched at the beginning of a field instead? Create a new version of the favorite_food.txt file that adds an item number in front of each person's food.

If you want to find all foods from this file that begin with "sa", you might begin by searching for the pattern "sa" on its own, which shows all lines that contain "sa" anywhere. Here, you are matching any instance of "sa" in the word. This ends up including things like "wasabi", which has the pattern in the middle, or "sandy", which is not in the column you want. In this case, you are only interested in words beginning with "sa" in the second column. You can tell awk to only match at the beginning of the second column by using the expression $2 ~ /^sa/. The $2 ~ part specifies that awk should only pay attention to the second column.

You can just as easily search for things that do not match by including the "!" character before the tilde (~). Using !~ returns all lines that do not have a food that starts with "sa". If you want to find lines where the food does not start with "sa" and the item number is less than 5, you can use a compound expression. This introduces a few new concepts. The first is the ability to add additional requirements for the line to match by using the && operator. Using this, you can combine an arbitrary number of conditions for the line to match. In this case, you are using this operator to add a check that the value of the first column is less than 5.

You can use awk to process files, but you can also work with the output of other programs. So far, you have used awk to filter and print data. One of the features that makes awk especially useful for real-world text processing is its support for associative arrays, which let you store values while you scan through a file. Unlike traditional arrays that use numeric indices, associative arrays in awk use string keys. This makes them a good fit for tasks like counting, grouping, and summarizing records, because you can use a field value (like a username or department) as the key.

A common use case for associative arrays is counting how many times a value appears in a file. In this pattern, you increment a counter for each key as you read each line, and then print the final counts in an END block. For example, given a file called fruits.txt with one fruit name per line, you can count how many times each fruit appears by incrementing count[$1] on every line and looping over the array in an END block. Depending on your awk implementation, the for (item in count) loop may print keys in an arbitrary order. If you need sorted output, you can pipe the results to sort.

Associative arrays can also be used to group related data. Instead of storing a numeric count, you can build up a string (or other value) per key as you process each record. Suppose you have a file employees.txt; you can group employees by department by appending each name to the entry stored under its department key, as in the sketch below. With a small check for the first entry, this approach avoids a leading space for the first employee in each department, which makes the output easier to read. You can also use associative arrays to calculate totals.
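The employees.txt file and the exact grouping command are not reproduced above, so here is a minimal sketch of the idea, assuming each line contains an employee name in the first field and a department in the second:

```bash
# Append each employee to the string stored under their department key,
# adding a separating space only when the entry already has a name in it
awk '{
    if (dept[$2] == "")
        dept[$2] = $1
    else
        dept[$2] = dept[$2] " " $1
}
END {
    for (d in dept)
        print d ":", dept[d]
}' employees.txt
```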
This is a common pattern when your file contains a label in one field and a numeric value in another field. For example, with a file sales.txt that lists a salesperson and an amount on each line, you can calculate total sales per person by adding the second field into an array entry keyed by the first field and printing the totals in an END block. Associative arrays can also help you identify maximum or minimum values. A typical workflow is to build an aggregate array first (such as totals per person), and then scan that array to find the largest value. This allows you to compute insights directly within awk, without needing external tools.

Associative arrays transform awk from a simple text-filtering tool into a lightweight data-processing engine. With associative arrays, you can accumulate state as you read a file and produce summary output at the end. These capabilities are especially useful when working with logs, CSV files, or system-generated data.

You can use the awk command to parse the output of other programs rather than specifying a filename. For example, you can use awk to parse out the IPv4 address from the ip command. The ip a command displays the IP address, broadcast address, and other information about all the network interfaces on your machine. To display the information for a single interface called eth0, pass the interface name to the command (ip a show eth0). You can then pipe that output into awk to target the inet line and print out just the IP address. The -F flag tells awk to delimit by forward slashes or spaces using the regular expression [\/ ]+. This splits the line inet 172.17.0.11/16 into separate fields. The IP address is in the third field because the spaces at the start of the line also count as a field, since you delimited by spaces as well as slashes. Note that awk treated consecutive spaces as a single space in this case. The output shows just the IP address, 172.17.0.11. You'll find many places where you can use awk to search or parse the output of other commands.

So far, you have used awk as a standalone command. In practice, awk becomes even more useful when you embed it in shell scripts to automate repetitive data-processing tasks. By combining awk with Bash, you can build lightweight automation workflows that parse logs, extract metrics, and generate reports without adding extra dependencies.

You can include awk commands directly inside a shell script just like any other command. This approach is useful when you want to reuse the same extraction logic in multiple places or run it on a schedule. For example, you can create a script called extract_users.sh, make it executable, and run it. The script extracts and prints all usernames from the /etc/passwd file. While simple, it demonstrates how awk can be integrated into reusable scripts. It also shows a common pattern you will use often: -F ":" sets the field separator, and print $1 prints the first field from each line.

You can pass variables from your shell script into awk using the -v option. This is useful when you want to make your scripts dynamic, because it allows you to control awk behavior from your script logic. As a best practice, it is generally safer to pass shell variables into awk with -v than to try to concatenate them directly into the awk program.

Shell scripts are often used to process multiple files. You can combine loops with awk to handle batches of data; a sketch of such a loop follows below. The script iterates over all .log files in the directory, uses awk to extract lines containing "ERROR", and prints the matching lines for each file.
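The loop script itself is not shown above, so here is a minimal sketch of the pattern it describes, assuming the .log files live in the current directory:

```bash
#!/bin/bash
# For each .log file in the current directory, print a header with the
# file name and then every line that contains the string ERROR
for file in *.log; do
    echo "== $file =="
    awk '/ERROR/ {print}' "$file"
done
```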
If you are using Bash and want the loop to skip cleanly when no files match *.log, you can enable nullglob (for example, shopt -s nullglob) so that the pattern expands to nothing instead of the literal string *.log.

A common real-world use case is analyzing logs to extract useful information. Suppose you want to count how many errors occurred in a log file. A short script can scan the log file for lines containing "ERROR", maintain a running count as it reads the file, and print the total in an END block. Printing count+0 ensures that the script prints 0 even if no lines match. You can extend this pattern further as your reporting needs grow.

awk is often used in pipelines with other Unix tools to build flexible workflows. A typical pipeline filters HTTP 404 errors, extracts the IP address from the first field, and counts how many times each address appears. Although awk can often replace parts of such a pipeline, combining tools can sometimes make scripts easier to understand and maintain.

Using awk inside shell scripts allows you to reuse the same extraction logic in multiple places, control it with shell variables, and run it on a schedule alongside other commands. Because awk is available on almost all Unix-like systems, scripts that rely on it are also highly portable.

While the previous examples demonstrate how awk works, its real strength lies in solving practical problems. System administrators and developers frequently use awk to parse logs, extract structured data, and generate quick reports directly from the command line. The following examples illustrate how awk can be applied to common real-world scenarios.

Web server logs are one of the most common sources of structured text data. In many default Apache configurations, access logs are written in the "combined" format, which places the client IP address at the start of the line and the HTTP status code later in the record. In a typical Apache access log entry, you can extract the IP address and HTTP status code from each request by printing $1 and $9 for each line. This works because $1 represents the client IP address and $9 represents the HTTP status code in the default access log layout. To count how many requests resulted in a 404 error, increment a counter for lines whose status field equals 404 and print count+0 in an END block; the count+0 ensures that the output is 0 even when no lines match.

Although awk defaults to whitespace as a field separator, it can easily handle CSV files by changing the delimiter with -F ','. For a file data.csv with columns such as name, age, and city, you can print only the names and cities by selecting those fields. This assumes that the CSV file does not contain quoted fields with embedded commas. If your data includes quoted commas, you will need a CSV-aware parser. You can also filter records where age is greater than 30 by comparing the relevant field numerically. This demonstrates how awk can be used to query structured datasets without needing a database.

Many Linux system files follow predictable formats, making them ideal for processing with awk. For example, you can list all users with a UID greater than 1000 from /etc/passwd by splitting on colons and checking the third field, and you can count how many such users exist with the same counting pattern, which prints 0 if no user accounts match the condition.

You can combine multiple awk features to generate quick summaries. For example, you can count how many requests were made by each IP address in an access log; a sketch of this report appears below. This produces a frequency distribution of requests per client. Keep in mind that for (ip in count) may print results in an arbitrary order, so you can pipe the output to sort if you need consistent ordering. By applying awk in these contexts, you can automate tasks that would otherwise require more complex tools or manual effort.
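The exact command for this report is not reproduced above; a minimal sketch, assuming a combined-format log file named access.log, might look like this:

```bash
# Count requests per client IP (the first field in the combined log
# format) and sort the report with the busiest clients first
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -rn
```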
The remaining sections answer some frequently asked questions about awk.

awk is the POSIX-defined language and command interface. gawk (GNU Awk) is a common default on many Linux distributions and includes extra features beyond POSIX. mawk is another implementation that often prioritizes speed and low overhead, but it may not support the same GNU-specific extensions as gawk.

You can set a custom field separator with the -F option, which tells awk how to split each input line into fields. For example, awk -F':' '{print $1}' /etc/passwd uses a colon delimiter and prints the first field (the username) from each line.

sed is a stream editor that is best for line-oriented transformations such as substitutions, deletions, and inserts. awk is a pattern-scanning language that is designed for field-based processing, conditional logic, and simple calculations. In general, awk is a better fit for column-based data and reporting, while sed is often more direct for straightforward text substitutions.

To count unique values, you can use an associative array as a counter, incrementing the value for each key as you read the file, then printing the results in an END block. For example, awk '{count[$1]++} END {for (w in count) print w, count[w]}' file.txt counts unique values in the first field.

awk can also handle multi-line records. If you set the record separator RS to an empty string (RS=""), awk treats blank-line-separated blocks as a single record. This is useful when you are processing input that is structured in paragraphs or blocks rather than one record per line.

You can pass shell variables into awk with the -v option, which assigns a value before processing begins. For example, threshold=100; awk -v t="$threshold" '$3 > t {print $0}' file.txt makes the value available inside awk as the variable t.

awk is also well suited to large files. It processes input one record at a time and does not need to load the entire file into memory, which makes it efficient for large datasets. For very complex processing or multi-stage pipelines, other tools (such as Python) may be easier to maintain, but awk remains a strong option for fast extraction and aggregation.

Finally, you do not have to write awk programs as one-liners. You can put an awk program in a plain text file (for example, script.awk) and run it with awk -f script.awk input.txt. This approach is easier to read and maintain than a long one-liner, especially when your program uses multiple rules, functions, or blocks.

By now, you should have a basic understanding of how you can use the awk command to filter, format, and selectively print text from files and command output. You have also seen how awk goes beyond simple printing by using internal variables and BEGIN/END blocks for structured output, associative arrays for one-pass aggregation, and shell scripting patterns for repeatable automation. As you apply these techniques to real-world inputs like logs and delimited data, you will be able to choose between awk and sed based on whether your task is field-oriented data processing or line-oriented editing. To learn more about awk, you can read the free public-domain book by its creators, which goes into much more detail.