Tools: Useful Linux Commands For Data Engineers

Source: Dev.to

## Table of Contents

- File and Directory Operations
- File System and Storage Management
- File Attributes and Permissions
- User and Group Management
- Networking and Security
- File Compression and Encryption
- Editors
- File transfer commands
- Conclusion
- References

Linux servers are the most common compute environment for large-scale data systems. Mastering Linux helps data engineers manage data pipelines and process data efficiently. This article is a deep dive into the Linux commands most relevant to data engineering tasks.

## File and Directory Operations

These commands navigate and manipulate the Linux file system. They resemble CRUD operations applied to files and directories. Data engineers can use them to inspect logs and raw data dumps.

## File System and Storage Management

The file system partitions, organizes and stores data on disk.

```bash
# view all storage disks and partitions recognized by the system
cat /proc/partitions
# create, manage and delete partitions on /dev/sda
sudo fdisk /dev/sda
# create an XFS filesystem on partition /dev/sdb1
sudo mkfs.xfs /dev/sdb1
# attach the storage space to the myfolder directory, which acts as the mount point
sudo mount /dev/sdb1 /myfolder
# verify that the storage space exists (-h prints human-readable sizes)
df -h /myfolder
# initialize a physical volume for use with the Logical Volume Manager
sudo pvcreate /dev/sdb1
```

Because data engineers work on data-intensive systems, they need to handle partitions and filesystems carefully to avoid losing data. Back up your data before running these commands.

## File Attributes and Permissions

Linux file permissions are a security mechanism that determines who can read, write or execute files on a system. With these commands, a data engineer can restrict access to sensitive data and protect files from accidental modification by setting strict permissions.
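As a minimal sketch of restricting access to a sensitive file, the snippet below locks a file down to its owner and reads the mode back. The file name `credentials.env` and the throwaway directory are hypothetical, and `stat -c` assumes GNU coreutils, which most Linux distributions ship.

```shell
# Work in a throwaway directory so nothing on the system is touched.
workdir="$(mktemp -d)"
secret_file="$workdir/credentials.env"   # hypothetical file name

# Create the file, then allow read/write for the owner only.
touch "$secret_file"
chmod 600 "$secret_file"

# Read the permission bits back in octal to confirm the change.
mode="$(stat -c '%a' "$secret_file")"
echo "mode of $(basename "$secret_file"): $mode"
```

Checking the mode after setting it is a cheap way to catch a typo in the octal value before the file is used by a pipeline.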
## User and Group Management

Use the `useradd` command to create a new user. View user information with `id`. To add a password for the new user, run `passwd`. All existing users are listed in the `/etc/passwd` file. Switch to a different user with `su`.

To maintain least-privilege access to resources, data engineers need to implement user and group management properly. Use regular user accounts instead of the root user to minimize access.

## Networking and Security

Four components work together to block malicious access on Linux: firewalls filter traffic, encryption protects data in transit, authentication verifies user identities, and monitoring analyzes traffic.

With UFW, a firewall interface for iptables, you can check the currently registered application profiles with `ufw app list` and enable a profile with `ufw allow` followed by the profile name. A rule can also specify a port instead of a profile name. With the rules applied, enable the firewall with `ufw enable`. Use `sudo ss -tulnp`, piped through `grep` to filter the output, to show listening ports, and `ping` to test connectivity.

## File Compression and Encryption

Compressing a file reduces the storage it needs and speeds up data transfer. Use `gzip` to compress a single file. Archiving combines multiple files into a single archive file: create a simple archive with `tar -cf` and extract it with `tar -xf`. Encryption ensures that you can safely transmit data over the internet. Use GPG or `openssl` for encryption.

Data engineers can compress raw log files to save space, then encrypt the archives before uploading them to cloud storage.

## Editors

The most popular text editors on Linux are Nano and Vim. Both `nano file.txt` and `vi file.txt` open file.txt; if the file does not exist, it is created when you first save.

## File transfer commands

These commands handle file transfer either locally or remotely. With them, a data engineer can efficiently sync files from a local server to a remote server.

## Conclusion

This article covers several categories of Linux commands and how data engineers use them.
## References

- How to Manage Linux Storage
- Linux file permissions explained
- Differences between archiving and compression

Commands for File and Directory Operations:

```bash
# returns the current working directory
pwd
# lists directory contents including files and subdirectories
ls
# includes hidden files in the list
ls -a
# changes the current working directory
cd /path/to/directory
# deletes a file (add -r to delete a directory and its contents)
rm file.txt
# creates an empty file
touch file.txt
# returns the contents of a file
cat file.txt
# quickly view the first or last lines of a file without opening the entire file
head file.txt
tail file.txt
```
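To illustrate how `head` and `tail` help when inspecting logs, here is a small sketch that builds a sample log and looks at both ends of it. The file name `pipeline.log` and its contents are made up for the example.

```shell
# Build a small sample log in a throwaway directory (hypothetical data).
workdir="$(mktemp -d)"
log_file="$workdir/pipeline.log"
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$log_file"

# Peek at both ends of the file without reading all of it.
first_lines="$(head -n 3 "$log_file")"
last_line="$(tail -n 1 "$log_file")"

# Count the lines to get a sense of the file's size.
line_count="$(wc -l < "$log_file")"

echo "$first_lines"
echo "$last_line"
echo "lines: $line_count"

# Clean up the sample data.
rm -r "$workdir"
```

On a real multi-gigabyte dump the same three commands return almost immediately, which is why they are preferred over opening the file in an editor.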
How to read a device name such as /dev/sda1:

```
/dev ~ directory containing device files, including storage disks
sd   ~ the storage disk
a    ~ the first disk
1    ~ first partition on the disk
```

Commands for File Attributes and Permissions:

```bash
# returns file metadata including file permissions
ls -l
# makes a file immutable
sudo chattr +i myapp.py
# lists file attributes
lsattr myapp.py
# adds or removes execute permission
chmod +x myapp.py
chmod -x myapp.py
# sets read/write for the owner, read-only for group and others
chmod 644 myapp.py
# changes file ownership
sudo chown username:groupname myapp.py
```

Commands for User and Group Management:

```bash
# create a new user
sudo useradd username
# create a new group
sudo groupadd groupname
# alternatively, -m creates the user together with a home directory
sudo useradd -m username
# add an existing user to a supplementary group
sudo usermod -aG groupname username
```
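Provisioning scripts often check whether an account exists before calling `useradd`. The sketch below shows one way to do that without root privileges; the helper `user_exists` and the account name `etl_user` are hypothetical names chosen for illustration.

```shell
# Return success if the account exists, failure otherwise.
# `id -u NAME` exits non-zero for unknown users; its output is discarded.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# "etl_user" is a hypothetical account name used only for illustration.
candidate="etl_user"

if user_exists "$candidate"; then
  echo "$candidate already exists, skipping useradd"
else
  echo "$candidate not found, would run: sudo useradd -m $candidate"
fi
```

The script only prints the command it would run, so it is safe to try on any machine.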
More commands for User and Group Management:

```bash
# view user information: returns UID, GID and group memberships
id username
# set a password for the new user
sudo passwd username
# list all existing users
cat /etc/passwd
# switch to a different user and start that user's login shell
su - username
```

Commands for Networking and Security:

```bash
# list the registered application profiles
sudo ufw app list
# allow an application profile
sudo ufw allow appname
# check if ufw is running
sudo ufw status
# allow ssh
sudo ufw allow 22
# specify port ranges (a protocol is required)
sudo ufw allow 6000:6009/tcp
# allow connections from an IP address
sudo ufw allow from 201.8.139.4
# enable the firewall once the rules are in place
sudo ufw enable
```

Commands for File Compression and Encryption:

```bash
# replaces filename with the compressed filename.gz
gzip filename
```
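Because plain `gzip filename` replaces the original file, it can be useful to compress to a separate file and confirm the round trip first. This is a minimal sketch under that assumption; `events.log` and `restored.log` are hypothetical file names.

```shell
workdir="$(mktemp -d)"
raw_file="$workdir/events.log"     # hypothetical file name
printf 'event %s\n' 1 2 3 > "$raw_file"

# -c writes the compressed stream to stdout, so the original file is kept.
gzip -c "$raw_file" > "$raw_file.gz"

# Decompress to a separate file and compare it byte for byte with the original.
gunzip -c "$raw_file.gz" > "$workdir/restored.log"
if cmp -s "$raw_file" "$workdir/restored.log"; then
  roundtrip="ok"
else
  roundtrip="mismatch"
fi
echo "round trip: $roundtrip"
```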
More commands for File Compression and Encryption:

```bash
# -cf creates and names the archive
tar -cf myarchive.tar app.py main.py project/
# -xf extracts the archive
tar -xf myarchive.tar
# encrypt a file with a passphrase (produces filename.txt.gpg)
gpg -c filename.txt
# AES encryption
openssl enc -aes-256-cbc -in data.txt -out data.enc
```

Commands for Editors:

```bash
nano file.txt
vi file.txt
```

File transfer commands:

```bash
# Secure File Transfer Protocol ~ used for transferring files (ip_address can also be a hostname)
sftp username@ip_address
# copy files locally (add -r to copy directories)
cp file1.txt /backup/file1.txt
# move or rename files
mv filename renamedfile
# transfer files remotely over ssh
scp data.csv username@ip_address:/data
# synchronize directories remotely for incremental updates
# (the remote path after the colon is a placeholder)
rsync -avz /backup/log username@ip_address:/path/to/destination
```
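The compress-then-transfer workflow can be sketched end to end on a single machine. In the example below a local directory named `remote` stands in for the server and `cp` stands in for `scp` or `rsync`; all file names are hypothetical, and `sha256sum` assumes GNU coreutils.

```shell
workdir="$(mktemp -d)"
mkdir -p "$workdir/logs" "$workdir/remote"   # "remote" is a local stand-in for a server
printf 'row %s\n' 1 2 3 > "$workdir/logs/a.log"
printf 'row %s\n' 4 5 6 > "$workdir/logs/b.log"

# -czf archives the directory and gzip-compresses it in one step.
tar -czf "$workdir/logs.tar.gz" -C "$workdir" logs

# Record a checksum before the transfer.
sum_before="$(sha256sum "$workdir/logs.tar.gz" | cut -d' ' -f1)"

# Stand-in for scp or rsync: copy the archive to the "remote" directory.
cp "$workdir/logs.tar.gz" "$workdir/remote/"

# Recompute the checksum on the receiving side and compare.
sum_after="$(sha256sum "$workdir/remote/logs.tar.gz" | cut -d' ' -f1)"
if [ "$sum_before" = "$sum_after" ]; then
  transfer_status="verified"
else
  transfer_status="corrupted"
fi
echo "transfer: $transfer_status"

# List the archive contents without extracting it.
archive_listing="$(tar -tzf "$workdir/remote/logs.tar.gz")"
echo "$archive_listing"
```

Comparing checksums on both sides is a common way to confirm that a large archive arrived intact before the local copy is deleted.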
## Prerequisites

- Set up a Linux server environment for testing purposes.
- Be familiar with the command line.